What can word count analysis tell us?

I have been playing with Python lately, writing code without any explicitly stated purpose. I decided to experiment with word count algorithms and build a tool that could analyse the word counts on the front page of a particular news site, where all the gossip flourishes. I thought this could be interesting, not just because experimentation and playfulness are the holy grail of all discoveries, but because I could learn something from the experience.

I started with a couple of libraries, lxml and BeautifulSoup, which unfortunately did not deliver as expected. lxml is not a perfect HTML parsing tool, and BeautifulSoup did not give me what I was looking for either, i.e. the visible text of the web page with no tags and no HTML gibberish. Later I switched to html5lib, which was a bit better but far from perfect, and a bit slower. Since none of these tools gave me the web page as simple text, I had to write some manual string manipulation to do it myself. When working with data that is supposed to tell a story, the priority should always be quality, otherwise the story might be quite misleading.
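To give an idea of what "visible text only" means in practice, here is a minimal sketch using Python's standard `html.parser` module. This is not the code I ended up with. The class and function names are made up for illustration, and the set of skipped tags is an assumption about what counts as invisible.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping tags whose content is not shown on the page."""

    # Assumed list of tags with non-visible content; extend as needed.
    SKIPPED_TAGS = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of an HTML string with tags and scripts removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `visible_text(page_source)` on a downloaded page returns one long string of words, which is the input a word counter needs. Badly written HTML (for example an unclosed `<script>`) can still trip it up, which is exactly the kind of case that needs manual handling.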

After solving some of the technical issues, there was another challenge: the language-specific aspects relevant to a particular website. For example, words such as “the”, “is” and “at”, or verbs that do not really carry any meaning without knowledge of the nouns that surround them. So for this first code iteration I kept a possible noun and verb analysis in mind, but focused on nouns by eliminating verbs that do not tell the story alone. However, since I was analysing two websites, one in English (www.dailymail.co.uk) and one in Lithuanian (www.lrytas.lt), I realised that different languages pose different challenges. For example, in Lithuanian the words “namas”, “namui” and “name” all refer to the same thing, a house (“namas”). The endings change depending on the role the noun plays in the sentence. For example, if you say what the house is, where the house is, or who has the house, the same noun takes a different ending in Lithuanian because of its relation to the verb. Anyway, I’m not going to analyse language aspects much here, except to highlight what the code needs to be aware of; otherwise forms such as “namas” and “namui” would be counted as separate words, which could potentially ruin the analysis.
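The two ideas above, dropping words that carry no meaning on their own and mapping inflected forms to one base form, can be sketched roughly like this. The word lists are tiny, made-up examples rather than the dictionaries I actually used, and `count_words` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer
# and specific to each language.
STOP_WORDS = {"the", "is", "at", "a", "in", "of", "and"}

# Illustrative mapping of inflected Lithuanian forms to a base form.
BASE_FORMS = {"namui": "namas", "name": "namas"}


def count_words(text, stop_words=STOP_WORDS, base_forms=BASE_FORMS):
    """Count words after lower-casing, mapping to base forms and dropping stop words."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    normalised = (base_forms.get(word, word) for word in words)
    return Counter(word for word in normalised if word not in stop_words)
```

With these lists, “namas”, “namui” and “name” all end up under the single key “namas”. The mappings have to be kept per language: “name” is also an ordinary English word, so the Lithuanian mapping should not be applied to English text.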

Once I had set up the minimum required dictionaries and mappings for English and Lithuanian, I developed an algorithm which lets you enter a website and see specific statistics about its word count. For data visualisation I used the Bokeh library, which is a really great ongoing development, probably opening some new horizons for open source data visualisation applications.
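For anyone who has not used Bokeh, a bar chart of the most frequent words can be sketched as below. This is only an assumption of how such a chart could be drawn, not my original plotting code; `top_words` and `plot_word_counts` are made-up names, while `figure`, `vbar` and `show` come from `bokeh.plotting`.

```python
from collections import Counter


def top_words(counts, n=20):
    """Return two parallel lists (words, frequencies) for the n most common words."""
    most_common = Counter(counts).most_common(n)
    words = [word for word, _ in most_common]
    freqs = [freq for _, freq in most_common]
    return words, freqs


def plot_word_counts(counts, title, n=20):
    """Draw a bar chart of the n most frequent words (requires Bokeh)."""
    # Imported here so top_words() stays usable without Bokeh installed.
    from bokeh.plotting import figure, show

    words, freqs = top_words(counts, n)
    plot = figure(x_range=words, title=title)  # categorical x axis
    plot.vbar(x=words, top=freqs, width=0.8)
    plot.xaxis.major_label_orientation = 1.0  # radians, tilts the labels
    show(plot)
```

`plot_word_counts(counts, "www.lrytas.lt")` would then open the chart in a browser, where `counts` is any mapping of word to frequency.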

OK, so below are the results for www.lrytas.lt:
[Figure: word count results for www.lrytas.lt]

And for www.dailymail.co.uk:
[Figure: word count results for www.dailymail.co.uk]

In my next blog post I will try to continue answering the question of what this data can tell us. We can already see a trend in the focus of these two distinct media (lrytas focuses on Lithuania and Vilnius, generally concepts of community structures; dailymail focuses on celebrity, star, dress etc., individuals with status), but there are still technical questions to resolve, primarily around verbs and nouns. In this example numerous verbs were removed based on my subjective judgement. I will probably aim to separate all nouns from verbs and then try to analyse the data in terms of what follows or precedes each noun. This has to be done with knowledge of the language, so there is still research to be done.

Lessons learned so far:

Distinct languages have different features that need to be taken into account (e.g. the issue with Lithuanian word endings).

Not all words tell a story on their own (e.g. verbs).

There are still concerns with Python HTML-to-text libraries. They are not perfect, and it still takes a bit of coding to get close to the ideal state, i.e. a library that can deal with everything you throw at it, regardless of whether the HTML was written properly.

Plugging language databases into the algorithm might also help with mapping words that share the same meaning.

Data sampling issues. One sample might not capture the full story, since news is updated throughout the day. Sampling stories throughout the week at specific times of day and aggregating the results could potentially yield something different.
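The sampling idea in the last point could be sketched like this, assuming each sample is simply a word-to-frequency mapping collected at one point in time. `aggregate_samples` is a hypothetical helper, not something I have built yet.

```python
from collections import Counter


def aggregate_samples(samples):
    """Combine several word-count samples taken at different times.

    Returns (totals, presence): the summed frequency of every word, and
    the number of samples in which each word appeared at least once.
    """
    totals = Counter()
    presence = Counter()
    for sample in samples:
        totals.update(sample)           # adds the frequencies
        presence.update(sample.keys())  # counts each word once per sample
    return totals, presence
```

A word with a high total but low presence was probably tied to a single story, while a word present in nearly every sample says more about the steady focus of the site.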

Cleaning “your” private data: gmail inbox

So you have decided it's time to clean “your” Gmail inbox because it has become cluttered with long-forgotten or marketing emails which just obscure the real messages from real people. Or maybe you are concerned about your privacy and would like to reclaim some of your data. Whatever the reason, Google has a function which lets you download your full inbox as a zip archive. Later this file can be opened quite easily using, for example, the open source Mozilla Thunderbird email client.

Below is the Google link to raise the request:
https://takeout.google.com/settings/takeout

Make sure you select Gmail in the list and deselect all the other services.

Once you have downloaded all of your data (it might take a while for Google to prepare your archive) and moved it to a safe place, you can go on and delete all your emails from the Gmail inbox. To do that, find the folder called “All Mail” and tick the checkbox in the left corner, click the link that pops up to select all emails in this folder, and click delete. After that, do the same for the bin and it's all clean.
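Before deleting anything, it may be worth checking that the downloaded archive is actually readable. A minimal sketch, assuming the unzipped archive contains your mail as an mbox file (the file path is whatever your own export contains; none is assumed here). It uses Python's standard `mailbox` module, and `summarise_mbox` is a made-up name.

```python
import mailbox
from collections import Counter


def summarise_mbox(path):
    """Return (number of messages, Counter of 'From' headers) for an mbox file."""
    # create=False makes a wrong path raise an error instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        senders = Counter(str(msg.get("From", "(unknown)")) for msg in box)
    finally:
        box.close()
    return sum(senders.values()), senders
```

If the message count looks far too low compared with what Gmail shows, the export is probably incomplete and it is better not to delete anything yet.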

Now, you might be wondering why I put quotes around the word “your”. Well, the promise of the Gmail service is very straightforward: you get a free, reliable service, i.e. email storage and an email address, in exchange for your privacy, i.e. you have to share your data for the purposes of marketing/advertising.

If you are concerned about privacy and would like a more robust solution, try Protonmail, which uses public key/private key encryption and is based in Switzerland. Public key/private key encryption makes sure that only you and the receiver at the other end (using the same encryption method) are able to read the email being exchanged. The service uses a well-known business model where premium accounts with extra storage pay for the service of all the free accounts, and it also accepts donations. I must warn you though: if you decide to switch from Gmail to Protonmail, the trade-off will be convenience and possibly availability of the service, due to possible attacks from governments.

Python html data scraping (indeed.com example)

I’m surprised how great Python is and what you can do with this programming language. It's not just useful for data analytics, data science or statistics, but also for various other data-related activities, such as pulling data from a website for a variety of analyses. In this example I targeted the Indeed job board website, a very nicely written job board application. The purpose of the code is to demonstrate how Python can be used to automatically get specific data from this website, in this case a list of job titles and company names. The data is first fetched with an HTML request using a specific library and stored in an object. It is then converted to one long string and analysed in order to separate what is needed from what is not. This all happens over a couple of iterations. The library used to get the HTML page (lxml) has some functionality for targeting specific tags via XPath, which unfortunately didn’t work as expected, so to save time I simply used string manipulation functions and a bit of HTML string analysis to achieve the desired result. The final product of this code is a CSV file with a long list of rows (depending on the parameters), each with a job and a company column. An empty database.csv file has to be created locally for this to work. The code is below and the IPython/Jupyter notebook is attached:

[Images 1 and 2: the code]
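For readers who cannot see the images, the general flow (take the page as one long string, pick out the job and company pieces, write them to a CSV file) can be sketched as follows. This is a simplified stand-in, not the code from the notebook. The HTML snippet and its class names are invented for illustration and are not Indeed's real markup, and fetching the page is left out.

```python
import csv
import html
import re

# Hypothetical markup: these tag and class names are invented for the
# example and do not come from the real website.
SAMPLE_HTML = """
<div class="result"><h2 class="jobtitle">Data Analyst</h2>
  <span class="company">Acme Ltd</span></div>
<div class="result"><h2 class="jobtitle">Python Developer</h2>
  <span class="company">Example &amp; Co</span></div>
"""

JOB_PATTERN = re.compile(
    r'<h2 class="jobtitle">(?P<job>.*?)</h2>\s*'
    r'<span class="company">(?P<company>.*?)</span>',
    re.DOTALL,
)


def extract_jobs(page):
    """Return a list of (job title, company) tuples found in an HTML string."""
    return [
        (html.unescape(m["job"].strip()), html.unescape(m["company"].strip()))
        for m in JOB_PATTERN.finditer(page)
    ]


def write_jobs(rows, out):
    """Write the rows to an open text file object in CSV format."""
    writer = csv.writer(out)
    writer.writerow(["job", "company"])
    writer.writerows(rows)
```

To save the result, open the target file with `open("database.csv", "w", newline="", encoding="utf-8")` and pass the file object to `write_jobs` together with the output of `extract_jobs`. Matching on raw strings like this breaks as soon as the site changes its markup, which is the price of skipping a proper parser.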

Below is all the data in the CSV file (LibreOffice Calc/MS Excel):
[Image 3: the CSV data opened in a spreadsheet]

The next step could be getting additional data parameters like salary, posting date etc. This could potentially produce interesting data discovery insights. Linking to other sites like Glassdoor could also help to get more value. Although it was only an experiment, the code could potentially help to build a job board aggregation system, fetching data from various job boards and presenting it in one place. The challenge would be to analyse the code of each specific website and tailor the scraper to it, so that the data comes out as clean and accurate as possible.