I have been playing with Python lately, writing code without any explicitly stated purpose. I decided to experiment with word count algorithms and build a tool that could analyse the word counts on the front page of a particular online news outlet, where all the gossip flourishes. I thought this could be interesting, not just because experimentation and playfulness are the holy grail of all discoveries, but because I could learn something from the experience.
I started with a couple of libraries, lxml and BeautifulSoup, which unfortunately did not deliver as expected. lxml is not a perfect HTML parsing tool, and BeautifulSoup also did not give me what I was looking for, i.e. the visible text of the web page with no tags and no HTML gibberish. Later I switched to html5lib, which was a bit better but far from perfect, and a bit slower. Since none of these tools gave me the web page as simple text, I had to write some manual string manipulation to do it myself. When working with data that is supposed to tell a story, the priority should always be quality; otherwise the story might be quite misleading.
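To give an idea of what "visible text only" means in practice, here is a minimal sketch using only the standard library's html.parser. It is not the code I ended up with, just an illustration: the class name VisibleTextParser, the helper html_to_text and the set of skipped tags are all assumptions made for this example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring content inside tags that are never rendered."""

    # Assumption: these are the tags whose content we never want to count.
    SKIPPED_TAGS = {"script", "style", "noscript", "title"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while we are inside a skipped tag
        self._chunks = []      # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string with tags, scripts and styles removed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling html_to_text on a downloaded page gives one long string of words separated by spaces. Badly written HTML can still trip this up (an unclosed script tag would swallow the rest of the page), which is exactly the kind of case where manual clean-up is needed.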
After solving some of the technical issues, there was another challenge: language-specific aspects that are relevant to a particular website. For example, words such as “the”, “is” and “at”, or verbs that do not really mean much without knowing the nouns that surround them. So for this first code iteration I kept possible noun and verb analysis in mind, but focused on nouns by eliminating verbs that do not tell the story alone. However, since I was analysing a couple of websites, one in English (www.dailymail.co.uk) and one in Lithuanian (www.lrytas.lt), I realized that different languages pose different challenges. For example, in Lithuanian the words “namas”, “namui” and “name” all refer to the same thing, i.e. a house, or “namas”. The endings of these words change depending on the focus of the sentence. For example, if you say what house, where the house is, or who has the house, in Lithuanian the same noun takes a different ending each time due to its direct relation to the verb. Anyway, I am not going to analyse language aspects much here, except to highlight some that the code needs to be aware of; otherwise forms such as “namas” and “namui” would be counted as separate words, which could potentially ruin the analysis.
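The two ideas above, dropping words that carry no story and folding different endings into one base form, can be sketched roughly like this. The dictionaries here are tiny and hypothetical (STOP_WORDS, BASE_FORMS and the function names are made up for this example); real ones would need to be far larger and built with proper knowledge of each language.

```python
import re

# Hypothetical, deliberately tiny dictionaries, keyed by language code.
STOP_WORDS = {
    "en": {"the", "is", "at"},
    "lt": {"ir", "yra"},  # "and", "is"
}
BASE_FORMS = {
    "en": {"houses": "house"},
    "lt": {"namui": "namas", "name": "namas"},
}


def tokenize(text):
    """Split text into lowercase words, keeping non-ASCII letters such as ą or ė."""
    return re.findall(r"[^\W\d_]+", text.lower())


def normalise(text, language):
    """Drop stop words and map inflected forms to a single base form."""
    stop_words = STOP_WORDS[language]
    base_forms = BASE_FORMS[language]
    words = []
    for word in tokenize(text):
        if word in stop_words:
            continue
        # Fall back to the word itself when no mapping is known.
        words.append(base_forms.get(word, word))
    return words
```

With these dictionaries, normalise("Namas, namui ir name.", "lt") returns "namas" three times instead of three different words. Keeping the mappings separate per language matters, because "name" is an ordinary English word but an inflected form of "namas" in Lithuanian.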
Once I had set up the minimum required dictionaries and mappings for English and Lithuanian, I developed an algorithm that lets you enter a website and see specific statistics about its word counts. For data visualisation I used the bokeh library, which is a really great ongoing development, probably opening some new horizons for open source data visualisation applications.
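As a rough illustration of the statistics step, here is a minimal sketch that takes a list of already cleaned words and returns the totals and the most frequent words. The function name word_statistics and the shape of the returned dictionary are assumptions for this example, not the actual tool.

```python
from collections import Counter


def word_statistics(words, top_n=10):
    """Summarise a list of words: totals plus the top_n most frequent ones."""
    counts = Counter(words)
    total = sum(counts.values())
    # Each entry is (word, count, share of all words).
    top_words = [
        (word, count, count / total)
        for word, count in counts.most_common(top_n)
    ]
    return {
        "total_words": total,
        "unique_words": len(counts),
        "top_words": top_words,
    }
```

The word and count pairs in "top_words" are the kind of input a bar chart needs, so this is the point where a plotting library would take over.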
OK, so below are the results for www.lrytas.lt:
And for www.dailymail.co.uk:
In my next blog post I will try to continue answering the question of what this data can tell us. Although we can already see a trend in the focus of these two distinct media outlets (i.e. lrytas focuses on Lithuania and Vilnius, generally concepts of community structures, while dailymail focuses on celebrity, star, dress etc., individuals with status), there are still technical questions to be resolved, primarily around verbs and nouns. In this example numerous verbs were removed based on my subjective judgement. I will probably aim to separate all nouns from verbs and then try to analyse the data in terms of what follows or precedes a noun. This should be done with knowledge of the language, so there is still research to be done.
So far lessons learned:
Distinct languages have different features that need to be taken into account (e.g. the issue with Lithuanian word endings).
Not all words tell a story on their own (e.g. verbs).
I still have concerns about Python HTML-to-text libraries. They are not perfect, and it still takes a bit of coding to get close to the perfect state, i.e. a library that can deal with everything you throw at it, regardless of whether the HTML was written properly.
Also, perhaps plugging language databases into the algorithm would help with mapping words that have the same meaning.
Data sampling issues. One sample might not capture the full story, since news is updated throughout the day. Perhaps sampling stories at specific times of day throughout the week and aggregating the results would yield something else.
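The sampling idea from the last point could look something like the sketch below. I have not run this on real data yet; the function name aggregate_samples is hypothetical, and it assumes each sample has already been reduced to a Counter of words taken at one point in time.

```python
from collections import Counter


def aggregate_samples(samples):
    """Combine word counts from several snapshots of the same website.

    Returns two counters: the summed counts across all samples, and the
    number of samples in which each word appeared at least once.
    """
    total = Counter()
    presence = Counter()
    for counts in samples:
        total.update(counts)            # add up the occurrences
        presence.update(counts.keys())  # count each word once per sample
    return total, presence
```

The second counter is there because a word that shows up in every sample during the week probably says more about the outlet than a word that spikes once in a single breaking story.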