Cleaning “your” private data: Gmail inbox

So you have decided it’s time to clean “your” Gmail inbox because it has become cluttered with long-forgotten or marketing emails that obscure the real messages from real people. Or maybe you are concerned about your privacy and would like to reclaim some of your data. Whatever the reason, Google has a feature that lets you download your full inbox as a zip archive. The file can later be opened quite easily using, for example, the open source Mozilla Thunderbird email client.

Below is the Google link to raise the request:
https://takeout.google.com/settings/takeout

Make sure you select Gmail in the list and deselect all the other services.


Once you have downloaded all of your data (it might take a while for Google to prepare the archive) and moved it to a safe place, you can go ahead and delete all the emails from your Gmail inbox. To do that, open the folder called “All Mail”, tick the checkbox in the top left corner, click the pop-up link to select every email in the folder and click delete. After that, do the same for the rubbish bin and it’s all clean.
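Before deleting anything, it is worth checking that the archive is actually readable. Takeout exports Gmail as an mbox file, which Python can open with the standard mailbox module; the filename below is an assumption, so adjust it to match whatever your unzipped archive contains:

import mailbox
from itertools import islice

# Sanity-check the Takeout export before deleting anything in Gmail.
# NOTE: the filename is an assumption -- use the actual .mbox file
# found inside your unzipped Takeout archive.
archive = mailbox.mbox("All mail Including Spam and Trash.mbox")

print(f"Messages in archive: {len(archive)}")

# Peek at the first few messages to confirm the export looks sane.
for message in islice(archive, 5):
    print(message["From"], "|", message["Subject"])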

Now, you might be wondering why I put quotes around the word “your”. Well, the promise of the Gmail service is very straightforward: you get a free, reliable service, i.e. email storage and an email address, in exchange for your privacy, i.e. you have to share your data for marketing/advertising purposes.

If you are someone who is concerned about privacy and would like a more robust solution, try Protonmail, which uses public key/private key encryption and is based in Switzerland. Public key/private key encryption ensures that only you and the receiver at the other end (using the same encryption method) are able to read the email being exchanged. The service uses a well-known business model where premium accounts with extra storage pay for all the free accounts, and it also accepts donations. I must warn you though: if you decide to switch from Gmail to Protonmail, the trade-off will be convenience and possibly availability of the service, due to potential attacks from governments.
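To make the idea concrete, here is a minimal illustration of the public/private key principle in Python using the third-party cryptography package. This shows only the general concept, not how Protonmail is actually implemented (Protonmail builds on OpenPGP): anyone can encrypt with your public key, but only the holder of the matching private key can decrypt.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Generate a key pair; the private key stays with you,
# the public key can be handed out freely.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# The sender encrypts with the recipient's PUBLIC key...
ciphertext = public_key.encrypt(b"meet me at noon", oaep)

# ...and only the matching PRIVATE key can decrypt it.
plaintext = private_key.decrypt(ciphertext, oaep)
print(plaintext)  # b'meet me at noon'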

Python HTML data scraping (indeed.com example)

I’m surprised at how great Python is and what you can do with this programming language. It is not just useful for data analytics, data science or statistics, but also for various other data-related activities such as pulling data from websites for all kinds of analyses. In this example I have targeted the Indeed job board website, a very nicely written job board application. The purpose of the code is to demonstrate how Python can be used to automatically get specific data from this website, in this case a list of job titles and company names. The data is first fetched with an HTML request and stored in an object, then converted to one long string and analysed to filter out what is needed and what is not, all over a couple of iterations. The library used to fetch the HTML page (lxml) has functionality to target specific XPath tags, which unfortunately didn’t work as expected, so to save time I simply used string manipulation functions and a bit of HTML string analysis to achieve the desired result. The final product of this code is a CSV file with a long list (depending on parameters) of rows with a job column and a company column. An empty database.csv file has to be created locally for this to work. The code is below, and the IPython/Jupyter notebook is attached:

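The original listing was attached as a notebook image; the sketch below reconstructs the string-manipulation approach described above. It fetches the page with requests rather than lxml (the parsing here is plain string work anyway), and the URL, query parameters and HTML markers (‘jobtitle’, ‘company’) are assumptions based on Indeed’s older markup, which changes regularly, so treat them as placeholders to adapt rather than a working recipe.

import csv
import requests

# Assumed example query; change the search term and location as needed.
URL = "https://www.indeed.com/jobs?q=python&l=London"

def between(text, start, end):
    """Return the substring between the first `start` and the next `end`."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j].strip() if j != -1 else None

html = requests.get(URL, timeout=10).text

rows = []
# Split the page on an assumed job-card marker: each chunk after the
# first should then correspond to one job posting.
for chunk in html.split('class="jobtitle"')[1:]:
    title = between(chunk, 'title="', '"')                         # job title
    company = between(chunk, '<span class="company">', '</span>')  # company name
    if title and company:
        rows.append((title, company))

# Append the results to the locally created database.csv file.
with open("database.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

print(f"Saved {len(rows)} rows")

Run it with an empty database.csv in the working directory, as noted above; each run appends job/company rows for the chosen query.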

Below is the data from the CSV file, opened in LibreOffice Calc/MS Excel:

[Screenshot: the resulting CSV opened in a spreadsheet]

The next step could be to fetch additional parameters such as salary, posting date and so on, which could potentially produce interesting data discovery insights. Linking to other sites like Glassdoor could also help extract more value. Although this was only an experiment, the code could potentially serve as the basis of a job board aggregator, fetching data from various job boards and presenting it in one place. The challenge would be to analyse the markup of each specific website and tailor the code to it, so that the data comes out as clean and accurate as possible.