Python HTML data scraping (example)

I’m surprised how great Python is, and how much you can do with this programming language. It’s not just useful for data analytics, data science or statistics, but also for various other data-related activities, such as pulling data from a website for a variety of analyses. In this example I have targeted the Indeed job board website, a very nicely written job board application. The purpose of the code is to demonstrate how Python can be used to automatically get specific data from this website, in this case a list of job titles and company names.

The data is first fetched with an HTTP request, using a specific library in the code, and stored in an object. It is then converted to one long string and analysed in order to filter out what is needed and what is not. This all happens over a couple of iterations. The library that was used to get the HTML page (lxml) has some functionality for targeting specific elements with XPath, which unfortunately didn’t work as expected, so to save time I simply used string manipulation functions and a bit of HTML string analysis to achieve the desired result.

The final product of this code is a CSV file with a long list of rows (the length depends on the parameters), each with a job and a company column. An empty database.csv file has to be created locally for this to work. The code and the IPython/Jupyter notebook are attached below.
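As a rough illustration of the string-based approach described above (this is not the attached code itself), here is a minimal sketch. Everything specific in it is an assumption made for the example: the sample markup in `SAMPLE_HTML`, the `jobTitle` and `companyName` class names, and the helper names `extract_between`, `parse_jobs` and `write_csv`. Indeed's real markup is different and changes over time, so the markers would need to be adapted after inspecting the actual page source.

```python
import csv

# Hypothetical sample markup standing in for the fetched page.
# A real job board page will look different.
SAMPLE_HTML = """
<div class="result"><h2 class="jobTitle">Data Analyst</h2>
<span class="companyName">Acme Ltd</span></div>
<div class="result"><h2 class="jobTitle">Python Developer</h2>
<span class="companyName">Globex</span></div>
"""

# Assumed markers that surround each value in the sample markup.
TITLE_START, TITLE_END = '<h2 class="jobTitle">', "</h2>"
COMPANY_START, COMPANY_END = '<span class="companyName">', "</span>"


def extract_between(text, start_marker, end_marker, pos=0):
    """Return (value, next_pos) for the text between two markers.

    Searching starts at `pos`. Returns (None, -1) if either marker is missing.
    """
    start = text.find(start_marker, pos)
    if start == -1:
        return None, -1
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None, -1
    return text[start:end].strip(), end + len(end_marker)


def parse_jobs(page):
    """Walk through the page string and collect (job, company) pairs."""
    rows = []
    pos = 0
    while True:
        title, pos = extract_between(page, TITLE_START, TITLE_END, pos)
        if title is None:
            break
        company, pos = extract_between(page, COMPANY_START, COMPANY_END, pos)
        if company is None:
            break
        rows.append((title, company))
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file with a job and a company column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["job", "company"])
        writer.writerows(rows)


rows = parse_jobs(SAMPLE_HTML)
```

Calling `write_csv(rows, "database.csv")` afterwards would produce the two-column file. In a real run, the `page` argument would be the text of the HTTP response rather than a hard-coded sample, and the loop would be repeated for each results page.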


Below is all the data in the CSV file (opened in LibreOffice Calc/MS Excel):

The next step could be getting additional data fields such as salary, posting date and so on. This could potentially produce interesting insights. Linking to other sites such as Glassdoor could also add value. Although this was only an experiment, the code could potentially help to build a job board aggregator that fetches data from various job boards and presents it in one place. The challenge would be to analyse the HTML of each specific website and tailor the code to it, so that the data comes out as clean and accurate as possible.
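To make the aggregator idea a little more concrete, here is a small sketch of how the per-site tailoring might be kept in one place. It is only an assumption about how such a system could be organised: the site names `board_a` and `board_b`, their markers, and the functions `between`, `parse_site` and `aggregate` are all hypothetical, and the sample pages are made up.

```python
# Hypothetical per-site rules: the (start, end) markers around each field.
SITE_RULES = {
    "board_a": {
        "job": ('<h2 class="jobTitle">', "</h2>"),
        "company": ('<span class="companyName">', "</span>"),
    },
    "board_b": {
        "job": ('<a class="title">', "</a>"),
        "company": ('<div class="employer">', "</div>"),
    },
}


def between(text, start_marker, end_marker, pos):
    """Return (value, next_pos) for text between two markers, or (None, -1)."""
    start = text.find(start_marker, pos)
    if start == -1:
        return None, -1
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None, -1
    return text[start:end].strip(), end + len(end_marker)


def parse_site(site, page):
    """Extract job/company rows from one page using that site's rules."""
    rules = SITE_RULES[site]
    rows, pos = [], 0
    while True:
        job, pos = between(page, *rules["job"], pos)
        if job is None:
            break
        company, pos = between(page, *rules["company"], pos)
        if company is None:
            break
        rows.append({"site": site, "job": job, "company": company})
    return rows


def aggregate(pages):
    """Combine rows from several sites; `pages` maps site name to HTML text."""
    combined = []
    for site, page in pages.items():
        combined.extend(parse_site(site, page))
    return combined


# Made-up sample pages, one per hypothetical site.
pages = {
    "board_a": '<h2 class="jobTitle">Data Analyst</h2>'
               '<span class="companyName">Acme Ltd</span>',
    "board_b": '<a class="title">Data Engineer</a>'
               '<div class="employer">Globex</div>',
}
listings = aggregate(pages)
```

Keeping the markers in a table like `SITE_RULES` means that supporting another job board, or another field such as salary, would mostly be a matter of adding entries rather than rewriting the parsing loop.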


