I’ve recently launched the Malaysia Air Pollution Index (haze.net.my), a site that scrapes data from the Malaysia Department of Environment (DOE) website and tries to make that data more relevant and meaningful, in the hope of creating something more useful to citizens.
Scraping the DOE site with Hpricot
Before I even began working on the Rails application, I needed to know whether it was possible to scrape the DOE site.
Test for One Page
Using Hpricot, I took the latest available data and wrote a little Ruby script to parse it and store all the data in a YAML file.
The YAML file would have the same structure as the fixtures Rails uses to run unit and functional tests.
I toyed with two options: storing each reading (morning and afternoon) as its own record, or storing both readings for an area together in a single record.
I opted for the second option, as the readings are not evenly spaced. The morning reading is at 11am and the afternoon reading at 5pm, so a 6-hour interval is followed by an 18-hour one. Joining them also makes it easier to compare the data for a particular area and date, and the average of the two readings can be stored in the same row. If worst comes to worst, I can split the information up at a later date.
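As a rough sketch of that layout, here is one way a combined record could be built and written out in Rails fixture style (a YAML hash of label to attributes). The field names, the fixture labels, and the sample values are all made up for illustration; the real schema isn't shown here.

```ruby
require "yaml"
require "date"

# Build one record holding both readings for an area and date.
# Field names are hypothetical, not the site's actual schema.
def build_reading(area:, state:, date:, morning:, afternoon:)
  {
    "area"      => area,
    "state"     => state,
    "date"      => date.to_s,
    "morning"   => morning,
    "afternoon" => afternoon,
    # Average of the two readings, kept in the same row.
    "average"   => (morning + afternoon) / 2.0
  }
end

# Rails fixtures are a YAML hash of label => attributes.
def to_fixtures(readings)
  labelled = readings.each_with_index.map { |r, i| ["reading_#{i + 1}", r] }
  labelled.to_h.to_yaml
end

# Invented sample values, only to show the shape of the output.
readings = [
  build_reading(area: "Example Area", state: "Example State",
                date: Date.new(2007, 1, 1), morning: 40, afternoon: 50)
]
puts to_fixtures(readings)
```

Keeping the average in the record means a page can show it without recomputing, at the cost of having to keep it in sync if a reading is ever corrected.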
While scraping, I sanitized the area and state names by trimming the strings that hold them. To do that, I initially extended the String class with a sanitize method, using Ruby’s open classes feature.
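For anyone unfamiliar with open classes, this is roughly what reopening String looks like. What my sanitize actually did isn't reproduced here; this version assumes it only needs to strip surrounding whitespace and collapse runs of whitespace inside the name.

```ruby
# Reopen the built-in String class and add a method to it.
# The body is an assumption: trim both ends, collapse inner whitespace.
class String
  def sanitize
    strip.gsub(/\s+/, " ")
  end
end

puts "  Kuala   Lumpur \n".sanitize
```

The catch with this approach is that the new method appears on every string in the program, which is worth keeping in mind once the script grows into a full application.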
With one page parsed successfully, the next step was to run the script for every date. Before that, I needed to know the URL for each date.
Finding the Pages to Parse
I wrote another script which looked through the listings for each year (2005, 2006, and 2007) and gave me three files (one per year) listing the dates alongside their respective links.
Sadly, I ran into a lot of problems with this, which resulted in:
- A lot of duplicate links and dates
- Links with missing dates
- Dates with missing links
- Missing dates and links
The problems arose because the HTML doesn’t represent the content properly, which left me with files that were only semi-useful.
I solved a small portion of it by running the files through the uniq command-line tool to remove the duplicates. But I still had to go through them manually and load each page to check that it really was for the date listed.
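The same clean-up can be sketched in Ruby. One thing to note is that the uniq tool only drops duplicate lines that are next to each other, so input usually needs sorting first, while Ruby's Array#uniq drops repeats wherever they occur. The tab-separated "date, then link" line format below is an assumption for illustration, as are the sample lines.

```ruby
# Drop blank lines and exact duplicates, keeping first occurrences in order.
def dedupe_lines(text)
  text.lines.map(&:chomp).reject(&:empty?).uniq
end

# Pick out lines where the date, the link, or both are missing,
# assuming each line is "<date>\t<link>" (hypothetical layout).
def flag_incomplete(lines)
  lines.select do |line|
    date, link = line.split("\t", 2)
    date.to_s.empty? || link.to_s.empty?
  end
end

# Invented sample input, only to show the behaviour.
sample = "2007-01-01\tpage?id=1\n" \
         "2007-01-02\tpage?id=2\n" \
         "2007-01-01\tpage?id=1\n" \
         "\tpage?id=3\n"

lines = dedupe_lines(sample)
puts lines
puts "Needs a manual look: #{flag_incomplete(lines).inspect}"
```

This doesn't remove the manual checking, since a line can be complete and still point at the wrong date, but it does narrow down where to start.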
If it didn’t, I changed it. If a date was missing, I found it by stepping the URL parameters up or down serially, since the page might not be in the listings at all!
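That hunt could be semi-automated by generating the neighbouring URLs to try, nearest first. This is only a sketch: it assumes the parameter is a single numeric id, and the URL template is a placeholder rather than the real DOE address.

```ruby
# Ids to try around a known-good id, nearest first:
# id-1, id+1, id-2, id+2, ... skipping anything that is not positive.
def neighbouring_ids(known_id, radius)
  (1..radius).flat_map { |step| [known_id - step, known_id + step] }
             .select(&:positive?)
end

# Turn those ids into URLs. The default template is a placeholder,
# not the real site's URL or parameter name.
def candidate_urls(known_id, radius, template: "http://example.invalid/reading?id=%d")
  neighbouring_ids(known_id, radius).map { |id| format(template, id) }
end

puts candidate_urls(120, 2)
```

Each candidate still has to be opened and checked against the date printed on the page, since nothing guarantees that ids and dates run in the same order.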
Initially I did this only for the 2007 dataset, as it was the most recent and so seemed more relevant to me.
Test for Many Pages
Now that I knew which link belonged to which date, I pulled down each page and stored it on my hard drive before scraping any content. That makes the pages easier to work with if I have to change my scraping algorithm in any way.
Satisfied, I ran my scraping script and produced half a year’s worth of data, with each day stored in its own file.
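A minimal sketch of that download-once idea follows. The file naming (one .html file per date) and the directory name are assumptions, and the actual download is passed in as a block so the caching logic stays separate from however the page is fetched.

```ruby
require "fileutils"

# Return the page body for a date. Read it from cache_dir if it is
# already there; otherwise run the block (the download) and save the result.
def cached_page(date, cache_dir)
  path = File.join(cache_dir, "#{date}.html")
  return File.read(path) if File.exist?(path)

  body = yield
  FileUtils.mkdir_p(cache_dir)
  File.write(path, body)
  body
end
```

In use, the block would be whatever fetches the page, for example Net::HTTP.get(URI(url)) from Ruby's standard library, called as cached_page("2007-01-01", "pages") { ... }. Re-running the scraper after a change then reads from disk instead of hitting the DOE site again.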
Most excellent.
Feeling satisfied and having a good dataset to experiment with, I thought it would be good to start creating my Rails project.
Stay Tuned for Part 2