Archive for July, 2007

Remove Firebug JavaScript Console Calls on Deployment

One thing I keep forgetting to do when deploying a Rails application is to remove any Firebug JavaScript console calls I use for debugging.

You know, those console.log(...) calls, or something like that.

When committing back to the repository, it is easy to forget to remove them. Then, once the new code is deployed, the script dies a premature death for anyone who doesn’t have Firebug installed, typically because their browser has no console object to call.

Oh noes!

This definitely isn’t cool: it’s just a minor thing you forgot to do, and it’s causing a hell of a lot of problems.

Lucky for us, we can hook into Capistrano’s after_update_code callback to comment out or strip those lines from the JavaScript files.

To Comment Out:
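
A minimal sketch of the commenting approach, assuming Capistrano 1.x, GNU sed on the server, the standard Rails public/javascripts layout, and debugging calls that each sit on their own line starting with console. (the helper name comment_console_command is hypothetical):

```ruby
# Hypothetical helper: builds a sed command that turns every line starting
# with a console.* call into a "// console.*" comment line.
# Assumes GNU sed (-i edits the files in place) and one console call per line.
# The *.js glob only covers the top-level javascripts directory.
def comment_console_command(release_path)
  script = 's|^\([[:space:]]*\)console\.|\1// console.|'
  "sed -i '#{script}' #{release_path}/public/javascripts/*.js"
end

# Assumed wiring in config/deploy.rb (Capistrano 1.x style callback):
#   task :after_update_code do
#     run comment_console_command(release_path)
#   end
```

Calls that span several lines would only have their first line commented, so this sketch is only safe for single-line calls.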

To Strip Out:
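
A minimal sketch of the stripping approach, under the same assumptions (Capistrano 1.x, GNU sed, the standard Rails public/javascripts layout, one console call per line); the helper name strip_console_command is hypothetical:

```ruby
# Hypothetical helper: builds a sed command that deletes every line starting
# with a console.* call. Assumes GNU sed (-i edits the files in place).
# The *.js glob only covers the top-level javascripts directory.
def strip_console_command(release_path)
  script = '/^[[:space:]]*console\./d'
  "sed -i '#{script}' #{release_path}/public/javascripts/*.js"
end

# Assumed wiring in config/deploy.rb (Capistrano 1.x style callback):
#   task :after_update_code do
#     run strip_console_command(release_path)
#   end
```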

So instead of removing those debugging calls from the file, just leave them there and let Capistrano and sed strip them out for ya!

Talk about an easy life :3.

Note: This only works in the top level javascripts directory.


Building the Malaysia Air Pollution Index: Part 1 - Scraping the DOE

I’ve recently launched the Malaysia Air Pollution Index (haze.net.my), a site which scrapes data from the Malaysia Department of Environment (DOE) website and tries to make the data more relevant and meaningful, hopefully creating something much more worthwhile to the citizens.

Scraping the DOE site with Hpricot

Before I even began working on the Rails application, I needed to know if it was possible to scrape the DOE site.

Test for One Page

Using Hpricot, I took the latest available data and wrote a little Ruby script to parse it and store all the data in a YAML file.

The YAML file would have the same structure as the fixtures used by Rails to run the unit and functional tests.

I toyed around with two options: storing each reading (morning and afternoon) as its own record, or storing both readings for an area together in a single record.

I opted for the second option, as the readings are not evenly spaced. The morning reading is at 11am and the afternoon reading at 5pm: a 6 hour interval, then another 18 hour interval. It is also easier to compare the data for a particular area and date if the readings are joined, and the average of the two readings can be stored in the same row. If worst comes to worst, I can split up the information at a later date.
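
A minimal sketch of what one such joined record could look like when written out as fixture-style YAML. The field names, fixture labels, and sample values below are all made up for illustration and are not taken from the real dataset:

```ruby
require "yaml"

# Builds one record per area and date, holding both readings and their
# average in the same row. All field names here are hypothetical.
def build_reading(area, state, date, morning, afternoon)
  {
    "area"      => area,
    "state"     => state,
    "date"      => date,
    "morning"   => morning,
    "afternoon" => afternoon,
    "average"   => (morning + afternoon) / 2.0
  }
end

# Rails fixtures are a hash of label => attributes, so give each record a label.
def to_fixture_yaml(readings)
  fixtures = {}
  readings.each_with_index do |reading, index|
    fixtures["reading_#{index + 1}"] = reading
  end
  fixtures.to_yaml
end

# Made-up sample values, only to show the shape of the output.
sample = [build_reading("Example Area", "Example State", "2007-07-01", 40, 50)]
puts to_fixture_yaml(sample)
```

Keeping the average in the same hash means it never has to be recomputed when comparing areas for a given date.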

While scraping, I sanitized the area name and state by trimming the respective strings. To do so, I initially extended the String class with a sanitize method via Ruby’s open classes feature.
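
A minimal sketch of such an extension. The exact cleanup rules are an assumption; this version only trims the ends and collapses inner runs of whitespace:

```ruby
# Reopen the built-in String class (Ruby's open classes) to add a
# sanitize method. The cleanup rules are a guess for illustration.
class String
  # Trims leading and trailing whitespace and collapses inner whitespace runs.
  def sanitize
    strip.gsub(/\s+/, " ")
  end
end

puts "  Example   Area \n".sanitize
```

Reopening a core class is convenient in a small script, though a plain helper method would avoid surprising other code that shares the process.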

Feeling it was a successful run, I needed to run it on all the dates. Before that, I needed to know the URL for each date.

Finding the Pages to Parse

I wrote another script which looked into the listings for the various years (2005, 2006, and 2007) and gave me three files (one for each year), each listing the dates together with their respective links.

Sadly, a lot of problems were encountered, which resulted in:

  • A lot of duplicate links and dates
  • Links with missing dates
  • Dates with missing links
  • Missing dates and links

The problems arose because the HTML doesn’t represent the content properly, which resulted in semi-useful files.

I solved a small portion of it by running it through the uniq command line app, removing all the duplicates. But I still had to go through it manually and load each page to see if it represented the correct date.
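
A minimal sketch of how that first pass could be done in Ruby, assuming a hypothetical listing format of one "date<TAB>link" pair per line. It drops exact duplicates anywhere in the file (uniq on its own only drops adjacent ones) and sets aside entries with a missing date or link for the manual check:

```ruby
# Splits listing lines into clean entries and entries needing a manual look.
# Assumes a hypothetical "date<TAB>link" line format.
def clean_listing(lines)
  seen = {}
  clean = []
  problems = []
  lines.each do |line|
    date, link = line.chomp.split("\t", 2)
    date = date.to_s.strip
    link = link.to_s.strip
    entry = [date, link]
    next if seen[entry]   # skip exact duplicates, adjacent or not
    seen[entry] = true
    if date.empty? || link.empty?
      problems << entry   # missing date or missing link
    else
      clean << entry
    end
  end
  [clean, problems]
end
```

This cannot tell whether a link really belongs to its date, so loading each page to verify it is still needed.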

If it didn’t, I changed it. If a date was missing, I found it by stepping the URL parameters up or down one at a time, as the page may not be in the listings at all!

Initially I only did this for the 2007 dataset, as it was more recent and thus seemed more relevant to me.

Test for Many Pages

Knowing which links corresponded to which dates, I pulled down each page and stored it on my hard drive before scraping, which made it easier to work with if I had to change my scraping algorithm in any way.

Satisfied, I executed my scraping script and produced half a year’s worth of data, with each day stored in a separate file.
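
A minimal sketch of that download-once step. The file naming scheme and the helper name cache_page are hypothetical, and the download itself is passed in as a callable so it can be swapped out:

```ruby
require "fileutils"

# Saves a page to disk the first time it is requested and returns its path.
# `fetch` is any callable that takes a URL and returns the HTML as a string.
# The "<date>.html" naming scheme is hypothetical.
def cache_page(dir, date, url, fetch)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{date}.html")
  File.write(path, fetch.call(url)) unless File.exist?(path)
  path
end

# Example wiring with the standard library (not run here; some_url is a placeholder):
#   require "net/http"
#   require "uri"
#   fetch = lambda { |url| Net::HTTP.get(URI.parse(url)) }
#   cache_page("pages/2007", "2007-07-01", some_url, fetch)
```

Because pages already on disk are skipped, the scraper can be rerun as often as needed without hitting the remote site again.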

Most excellent.

Feeling satisfied and having a good dataset to experiment with, I thought it would be good to start creating my Rails project.

Stay Tuned for Part 2
