Scraping the New Zealand Whitepages with Ruby

A post by Peter Hollows about web scraping and useful gems. Posted 9 months ago.

In New Zealand, telcos don’t expose their directories to the public in any sensible way: if your script needs to look up numbers for a given name, there is no RESTful API. Instead, these companies provide us with the data in a challenging HTML format; this is because they are nice and want to give us a fun scripting project.

Web-scraping means parsing web resources such as HTML pages and reliably extracting the desired information, usually while remaining undetected (it adds to the fun).

Enter Mechanize & Nokogiri

Ruby has a great web-scraping tool called Mechanize, which takes care of pretending we’re a browser. It runs Nokogiri (another great library) in the background to parse the returned pages.

You can install these gems by running:

sudo gem install nokogiri mechanize

Implementation

The code is pretty simple. We pretend to be IE6 loading the WhitePages website. Once the page has loaded, we find the search form, fill it in and submit it using Mechanize. Nokogiri parses the results and we’re left with a nice wrapper.

require 'nokogiri'
require 'mechanize'

class WhitePages
  def self.search(what, where)
    # Fetch using Mechanize (older releases namespaced the class as WWW::Mechanize).
    agent = Mechanize.new
    agent.user_agent_alias = 'Windows IE 6'
    form = agent.get('http://whitepages.co.nz/').forms.last
    form.what = what
    form.where = where
    page = agent.submit(form)
    
    # Parse using Nokogiri.
    page = Nokogiri.HTML(page.content)
    results = []
    page.css('table#searchResultsTbl tbody.resultBody').each do |html|
      results << {
        :name => html.css('tr.firstRow td.datarow a').first.content,
        :address => html.css('tr dd.bizAddr').first.content,
        :number => html.search('tr dd.phone span.phoneStatic').first.content.gsub(/\D/, '')
      }
    end
    results
  end
end

What does it do?

# Running the following:
WhitePages.search 'P Hollows', 'Christchurch'

# Returns:
[{:name    => "Hollows I & P",
  :number  => "033771403",
  :address => "2/28 Gloucester St Christchurch"},
 {:name    => "Manley T & Hollows P", # old listing
  :number  => "039805473",
  :address => "15 George St Riccarton Christchurch"}]

Disclaimer

Be careful with what you publish when scraping information from publicly available resources, as it’s usually protected by copyright.