Monday, May 9, 2011

Simple Ruby Scraping Script and Storing the Result in a Database

In my previous post, Simple Ruby Screen Scraper using Mechanize, Hpricot and XPath, I explained how to do screen scraping by typing the code line by line into the irb (Interactive Ruby Shell) console. Putting the same code into a Ruby script file that can be executed makes the task much easier. So, in this post you will find how to write a Ruby script, how to execute it, and how to scrape a page and store the result in a database table.

Here I will scrape all the links in the left panel under the "My Blogs" section of my website, "http://www.kumarritesh.com/", using XPath, and store the result in a database table. To store the result, let us first create the table in MySQL. I named the database 'scraping_db' and the table 'scraps'. Here is the code to create the database and the table:

CREATE DATABASE /*!32312 IF NOT EXISTS*/`scraping_db` /*!40100 DEFAULT CHARACTER SET latin1 */;
USE `scraping_db`;

/*Table structure for table `scraps` */

DROP TABLE IF EXISTS `scraps`;

CREATE TABLE `scraps` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `links` varchar(255) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1;

Now, I created the script file named 'screen-scraper.rb' and wrote the following code:

require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'active_record'

ActiveRecord::Base.establish_connection(
    :adapter  => 'mysql',
    :host     => 'localhost',
    :username => 'root',
    :password => '',
    :database => 'scraping_db')

class Scrap < ActiveRecord::Base
  @url      = "http://www.kumarritesh.com"
  @response = ''
  open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}") { |f|
    @response = f.read
  }

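  doc   = Hpricot(@response)
  # innerHTML on the whole matched set returns one concatenated String.
  links = (doc/"/html/body/div/div/div[2]/div[2]/div[3]/div/ul/li/a").innerHTML
  puts "#{links} "

  # String#each (Ruby 1.8) yields line by line, so this loop runs once here.
  links.each do |link|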
    scraplink = Scrap.new(:links => link)
    scraplink.save
    puts "#{link} "
  end
end

Browse to the folder where this script is saved and execute it with the command "ruby screen-scraper.rb".
It prints "Quality AssuranceSearch Engine OptimizationRuby on RailsBlog Links" twice on the console and creates one record whose links column contains "Quality AssuranceSearch Engine OptimizationRuby on RailsBlog Links".
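
The script stores all four link titles as one concatenated record. If one record per link is preferred, the titles can be collected element by element instead. The following is only a minimal sketch of that idea, not the script above: it uses REXML from Ruby's standard distribution in place of Hpricot so that it runs without extra gems, and the sample markup, the href values and the helper name link_texts are assumptions made up for illustration. REXML needs well-formed markup, so it may not parse a real page as-is.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the "My Blogs" list on the page.
SAMPLE_HTML = <<~HTML
  <ul>
    <li><a href="/qa">Quality Assurance</a></li>
    <li><a href="/seo">Search Engine Optimization</a></li>
    <li><a href="/rails">Ruby on Rails</a></li>
    <li><a href="/links">Blog Links</a></li>
  </ul>
HTML

# Return the text of every element matched by the XPath,
# one array entry per matched element (hypothetical helper).
def link_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |a| a.text.to_s.strip }
end

titles = link_texts(SAMPLE_HTML, '/ul/li/a')
titles.each { |title| puts title }

# With the Scrap model from the script, each title could then be
# saved as its own row, for example:
#   titles.each { |title| Scrap.create(:links => title) }
```

With Hpricot, the analogous step would presumably be to iterate over the matched elements and read the inner HTML of each one, rather than calling innerHTML on the whole set.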

So, now you know how to write a Ruby script, how to execute it, and how to scrape a page and store the result in a database table.
