Scraping Google Search Results with Hpricot

# snagged from http://g-module.rubyforge.org/

require 'rubygems'
require 'cgi'
require 'open-uri'
require 'hpricot'

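# Search terms (German for "my little search query"): escape each word, join with "+"
q = %w{meine kleine suchanfrage}.map { |w| CGI.escape(w) }.join("+")
url = "http://www.google.com/search?q=#{q}"
# On Ruby 3.0+ open-uri no longer patches Kernel#open; use URI.open(url) there
doc = Hpricot(open(url).read)
# href of the first link inside a result block (div class="g")
lucky_url = (doc/"div[@class='g'] a").first["href"]
# Single quotes do not interpolate, so pass the URL as its own argument ('open' is the macOS opener)
system 'open', lucky_url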
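
The query-building step can be tried on its own, without touching the network or installing Hpricot. A minimal sketch using only the standard library; the helper name `google_search_url` is made up for illustration and is not part of the original snippet.

```ruby
require 'cgi'

# Hypothetical helper: builds the search URL the same way the snippet does,
# escaping each word with CGI.escape and joining the words with "+".
def google_search_url(words)
  q = words.map { |w| CGI.escape(w) }.join("+")
  "http://www.google.com/search?q=#{q}"
end

puts google_search_url(%w{meine kleine suchanfrage})
# => http://www.google.com/search?q=meine+kleine+suchanfrage

# Characters that are special in a query string get percent-encoded,
# and spaces inside a single term become "+".
puts google_search_url(["c++", "ruby & rails"])
# => http://www.google.com/search?q=c%2B%2B+ruby+%26+rails
```

Note the double-quoted string in the helper: `#{q}` is only interpolated inside double quotes, never inside single quotes.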

Comments on this post

rubyminer posts on Jun 13, 2007 at 01:31
The same code with scRUBYt! (http://scrubyt.org) - this one also crawls to the next page, yielding 20 results.

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
   fetch 'http://www.google.com/ncr'
   fill_textfield 'q', 'ruby'
   submit

   link "Ruby Programming Language/@href"
   next_page "Next", :limit => 2
end

puts google_data.to_xml 


Result:

<root>
   <link>http://www.ruby-lang.org/</link>
   <link>http://www.ruby-lang.org/en/20020101.html</link>
   <link>http://en.wikipedia.org/wiki/Ruby_programming_language</link>
   <link>http://en.wikipedia.org/wiki/Ruby</link>
   <link>http://www.rubyonrails.org/</link>
   <link>http://www.rubycentral.com/</link>
   <link>http://www.rubycentral.com/book/</link>
   <link>http://www.w3.org/TR/ruby/</link>
   <link>http://www.zenspider.com/Languages/Ruby/QuickRef.html</link>
   <link>http://poignantguide.net/</link>
   <link>http://www.rubynz.com/</link>
   <link>http://www.ruby-doc.org/</link>
   <link>http://tryruby.hobix.com/</link>
   <link>http://www.rubycentral.org/</link>
   <link>http://www.gemstone.org/ruby.html</link>
   <link>http://whytheluckystiff.net/ruby/pickaxe/</link>
   <link>http://intertwingly.net/blog/</link>
   <link>http://lotusmedia.org/</link>
   <link>http://rubyforge.org/frs/?group_id=167</link>
   <link>http://www.oreillynet.com/ruby/</link>
</root> 


If this does not look robust (and it isn't: the pattern is tied to one example result, so it breaks as soon as you change the search query), scRUBYt! can export a production extractor that uses an XPath instead:

require 'rubygems'
require 'scrubyt'

google_data = Scrubyt::Extractor.define do
  fetch("http://www.google.com/ncr")
  fill_textfield("q", "anything else")
  submit

  link "/html/body/div/div/div/a"
  next_page "Next", :limit => 2
end

puts google_data.to_xml
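
To see why a structural path survives a changed query while a text-anchored lookup does not, here is a rough sketch with REXML (bundled with Ruby). It is not how scRUBYt! works internally; the page markup, the `result_page` and `hrefs` helpers, and the example.com link are all invented for the illustration.

```ruby
require 'rexml/document'

# Hypothetical, heavily simplified result page (not real Google markup).
def result_page(links)
  items = links.map { |text, href| %(<div><a href="#{href}">#{text}</a></div>) }.join
  REXML::Document.new("<html><body><div><div>#{items}</div></div></body></html>")
end

# Collect the href attribute of every node the XPath matches.
def hrefs(doc, xpath)
  REXML::XPath.match(doc, xpath).map { |a| a.attributes['href'] }
end

RUBY_RESULTS  = result_page({ 'Ruby Programming Language' => 'http://www.ruby-lang.org/',
                              'Ruby on Rails'             => 'http://www.rubyonrails.org/' })
OTHER_RESULTS = result_page({ 'Example Domain' => 'http://example.com/' })

BY_TEXT = "//a[text()='Ruby Programming Language']"  # anchored to one example result
BY_PATH = "/html/body/div/div/div/a"                 # anchored to page structure

p hrefs(RUBY_RESULTS, BY_TEXT)   # finds the example link
p hrefs(OTHER_RESULTS, BY_TEXT)  # nothing: the example text is not on this page
p hrefs(RUBY_RESULTS, BY_PATH)   # every result link
p hrefs(OTHER_RESULTS, BY_PATH)  # still works for a different query
```

The trade-off is that the structural path breaks when the page layout changes, so it needs regenerating whenever the markup does.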

rubyminer posts on Aug 15, 2007 at 15:11
Here is a detailed tutorial on scraping Google:

http://scrubyt.org/scrapin-google-in-no-sec/
