Scrape an XHTML document using Ruby

A simple Ruby script that scrapes an XHTML file and saves the selected content to an XML file, ready for transformation into an RSS feed. This example uses the XHTML page from http://newsgang.net/audio/, saved locally as 'thegang.xml'; the extracted items are written to 'thegang_rss.xml'.

#!/usr/bin/ruby
# file: thegang.rb

require 'rexml/document'
include REXML

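# Converts the locally saved XHTML page into a simple list of items (XML).
class TheGang
  # Reads 'thegang.xml' and writes the extracted items to 'thegang_rss.xml'.
  def rssify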
    # Parse the saved page; the block form closes the file handle afterwards.
    doc = File.open('thegang.xml', 'r') { |file| Document.new(file) }
    rss_doc = Document.new
    root = Element.new('rss')
    rss_doc.add_element(root)
    
    doc.root.elements.each("body/div/ul/li/h2/a") do |node|    
      o_rssitem = Element.new('item')
      o_li = node.parent.parent
      
      o_rsstitle = Element.new('title')
      # Strip all whitespace (spaces and newlines) from the link text.
      o_rsstitle.text = node.text.gsub(/\s+/, '')
      o_rssitem.add_element(o_rsstitle)
      
      o_rsshref_audio = Element.new('href_audio')
      # attributes['href'] returns the unescaped value; REXML escapes '&' on output.
      o_rsshref_audio.text = node.attributes['href']
      o_rssitem.add_element(o_rsshref_audio)
      
      o_rsshref = Element.new('href')
      o_rsshref.text = o_rsshref_audio.text.gsub('&from=audio','')      
      o_rssitem.add_element(o_rsshref)
      
      o_rssdate = Element.new('date')
      o_rssdate.text = "#{o_li.elements["p/span[1]"].text} #{o_li.elements["p/span[2]"].text}"
      o_rssitem.add_element(o_rssdate)
      rss_doc.root.add_element(o_rssitem)
      
    end

    # Write the result; the block form closes the file even if writing fails.
    File.open('thegang_rss.xml', 'w') { |file| file.puts rss_doc }
  end
end


if __FILE__ == $0
  gang = TheGang.new
  gang.rssify
end
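
The XPath expressions in the script imply a particular page layout. The following is a minimal sketch of that layout and of the three lookups the script performs. The sample markup is hypothetical: it is inferred from the paths "body/div/ul/li/h2/a", "p/span[1]" and "p/span[2]", not copied from the real page, so the actual document may well differ.

```ruby
require 'rexml/document'

# Hypothetical sample: the layout is inferred from the XPath expressions in
# the script above, not taken from the real page.
SAMPLE_XHTML = <<~XHTML
  <html>
    <body>
      <div>
        <ul>
          <li>
            <h2><a href="/gangitem/id=6501&amp;from=audio">The Gang XII-II</a></h2>
            <p><span>Jan</span> <span>25</span></p>
          </li>
        </ul>
      </div>
    </body>
  </html>
XHTML

doc = REXML::Document.new(SAMPLE_XHTML)

# First link that matches the path used by the script.
link = doc.root.elements['body/div/ul/li/h2/a']

# a -> h2 -> li: the list item holds both the link and the date spans.
li = link.parent.parent

title = link.text.gsub(/\s+/, '')  # link text with all whitespace removed
href  = link.attributes['href']    # attribute value with entities unescaped
date  = "#{li.elements['p/span[1]'].text} #{li.elements['p/span[2]'].text}"

puts title
puts href
puts date
```

If the real page nests the list differently, only the path strings should need to change.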


See also: www.dapper.net

Output (extract):
<rss>
  <item><title>TheGangXII-II</title><href_audio>/gangitem/id=6501&amp;from=audio</href_audio><href>/gangitem/id=6501</href><date>Jan 25</date></item>
  <item><title>TheGangXII-I</title><href_audio>/gangitem/id=6499&amp;from=audio</href_audio><href>/gangitem/id=6499</href><date>Jan 25</date></item>
  <item><title>NewsGangLive01.24.08</title><href_audio>/gangitem/id=6445&amp;from=audio</href_audio><href>/gangitem/id=6445</href><date>Jan 24</date></item>
  <item><title>NewsGangLiveII</title><href_audio>/gangitem/id=6377&amp;from=audio</href_audio><href>/gangitem/id=6377</href><date>Jan 23</date></item>
  ...
</rss>
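
The extract above is an intermediate format rather than a valid feed. One possible way to carry out the transformation step is sketched below using REXML again. The function name to_rss2, the base URL used to make the links absolute, and the channel title and description are all assumptions made for illustration; none of them come from the original post.

```ruby
require 'rexml/document'

# Assumption: relative links are resolved against this host.
BASE_URL = 'http://newsgang.net'

# Hypothetical helper: builds an RSS 2.0 document from the intermediate
# <rss><item>...</item></rss> format produced by the script.
def to_rss2(intermediate_xml, base_url = BASE_URL)
  source = REXML::Document.new(intermediate_xml)

  feed = REXML::Document.new
  rss = feed.add_element('rss', 'version' => '2.0')
  channel = rss.add_element('channel')
  # RSS 2.0 requires title, link and description on the channel.
  # These values are placeholders.
  channel.add_element('title').text = 'The Gang (audio)'
  channel.add_element('link').text = "#{base_url}/audio/"
  channel.add_element('description').text = 'Items scraped from the audio page'

  source.root.elements.each('item') do |item|
    out = channel.add_element('item')
    out.add_element('title').text = item.elements['title'].text
    out.add_element('link').text = base_url + item.elements['href'].text
    # The scraped date has no year, so it is kept as free text here instead
    # of being forced into an RFC 822 <pubDate>.
    out.add_element('description').text = item.elements['date'].text
  end

  feed
end

sample = '<rss><item><title>TheGangXII-II</title>' \
         '<href_audio>/gangitem/id=6501&amp;from=audio</href_audio>' \
         '<href>/gangitem/id=6501</href><date>Jan 25</date></item></rss>'

feed = to_rss2(sample)
puts feed
```

In practice the input would be the contents of 'thegang_rss.xml', for example to_rss2(File.read('thegang_rss.xml')).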
