<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DZone Snippets: crawler code</title>
    <link>http://snippets.dzone.com/posts</link>
    <pubDate>Fri, 16 May 2008 22:59:08 GMT</pubDate>
    <description>DZone Snippets: crawler code</description>
    <item>
      <title>Download all xkcd.com comics</title>
      <link>http://snippets.dzone.com/posts/show/4658</link>
      <description>This loops over the first 329 comic pages (adjust the count as needed) and downloads each comic strip.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;# fetch each comic page, then download the first comic image it references&lt;br /&gt;for i in $(seq 1 329)&lt;br /&gt;do&lt;br /&gt;	wget -O index.html "http://xkcd.com/$i/"&lt;br /&gt;	wget "$(grep http://imgs.xkcd.com/comics/ index.html | head -1 | cut -d\" -f2)"&lt;br /&gt;	rm index.html&lt;br /&gt;done&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Mon, 15 Oct 2007 18:05:51 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/4658</guid>
      <author>scvalex (Alexandru Scvortov)</author>
    </item>
    <item>
      <title>tamilbeat.com mp3 crawler</title>
      <link>http://snippets.dzone.com/posts/show/3668</link>
      <description>Create a file called links.dat and put the song links from tamilbeat.com in it, one per line.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/usr/bin/env ruby&lt;br /&gt;require 'net/http'&lt;br /&gt;require 'socket'&lt;br /&gt;&lt;br /&gt;Thread.abort_on_exception = true&lt;br /&gt;threads = []&lt;br /&gt;&lt;br /&gt;IO.foreach("links.dat") {|line|&lt;br /&gt;  # skip lines that are not http:// URLs with a path&lt;br /&gt;  next unless %r{http://([^/]+)/([^/]+/+.+)}i =~ line&lt;br /&gt;  domain,path = $1, $2&lt;br /&gt;  web = TCPSocket.new(domain,"http")&lt;br /&gt;  # send every header before the blank line that ends the request&lt;br /&gt;  web.print "GET /"+path+" HTTP/1.0\r\n"&lt;br /&gt;  web.print "User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2\r\n\r\n"&lt;br /&gt;  answer = web.gets(nil)&lt;br /&gt;  web.close&lt;br /&gt;&lt;br /&gt;  # for mp3&lt;br /&gt;  arr = answer.scan(/http:\/\/www.+mp3/)&lt;br /&gt;&lt;br /&gt;  arr.each do |e|&lt;br /&gt;    threads &lt;&lt; Thread.new(e){|mp3|&lt;br /&gt;      # skip URLs that do not split into host, directory and file name&lt;br /&gt;      next unless %r{http://([^/]+)/([^/]+/+.+)/(.+mp3)}i =~ mp3&lt;br /&gt;      website,song,name = $1, $2, $3&lt;br /&gt;      a=Net::HTTP.new(website,80)&lt;br /&gt;      song_get = "/"+song+"/"+name&lt;br /&gt;      puts "Fetching #{website}#{song_get}"&lt;br /&gt;      # get returns the response object; the file contents are in its body&lt;br /&gt;      resp = a.get(song_get)&lt;br /&gt;      puts "Got #{website}#{song_get}: #{resp.message}"&lt;br /&gt;      open(name,'wb'){|f| f.write(resp.body)}&lt;br /&gt;    }&lt;br /&gt;  end&lt;br /&gt;}&lt;br /&gt;threads.each {|aThread| aThread.join}&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Wed, 14 Mar 2007 07:53:45 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/3668</guid>
      <author>mbchandar (balachandar)</author>
    </item>
    <item>
      <title>Banning bad bots</title>
      <link>http://snippets.dzone.com/posts/show/1935</link>
      <description>The following code is the contents of /banme/index.php. The file is linked to from my main website, but the link is invisible in web browsers and the path is disallowed in robots.txt. Therefore only bad bots will ever follow the link; when they do, they are banned in .htaccess and their IP address is emailed to webmaster@example.com.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;?php&lt;br /&gt;// address of the client that requested this page&lt;br /&gt;$i = getenv('REMOTE_ADDR');&lt;br /&gt;// append a deny rule for that address to the site's .htaccess&lt;br /&gt;$handle = fopen("../.htaccess", "a");&lt;br /&gt;fwrite($handle, "Deny from $i\n");&lt;br /&gt;fclose($handle);&lt;br /&gt;echo "You've just got $i banned from this domain.  You are a very bad person.";&lt;br /&gt;mail("webmaster@example.com", "Banned IP", "Deny from $i");&lt;br /&gt;?&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Tue, 18 Apr 2006 15:49:13 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/1935</guid>
      <author>lordrich ()</author>
    </item>
    <item>
      <title>Another perl crawler</title>
      <link>http://snippets.dzone.com/posts/show/1895</link>
      <description>Again, this was found in my old source folder; it may not fully work.&lt;br /&gt;&lt;br /&gt;This Perl script reads the existing links from links.dat into the array @bigarray. It then loops through the array, fetching each link and appending the new links it finds to links.dat. If the script were run in a loop, it would add every web address it can find to links.dat.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;#!/usr/bin/perl&lt;br /&gt;use IO::Socket;&lt;br /&gt;use URI;&lt;br /&gt;&lt;br /&gt;open(LINKS, "&lt; links.dat");&lt;br /&gt;@bigarray = ();&lt;br /&gt;while (&lt;LINKS&gt;) {&lt;br /&gt;        chomp;&lt;br /&gt;        push(@bigarray, $_);&lt;br /&gt;}&lt;br /&gt;close(LINKS);&lt;br /&gt;&lt;br /&gt;foreach $uri (@bigarray) {&lt;br /&gt;        ($domain = URI-&gt;new($uri)-&gt;authority) =~ s/^www\.//i;&lt;br /&gt;        $socket = IO::Socket::INET-&gt;new(PeerAddr&lt;br /&gt;                                =&gt; $domain,&lt;br /&gt;                                PeerPort =&gt; 80,&lt;br /&gt;                                Proto =&gt; 'tcp',&lt;br /&gt;                                Type =&gt; SOCK_STREAM)&lt;br /&gt;        or die "Couldn't connect";&lt;br /&gt;        print $socket "GET / HTTP/1.0\r\n\r\n";&lt;br /&gt;        #$page = &lt;$socket&gt;;&lt;br /&gt;        open(LINKS, "&gt;&gt; links.dat");&lt;br /&gt;        while (defined($line = &lt;$socket&gt;)) {&lt;br /&gt;                # write every href found on this line, one link per line&lt;br /&gt;                while ($line =~ m{href="(.*?)"}ig) {&lt;br /&gt;                        print LINKS "$1\n";&lt;br /&gt;                }&lt;br /&gt;        }&lt;br /&gt;        close(LINKS);&lt;br /&gt;        close($socket);&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Tue, 11 Apr 2006 20:41:05 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/1895</guid>
      <author>lordrich ()</author>
    </item>
    <item>
      <title>PHP Web Crawler</title>
      <link>http://snippets.dzone.com/posts/show/1894</link>
      <description>Example output:&lt;br /&gt;&lt;br /&gt;-bash-2.05b$ php asp.php&lt;br /&gt;http://www.example.com&lt;br /&gt;http://www.rfc-editor.org/rfc/rfc2606.txt&lt;br /&gt;No links.&lt;br /&gt;&lt;br /&gt;-bash-2.05b$ cat links.dat&lt;br /&gt;http://www.example.com&lt;br /&gt;http://www.rfc-editor.org/rfc/rfc2606.txt&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;?php&lt;br /&gt;$datafile = "links.dat"; // file to keep the list of links in&lt;br /&gt;$regex = "/&lt;\s*a\s+[^&gt;]*href\s*=\s*[\"']?([^\"' &gt;]+)[\"' &gt;]/isU";  // regex to search for hrefs&lt;br /&gt;&lt;br /&gt;$oldlinks = array();&lt;br /&gt;$handle = fopen($datafile, "r"); // open the data file&lt;br /&gt;while (($buffer = fgets($handle, 4096)) !== false) {&lt;br /&gt;	$oldlinks[] = $buffer; // read every link into an array&lt;br /&gt;}&lt;br /&gt;fclose($handle); // close the data file&lt;br /&gt;&lt;br /&gt;foreach($oldlinks as $value) { // for every link in the array&lt;br /&gt;	print $value; // print it out&lt;br /&gt;	$remote = fopen(trim($value), "r") or die(); //open it or fail nicely&lt;br /&gt;	$html = ""; // collects the whole page&lt;br /&gt;	while (!feof($remote)) {&lt;br /&gt;		$html .= fread($remote, 8192); // append the next chunk of the remote page&lt;br /&gt;	}&lt;br /&gt;	fclose($remote); // close it&lt;br /&gt;	if (preg_match_all($regex, $html, $links)) { // if we find new links&lt;br /&gt;		$local = fopen($datafile, "a+"); // open the data file&lt;br /&gt;		foreach($links[1] as $link) { // for every new link&lt;br /&gt;			$link.="\n"; // append a new line&lt;br /&gt;			if(!in_array($link,$oldlinks)) { // if we haven't seen it before (nb - case sensitive)&lt;br /&gt;				print($link); // print it out&lt;br /&gt;				fwrite($local, $link); // and write it to file&lt;br /&gt;			}&lt;br /&gt;		}&lt;br /&gt;		fclose($local); // close the data file&lt;br /&gt;	}&lt;br /&gt;	else {&lt;br /&gt;		print("No links."); // we didn't find any links in the new file&lt;br /&gt;	}&lt;br /&gt;}&lt;br /&gt;?&gt;&lt;br /&gt;&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Tue, 11 Apr 2006 20:39:36 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/1894</guid>
      <author>lordrich ()</author>
    </item>
    <item>
      <title>Ruby web crawler</title>
      <link>http://snippets.dzone.com/posts/show/1893</link>
      <description>NB: Again, this script was found in my old source code folder; it may not fully work.&lt;br /&gt;&lt;br /&gt;This Ruby script reads a list of links from links.dat, picks out the ones it can easily spider, and extracts the first absolute link from each of those pages. Every new URL it finds is added to newlinks.dat for later spidering by another bot running alongside this one.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;require 'socket'&lt;br /&gt;links = File.open("links.dat")&lt;br /&gt;while links.gets do&lt;br /&gt;        #domain = ($_ =~ /http:\/\/.*\.([0-9a-zA-Z\-]+\.com|net|org)/);&lt;br /&gt;        # only spider plain http:// links that have a host and a path&lt;br /&gt;        next unless %r{http://([^/]+)/([^/]+)}i =~ $_&lt;br /&gt;        domain,path = $1, $2&lt;br /&gt;        begin&lt;br /&gt;                t = TCPSocket.new(domain, 80)&lt;br /&gt;                t.print "GET /"+path+" HTTP/1.0\r\n\r\n"&lt;br /&gt;                answer = t.gets(nil)&lt;br /&gt;                t.close&lt;br /&gt;        rescue&lt;br /&gt;                puts "error: #{$!}"&lt;br /&gt;                next&lt;br /&gt;        end&lt;br /&gt;&lt;br /&gt;        # skip pages that contain no absolute link&lt;br /&gt;        next unless %r{&lt;a\s+href="(\w+)://([^"]+)"[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;}i =~ answer&lt;br /&gt;        proto, url, text = $1, $2, $3&lt;br /&gt;&lt;br /&gt;        print proto+"://"+url+"\n"&lt;br /&gt;        # copy newlinks.dat without any earlier copy of this URL, then append it&lt;br /&gt;        old = File.open("newlinks.dat")&lt;br /&gt;        new = File.open("links.dat.tmp", File::WRONLY|File::TRUNC|File::CREAT)&lt;br /&gt;        while old.gets do&lt;br /&gt;                if $_.chomp != proto+"://"+url&lt;br /&gt;                        new.print $_&lt;br /&gt;                end&lt;br /&gt;        end&lt;br /&gt;        new.puts proto+"://"+url&lt;br /&gt;        old.close&lt;br /&gt;        new.close&lt;br /&gt;        File.rename("newlinks.dat", "links.dat.orig")&lt;br /&gt;        File.rename("links.dat.tmp", "newlinks.dat")&lt;br /&gt;end&lt;br /&gt;links.close&lt;br /&gt;&lt;/code&gt;</description>
      <pubDate>Tue, 11 Apr 2006 20:36:29 GMT</pubDate>
      <guid>http://snippets.dzone.com/posts/show/1893</guid>
      <author>lordrich ()</author>
    </item>
  </channel>
</rss>
