Download all xkcd.com comics

This loops over the first 329 comic pages (adjust the upper bound as needed) and downloads the comic strip from each one.

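#!/bin/bash

for i in $(seq 1 329)
do
	# save the comic page under a fixed name; skip numbers that fail to download
	wget -q -O index.html "http://xkcd.com/$i/" || { rm -f index.html; continue; }
	# the strip is the first quoted value on the first line that mentions the image host
	img=$(grep 'http://imgs.xkcd.com/comics/' index.html | head -1 | cut -d\" -f2)
	[ -n "$img" ] && wget -q "$img"
	rm -f index.html
done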
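
The grep and cut pipeline above amounts to finding a quoted URL under http://imgs.xkcd.com/comics/ in the page. A minimal Ruby sketch of just that extraction step follows, assuming the page still quotes the image URL with double quotes. The helper name first_comic_url and the sample markup are made up for illustration and are not part of the original script; unlike the pipeline, it takes the first matching URL rather than the second quote-delimited field of the first matching line.

```ruby
# Hypothetical helper: return the first comic image URL found in a page's
# HTML, or nil when the page contains none.
COMIC_PREFIX = "http://imgs.xkcd.com/comics/"

def first_comic_url(html)
  # Match the prefix followed by everything up to the closing double quote.
  match = html.match(/#{Regexp.escape(COMIC_PREFIX)}[^"]+/)
  match && match[0]
end

# Made-up sample markup, for illustration only.
sample = '<div><img src="http://imgs.xkcd.com/comics/example_strip.png" title="t" /></div>'
puts first_comic_url(sample) # prints http://imgs.xkcd.com/comics/example_strip.png
```

Returning nil for a page without a strip lets the caller skip the download instead of passing an empty argument to the downloader.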

tamilbeat.com mp3 crawler

Create a file called links.dat and put the song page links from tamilbeat.com in it, one per line.

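#!/usr/bin/env ruby
require 'net/http'
require 'socket'

Thread.abort_on_exception = true
threads = []
agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20070220 Firefox/2.0.0.2"

IO.foreach("links.dat") {|line|
  # skip lines that do not look like http://host/dir/page
  next unless %r{http://([^/]+)/([^/]+/+.+)}i =~ line
  domain,path = $1, $2.strip

  # fetch the page over a raw socket; every header must come before the blank line
  web = TCPSocket.new(domain,"http")
  web.print "GET /"+path+" HTTP/1.0\r\n"
  web.print "Host: "+domain+"\r\n"
  web.print "User-Agent: "+agent+"\r\n\r\n"
  answer = web.gets(nil)
  web.close
  next if answer.nil?

  # for mp3: non-greedy, so two links on one line are not merged into one match
  arr = answer.scan(/http:\/\/www[^"'<>\r\n]+?\.mp3/i)

  arr.each do |e|
    threads << Thread.new(e){|mp3|
      # split the link into host, directory and file name
      next unless %r{http://([^/]+)/([^/]+/+.+)/(.+mp3)}i =~ mp3
      website,song,name = $1, $2, $3
      a=Net::HTTP.new(website,80)
      song_get = "/"+song+"/"+name
      puts "Fetching #{website}#{song_get}"
      # get returns the response object; the file contents are in its body
      resp = a.get(song_get)
      puts "Got #{website}#{song_get}: #{resp.message}"
      # binary mode keeps the mp3 bytes intact on every platform
      open(name,'wb'){|f| f.write(resp.body)}
    }
  end
}
threads.each {|aThread| aThread.join}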
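
The script splits each mp3 link into host, path and file name with a regular expression. As an alternative sketch (not the author's method), the same split can be done with Ruby's standard uri library, which also rejects strings that are not http links. The helper name split_mp3_url and the example.com URL are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical helper: break a link into the host to connect to, the path to
# request and the local file name. Returns nil for anything that is not an
# http(s) URL.
def split_mp3_url(url)
  uri = URI.parse(url.strip)
  return nil unless uri.is_a?(URI::HTTP) && uri.host
  { host: uri.host, path: uri.path, name: File.basename(uri.path) }
rescue URI::InvalidURIError
  nil
end

# Placeholder URL, for illustration only.
parts = split_mp3_url("http://www.example.com/songs/album/track.mp3\n")
puts parts[:host] # prints www.example.com
puts parts[:name] # prints track.mp3
```

Returning nil instead of raising lets the crawler skip malformed lines in links.dat rather than abort the whole run.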

Banning bad bots

The following code is the contents of /banme/index.php. The file is linked from my main website, but the link is invisible in web browsers and the path is disallowed in robots.txt. Only bots that ignore robots.txt should therefore ever follow the link; when one does, its IP address is added to a deny rule in .htaccess and emailed to webmaster@example.com.

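<?php
// IP address of the client that requested this page
$i = $_SERVER['REMOTE_ADDR'];
// only ever write a well-formed address into .htaccess
if (filter_var($i, FILTER_VALIDATE_IP) === false) {
	exit;
}
// append the deny rule; the web server user needs write access to this file
$handle = fopen("../.htaccess", "a");
fwrite($handle, "Deny from $i\n");
fclose($handle);
echo "You've just got $i banned from this domain.  You are a very bad person.";
mail("webmaster@example.com", "Banned IP", "Deny from $i");
?>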
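
Because the client address is written straight into .htaccess, it is worth checking that it really is an IP address before appending it. The following is a minimal Ruby sketch of that check using the standard ipaddr library; the helper name deny_line is hypothetical, and the addresses come from documentation ranges rather than real bots.

```ruby
require 'ipaddr'

# Hypothetical helper: build the .htaccess rule for a client address, or
# return nil if the value is not a single well-formed IPv4/IPv6 address.
def deny_line(remote_addr)
  # IPAddr also accepts "address/mask" notation, which a client address
  # should never contain, so reject it up front.
  return nil if remote_addr.include?("/")
  ip = IPAddr.new(remote_addr)
  "Deny from #{ip}\n"
rescue IPAddr::Error
  nil
end

print deny_line("203.0.113.7") # prints "Deny from 203.0.113.7"
p deny_line("not-an-ip")        # prints nil
```

A caller would append the returned line to the rules file only when it is not nil.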

Another Perl crawler

Again, found in my old source folder; it may not fully work.

This Perl script reads in the existing links from links.dat into the array @bigarray. It then loops through the array reading in each link and appending the new links it finds to links.dat. If the script were run in a loop it would add every single web address it can find to links.dat.

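#!/usr/bin/perl
use IO::Socket;
use URI;

# read the existing links into @bigarray
open(LINKS, "< links.dat") or die "Couldn't open links.dat: $!";
@bigarray = ();
while (<LINKS>) {
        chomp;
        push(@bigarray, $_);
}
close(LINKS);

foreach $uri (@bigarray) {
        # only absolute http links name a host we can connect to
        next unless $uri =~ m{^http://}i;
        # host name of the link, without a leading "www."
        ($domain = URI->new($uri)->authority) =~ s/^www\.//i;
        $socket = IO::Socket::INET->new(PeerAddr => $domain,
                                PeerPort => 80,
                                Proto => 'tcp',
                                Type => SOCK_STREAM)
        or die "Couldn't connect";
        # request the front page; the blank line ends the headers
        print $socket "GET / HTTP/1.0\r\nHost: $domain\r\n\r\n";
        open(LINKS, ">> links.dat") or die "Couldn't open links.dat: $!";
        while (defined($line = <$socket>)) {
                # write every href found on this line, one per line of links.dat
                while ($line =~ m{href="(.*?)"}ig) {
                        print LINKS "$1\n";
                }
        }
        close(LINKS);
        close($socket);
}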
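
The link-harvesting step described above (find every href on a page and store it for later crawling) can be sketched in Ruby as follows. This is an illustration under assumptions, not a port of the Perl script: the helper name extract_links is made up, only double-quoted href values are handled, and relative links are resolved against the page URL with URI.join so that every stored entry is a full address.

```ruby
require 'uri'

# Hypothetical helper: pull every double-quoted href out of a chunk of HTML
# and resolve it against the URL of the page it came from.
def extract_links(html, base_url)
  html.scan(/href="(.*?)"/i).flatten.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil # skip hrefs that are not valid URI references
    end
  end.compact
end

# Made-up markup and placeholder URLs, for illustration only.
page = '<a href="other.html">a</a> <A HREF="http://example.org/">b</A>'
puts extract_links(page, "http://example.com/dir/page.html")
# prints http://example.com/dir/other.html
# prints http://example.org/
```

Storing absolute URLs matters here because the crawler later reads each stored line on its own, without knowing which page it was found on.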

PHP Web Crawler

Example output:

-bash-2.05b$ php asp.php
http://www.example.com
http://www.rfc-editor.org/rfc/rfc2606.txt
No links.

-bash-2.05b$ cat links.dat
http://www.example.com
http://www.rfc-editor.org/rfc/rfc2606.txt

<?php
$datafile = "links.dat"; // file to keep the list of links in
$regex = "/<\s*a\s+[^>]*href\s*=\s*[\"']?([^\"' >]+)[\"' >]/isU";  // regex to search for hrefs

// read the known links into an array, one per line, without the trailing newlines
$oldlinks = file($datafile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

foreach($oldlinks as $value) { // for every link in the array
	print $value."\n"; // print it out
	$remote = fopen(trim($value), "r") or die(); // open it or stop
	$html = "";
	while (!feof($remote)) {
		$html .= fread($remote, 8192); // append each chunk of the remote page
	}
	fclose($remote); // close it
	if (preg_match_all($regex, $html, $links)) { // if we find links
		$local = fopen($datafile, "a"); // open the data file for appending
		foreach($links[1] as $link) { // for every link on the page
			if(!in_array($link,$oldlinks)) { // if we haven't seen it before (nb - case sensitive)
				print($link."\n"); // print it out
				fwrite($local, $link."\n"); // and write it to file
				$oldlinks[] = $link; // remember it so it is only written once
			}
		}
		fclose($local); // close the data file
	}
	else {
		print("No links.\n"); // we didn't find any links in the new file
	}
}
?>
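
The duplicate check in the script is case sensitive, as its own comment notes, so the same site written with different capitalisation is stored twice. One possible refinement, sketched here in Ruby as an assumption rather than as part of the original script, is to lower-case the scheme and host (which are case-insensitive) before comparing, while leaving the path untouched. The helper names normalize_link and new_links are hypothetical.

```ruby
require 'set'
require 'uri'

# Hypothetical helper: lower-case the scheme and host of a link and leave the
# path alone, since paths may be case-sensitive on the server.
def normalize_link(link)
  uri = URI.parse(link.strip)
  uri.scheme = uri.scheme.downcase if uri.scheme
  uri.host = uri.host.downcase if uri.host
  uri.to_s
rescue URI::Error
  link.strip # leave unparseable entries as they are
end

# Return the links not seen before, recording each one in the seen set.
def new_links(found, seen)
  # Set#add? returns nil when the element was already present.
  found.map { |l| normalize_link(l) }.select { |l| seen.add?(l) }
end

seen = Set.new(["http://www.example.com/"])
found = ["HTTP://WWW.Example.COM/", "http://www.example.com/Page", "http://www.example.com/Page"]
p new_links(found, seen) # prints ["http://www.example.com/Page"]
```

A set also makes each lookup cheap compared with scanning an array for every link, which starts to matter once the list of known links grows.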

Ruby web crawler

NB. Again, this script was found in my old source code folder; it may not be fully working.

This Ruby script reads a list of links from links.dat, picks out the ones it can easily spider, and collects the URLs found on each of those pages. Every new URL it finds is added to newlinks.dat for later spidering by another bot running alongside this one.

require 'socket'

File.foreach("links.dat") do |line|
        # only spider plain http://host/path links
        next unless %r{http://([^/]+)/([^/\s]+)}i =~ line
        domain,path = $1, $2

        begin
                t = TCPSocket.new(domain, 80)
                t.print "GET /"+path+" HTTP/1.0\r\nHost: "+domain+"\r\n\r\n"
                answer = t.gets(nil)
                t.close
        rescue
                puts "error: #{$!}"
                next
        end
        next if answer.nil?

        # collect every absolute link on the page, not just the first one
        found = answer.scan(%r{<a\s+href="(\w+)://([^"]+)"[^>]*>([^<]*)</a>}i).map {|proto, url, text| proto+"://"+url }

        # merge the new URLs into newlinks.dat without duplicates
        known = File.exist?("newlinks.dat") ? File.readlines("newlinks.dat").map {|l| l.chomp } : []
        found.each do |link|
                puts link
                known << link unless known.include?(link)
        end

        # write the merged list to a temporary file, keep the old list as a backup, then swap
        File.open("links.dat.tmp", "w") {|f| known.each {|l| f.puts l } }
        File.rename("newlinks.dat", "links.dat.orig") if File.exist?("newlinks.dat")
        File.rename("links.dat.tmp", "newlinks.dat")
end
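
Both Ruby crawlers on this page write the HTTP request to the socket by hand. A small sketch of building such a request as a string follows. The helper name http_get_request is hypothetical; the rules it encodes are that each line ends with CRLF and that an empty line ends the header block, so any header such as User-Agent has to be sent before that empty line. The Host header is optional in HTTP/1.0 but is commonly needed by servers that host several sites on one address.

```ruby
# Hypothetical helper: build the text of a minimal HTTP/1.0 GET request.
def http_get_request(host, path, user_agent = nil)
  lines = ["GET #{path} HTTP/1.0", "Host: #{host}"]
  lines << "User-Agent: #{user_agent}" if user_agent
  # Each line ends with CRLF and an empty line terminates the headers.
  lines.join("\r\n") + "\r\n\r\n"
end

# Placeholder host and path, for illustration only.
print http_get_request("www.example.com", "/index.html")
```

The returned string can be passed to print on an open TCPSocket in place of the concatenated request lines.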