Author Topic: Useful Perl modules for writing a basic webspider
rork
Guest
« on: November 06, 2011, 11:35:25 am »

Currently I'm developing my own webspider; its primary use is to index sites I want to follow but that don't offer RSS feeds. It's currently in a simple state where it downloads a page and extracts the links. Still, for retrieving pages in a nice way and parsing their contents I use a few modules I'd like to share.

To retrieve the pages I use LWP::UserAgent and HTTP::Request. I want the webspider to be polite, so I check the robots.txt with WWW::RobotRules and the robot meta tags with HTML::TokeParser.

Retrieving the page
Of course it's easy enough to use LWP::Simple to retrieve a page, but with LWP::UserAgent I can get better error messages, use cookies and add identification information. Since the request is constructed with HTTP::Request, it also allows sending POST requests.

Initialize the user agent and add information:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
# identify the bot; appending the default agent string keeps "libwww-perl/x.y" visible
$ua->agent('ExBot/alpha/' . $ua->agent);
# keep cookies between runs; autosave writes the jar back to disk
$ua->cookie_jar({ file => "$ENV{HOME}/.tmp/cookies.txt", autosave => 1 });


A standard GET request

use HTTP::Request;

my $url = "http://www.example.com/?fu=bar";
my $request = HTTP::Request->new("GET", $url);


A POST request

use HTTP::Request;

my $url = "http://www.example.com/";
my $param = "fu=bar";
my $request = HTTP::Request->new("POST", $url);
$request->content_type('application/x-www-form-urlencoded');
$request->content($param);


Retrieving the page and checking for errors

my $response = $ua->request($request);

if ($response->is_success()) {
  print $response->content;
}
else {
  print "An error occured: " . $response->status_line() . "\n";
}


Checking the robots.txt
To check the robots.txt I have to download it and check the url I want to visit against its rules. To download it I use the code above, but first I have to find it. The location is easy to guess: http://<domain>/robots.txt, and I use URI to construct that url. Once I have the robots.txt I use WWW::RobotRules to check the url I want to download against it.

Constructing the url

use URI;

# url of the page I want to index
my $url = "http://www.example.com/?fu=bar";
my $p_uri = URI->new($url);
# url of the robots.txt on the same host
my $r_uri = URI->new();

$r_uri->scheme("http");
$r_uri->host($p_uri->host());
$r_uri->path("/robots.txt");


Checking the url against the rules

use WWW::RobotRules;

my $url = "http://www.example.com/?fu=bar";
# in the real spider this is the downloaded robots.txt; here it is hard-coded
my $robots_txt = "user-agent: *\ndisallow: /tmp\n";

if (defined($robots_txt)) {
  # $r_uri is the robots.txt url constructed above
  my $rules = WWW::RobotRules->new("ExBot");
  $rules->parse($r_uri->as_string, $robots_txt);

  print $rules->allowed($url) ? "allowed\n" : "disallowed\n";
}
else {
  print "Error retrieving robots.txt\n";
}
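
Putting it together, here is a minimal sketch of how the pieces could be wired up; it reuses the agent name from above, and the $ua->get shortcut and decoded_content are plain LWP::UserAgent / HTTP::Response calls:

use LWP::UserAgent;
use URI;
use WWW::RobotRules;

my $ua = LWP::UserAgent->new(agent => 'ExBot/alpha');
my $rules = WWW::RobotRules->new('ExBot');

my $url = "http://www.example.com/?fu=bar";
my $p_uri = URI->new($url);

# robots.txt lives at the root of the same host
my $r_uri = $p_uri->clone;
$r_uri->path("/robots.txt");
$r_uri->query(undef);

# download the robots.txt and feed it to the rules parser
my $r_response = $ua->get($r_uri->as_string);
if ($r_response->is_success) {
  $rules->parse($r_uri->as_string, $r_response->decoded_content);
}

# with no rules registered (e.g. no robots.txt) everything is allowed
if ($rules->allowed($url)) {
  my $response = $ua->get($url);
  print $response->decoded_content if $response->is_success;
}
else {
  print "robots.txt disallows $url\n";
}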

A disadvantage of using WWW::RobotRules is that it emits a warning for every line it doesn't recognize as a valid robots.txt directive, which can result in a lot of noise.
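
One way to keep the output clean, assuming you are happy to simply discard (or log) those warnings, is to wrap the parse call in a local __WARN__ handler; this sketch reuses $rules, $r_uri and $robots_txt from the blocks above:

# swallow warnings emitted while parsing a sloppy robots.txt
{
  local $SIG{__WARN__} = sub {
    # ignore the noise, or append $_[0] to a log file instead of printing it
  };
  $rules->parse($r_uri->as_string, $robots_txt);
}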

Checking the robot meta tags
HTML::TokeParser jumps through an HTML page from tag to tag. This might be inefficient if you want to retrieve specific information from a specific page, but if you want to analyze a specific tag on any page it's a really easy solution.

Retrieve the robot meta tags

use HTML::TokeParser;

# $content is the page body retrieved earlier (e.g. $response->content)
my $stream = HTML::TokeParser->new(\$content);

while (my $tag = $stream->get_tag("meta")) {
  if (exists($tag->[1]{name}) and lc($tag->[1]{name}) eq "robots") {
    if (exists($tag->[1]{content})) {
      my $directives = $tag->[1]{content};
      if ($directives =~ m/noindex/i) {
        print "Not allowed to index this page\n";
      }
      if ($directives =~ m/nofollow/i) {
        print "Not allowed to follow links on this page\n";
      }
      last;
    }
  }
}
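
Extracting the links
The introduction mentions extracting links; here is a minimal sketch of doing that with the same HTML::TokeParser approach. It assumes $content and $url are the page body and address from above, and uses URI->new_abs to turn relative links into absolute ones:

use HTML::TokeParser;
use URI;

my $stream = HTML::TokeParser->new(\$content);
my @links;

while (my $tag = $stream->get_tag("a")) {
  next unless exists $tag->[1]{href};
  # resolve relative links against the url of the page itself
  push @links, URI->new_abs($tag->[1]{href}, $url)->as_string;
}

print "$_\n" for @links;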


Other interesting modules
These are the modules I currently use, but there is some other interesting stuff out there. For example, LWP::RobotUA combines LWP::UserAgent with automatic checking of the robots.txt and adds delays between requests to the same site. Another advanced module for retrieving pages is WWW::Mechanize, and there are a number of modules for HTML parsing that I want to look into later.
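
As an illustration, a minimal sketch of LWP::RobotUA (the agent name and contact address are placeholders, and delay() is specified in minutes). Since it is a subclass of LWP::UserAgent, the request code above keeps working:

use LWP::RobotUA;

# 'from' (a contact e-mail address) is required for a robot user agent
my $ua = LWP::RobotUA->new(agent => 'ExBot/alpha', from => 'exbot@example.com');
$ua->delay(1);  # wait at least one minute between requests to the same server

my $response = $ua->get("http://www.example.com/?fu=bar");
if ($response->is_success()) {
  print $response->decoded_content;
}
else {
  # requests blocked by robots.txt also come back as an error response
  print "An error occurred: " . $response->status_line() . "\n";
}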