Dear Perl experts,

I'm trying to write a very simple spider for web crawling. A few notes on the code below:

- it determines whether $url has already been seen, with a duplicate check that uses a hash (%visited);
- it fetches URLs, but I still have to tinker with it a bit: I also want to fetch the content of the pages, and that needs a little tailoring;
- finally, I want to store it all in a file, or, even better, in CSV format (there is a rough sketch of that further down).

Here's the code:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

open my $file1, '>>', 'links.txt' or die "Cannot open links.txt: $!";

my @urls = ('');   # seed URL(s) go here
my %visited;       # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};

  # Mark it as visited right away so we never request it twice
  $visited{$url} = 1;

  my $request  = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {
    print $response->status_line, "\n";
    next;
  }
  my $contents = $response->content();

  my $page_parser = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents);   # feed the fetched HTML to the parser (otherwise links() returns nothing)
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    print {$file1} "$$link[2]\n";   # also write each link to links.txt
    push @urls, "$$link[2]";
  }

  sleep 60;   # wait between requests, to be polite to the server
}

close $file1;
The results look like this:

and so forth. Well, this is not quite what I want: I get the links, but I also want to fetch the content of the pages the links point to.

And once the data of the first page has been fetched, I want to switch to the second page, and so forth.

In other words: the content of $url itself already ends up in $contents; now, for each extracted link, I have to request that URL the same way I requested $url, and store the content that comes back. I will try to achieve that.
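Something like this is what I have in mind for the "fetch the linked pages and store them as CSV" part - just a rough sketch, assuming Text::CSV is installed (the fetch() helper and the file name results.csv are my own placeholders, not fixed requirements):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Text::CSV;

my $browser = LWP::UserAgent->new();
my $csv     = Text::CSV->new({ binary => 1, eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $out, '>>', 'results.csv' or die "Cannot open results.csv: $!";

# fetch one page and return its content (undef on error)
sub fetch {
    my ($url) = @_;
    my $response = $browser->get($url);
    return $response->is_success ? $response->content : undef;
}

my $url      = '';             # start URL, as above
my $contents = fetch($url);

if (defined $contents) {
    my $parser = HTML::LinkExtor->new(undef, $url);
    $parser->parse($contents);

    foreach my $link ($parser->links) {
        my $target = "$$link[2]";
        my $page   = fetch($target);    # the content the link points to
        next unless defined $page;

        # one CSV row per fetched page: URL plus its content
        $csv->print($out, [ $target, $page ]);
        sleep 60;                       # stay polite here as well
    }
}

close $out;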

One last note: surely this could also be done with WWW::Mechanize instead, by "following" the links. I have had a closer look at the link methods:

$mech->follow_link( ... )

Follows a specified link on the page. You specify the match to be found using the same parms that find_link() uses.

Here are some examples:

3rd link called "download"

$mech->follow_link( text => 'download', n => 3 );

first link where the URL has "download" in it, regardless of case:

$mech->follow_link( url_regex => qr/download/i );

or

$mech->follow_link( url_regex => qr/(?i:download)/ );

3rd link on the page

$mech->follow_link( n => 3 );

the link with the URL

$mech->follow_link( url => '/other/page' );


$mech->follow_link( url => '' );

Returns the result of the GET method (an HTTP::Response object) if a link was found. If the page has no links, or the specified link couldn't be found, returns undef.
$mech->find_link( ... )

Finds a link in the currently fetched page. It returns a WWW::Mechanize::Link object which describes the link. (You'll probably be most interested in the url() property.) If it fails to find a link it returns undef.

You can take the URL part and pass it to the get() method. If that's your plan, you might as well use the follow_link() method directly, since it does the get() for you automatically.
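(So if I understand this correctly, these two should do the same thing - the url_regex here is just an example:)

# the long way: find the link, then get() its URL yourself
my $link = $mech->find_link( url_regex => qr/download/i );
$mech->get( $link->url ) if $link;

# the short way: follow_link() does the get() for me
$mech->follow_link( url_regex => qr/download/i );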

Note that <FRAME SRC="..."> tags are parsed out of the HTML and treated as links, so this method works with them.

You can select which link to find by passing in one or more of these key/value pairs:

text => 'string', and text_regex => qr/regex/,

text matches the text of the link against string, which must be an exact match. To select a link with text that is exactly "download", use

$mech->find_link( text => 'download' );

text_regex matches the text of the link against regex. To select a link with text that has "download" anywhere in it, regardless of case, use

$mech->find_link( text_regex => qr/download/i );

and so forth ....
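So a Mechanize version of the spider could look roughly like this - again just a sketch, with the same blank seed URL and links.txt as above (autocheck => 0 is my own choice, so that errors don't die but can be checked by hand):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );

open my $file1, '>>', 'links.txt' or die "Cannot open links.txt: $!";

my @urls = ('');    # seed URL(s), as above
my %visited;

while (@urls) {
    my $url = shift @urls;
    next if $visited{$url};
    $visited{$url} = 1;

    my $response = $mech->get($url);
    if (!$response->is_success) {
        print $response->status_line, "\n";
        next;
    }

    my $contents = $mech->content;    # the page content I was after

    # find_all_links() returns WWW::Mechanize::Link objects;
    # url_abs() resolves each one to an absolute URL
    foreach my $link ($mech->find_all_links()) {
        print {$file1} $link->url_abs, "\n";
        push @urls, $link->url_abs . '';    # stringify the URI object
    }

    sleep 60;    # wait between requests
}

close $file1;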