#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2017
    Posts
    7
    Rep Power
    0

    tinker & tailoring a little parser script


hello dear experts,

I'm pretty new to programming, and to OO programming especially. Nonetheless, I'm trying to get a very simple spider for web crawling done.
Here's what I got to work - but ... wait:

I want to modify the script a bit - tailoring and tinkering is the way to learn.
I want to fetch URLs that contain a certain term in the URL string.

    Code:
#!C:\Perl\bin\perl

use strict; # You always want to include both strict and warnings
use warnings;

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

# There was no reason for this to be in a BEGIN block (and there
# are a few good reasons for it not to be)
open my $file1, "+>>", "links.txt" or die "Cannot open links.txt: $!";
select($file1);

# The URL I want it to start at;
# Note that I've made this an array, @urls, rather than a scalar, $URL
my @urls = ('https://the url goes in here');
my %visited;  # The % sigil indicates it's a hash
my $browser = LWP::UserAgent->new();
$browser->timeout(5);

while (@urls) {
  my $url = shift @urls;

  # Skip this URL and go on to the next one if we've
  # seen it before
  next if $visited{$url};

  my $request  = HTTP::Request->new(GET => $url);
  my $response = $browser->request($request);

  # No real need to invoke printf if we're not doing
  # any formatting
  if ($response->is_error()) {
    print $response->status_line, "\n";
  }
  my $contents = $response->content();

  # Now that we've got the url's content, mark it as
  # visited
  $visited{$url} = 1;

  my $page_parser = HTML::LinkExtor->new(undef, $url);
  $page_parser->parse($contents)->eof;
  my @links = $page_parser->links;

  foreach my $link (@links) {
    print "$$link[2]\n";
    push @urls, $$link[2];
  }
  sleep 60;
}
I want to fetch URLs that contain a certain term in the URL string, for example:


    Code:
    "http://www.foo.com/bar"

in other words: I need to fetch all the URLs that contain the term "/bar", and then strip the "bar" part so that only the base URL, foo.com, remains.


    is this doable?

    love to hear from you
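A minimal sketch of that filter, assuming the link list shape that the crawler's foreach loop above sees (the URLs here are made-up placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical links, shaped like HTML::LinkExtor's output:
# [ tag, attribute name, URL ]
my @links = (
    [ 'a', 'href', 'http://www.foo.com/bar' ],
    [ 'a', 'href', 'http://www.example.com/baz' ],
    [ 'a', 'href', 'http://www.foo.com/bar/page2' ],
);

foreach my $link (@links) {
    my $url = $$link[2];

    # keep only URLs whose path contains "/bar"
    next unless $url =~ m{/bar};

    # strip "/bar" and everything after it, leaving the site root
    (my $base = $url) =~ s{/bar.*$}{};
    print "$base\n";
}
```

The `(my $base = $url) =~ s{...}{}` idiom copies the URL first, so the original stays intact if the loop still needs it.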
#2
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2017
    Posts
    7
    Rep Power
    0
hello dear all,

I want to fetch URLs that contain a certain term in the URL string:


    Code:
    "http://www.foo.com/bar"

well, this could be the regex

Code:
$url =~ s|/bar$||;
Note that writing "my $url =~ s|...|" would create a new, undefined $url.
What we want is to modify the existing $url, so the "my" has to be dropped.
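As a small runnable sketch of the corrected substitution (the URL is just the example from above):

```perl
use strict;
use warnings;

my $url = "http://www.foo.com/bar";

# no "my" here: modify the existing $url in place
$url =~ s|/bar$||;

print "$url\n";    # prints: http://www.foo.com
```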


#3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2017
    Location
    Minnesota, USA
    Posts
    13
    Rep Power
    0
my $url = "http://www.foo.com/bar";

if ( $url =~ m/(.+)\/bar/ ) {
    print $1;
}
#4
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2007
    Location
    US
    Posts
    18
    Rep Power
    0
    That is the Swiss Army chainsaw of scripting languages!
#5
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2017
    Posts
    7
    Rep Power
    0
good day gaming, good day rpural,


many thanks for your quick replies.


You're right: that is great for parsing.

I am at the "fetching" part right now.


first: fetching
then: parsing
finally: writing the results to a file or
storing the data in a db

well - the fetching is the first task that I have to manage.
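Those three steps could be sketched roughly like this, reusing the modules from the first post (the start URL and output file name are placeholders; storing in a DB would replace the file-write step):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $start_url = 'http://www.example.com/';   # placeholder start URL

# 1. Fetching
my $ua = LWP::UserAgent->new( timeout => 5 );
my $response = $ua->get($start_url);
die $response->status_line unless $response->is_success;

# 2. Parsing: collect the absolute link URLs from the page
my $extor = HTML::LinkExtor->new( undef, $start_url );
$extor->parse( $response->content )->eof;
my @links = map { $_->[2] } $extor->links;

# 3. Storing: append the results to a file
open my $fh, '>>', 'links.txt' or die "Cannot open links.txt: $!";
print {$fh} "$_\n" for @links;
close $fh;
```

Passing the base URL to `HTML::LinkExtor->new` makes it resolve relative links into absolute ones, which is what you want before storing them.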
#6
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2017
    Location
    Minnesota, USA
    Posts
    13
    Rep Power
    0
    Originally Posted by gibraltar
You're right: that is great for parsing.

I am at the "fetching" part right now.

first: fetching
then: parsing
finally: writing the results to a file or
storing the data in a db

well - the fetching is the first task that I have to manage.
    I'm not sure what your question is, then. My thought was to just add the match regex into your fetch logic to cull down the matched URLs to ones that met the specific criteria you were looking for.
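Concretely, that culling could look something like this, applied to the list of extracted URLs (the "/bar" pattern is just the example term from earlier in the thread):

```perl
use strict;
use warnings;

# Hypothetical URLs as the link extractor might return them
my @extracted = (
    'http://www.foo.com/bar',
    'http://www.foo.com/other',
    'http://www.foo.com/bar?id=1',
);

# Cull the list down to the URLs matching the criterion
my @wanted = grep { m{/bar} } @extracted;

print "$_\n" for @wanted;
```

Inside the crawler itself, the same `grep` (or a `next unless` in the foreach loop) would go right before the `push @urls` line.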
