Page 2 of 2 First 12
  • Jump to page:
    #16
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    364
    Rep Power
    0
    Hello Keath,

    many thanks for the reply. What I see is very convincing, and I will dig into it.


    Originally Posted by keath
    This is really preliminary. It just grabs the basic text from the threads and doesn't handle the quoted text right yet. I don't think that would be hard to fix. There are many parsing approaches that can be taken in perl, I just don't have more time tonight.
    You obviously also have to set up a database to capture information you want to store. Additionally, I just looped over the first index page, I didn't set up a loop to grab each of the index pages but I consider that trivial.

    Continue with perl, or use some other language. There will not be a ready made product to take exactly what you want from the web. You will have to make a little effort no matter what method you use.
    Your demonstration is very impressive, and it makes me think that Perl is very, very powerful. I will try to harvest these two categories of the forum (only these two are of interest to me, nothing more):

    http://www.nukeforums.com/fo...wforum.php?f=3

    http://www.nukeforums.com/fo...forum.php?f=17



    I will come back here and let you know how I get on.


    Keath, you write:
    There are many parsing approaches that can be taken in perl, I just don't have more time tonight.
    I guess that I have to dig into the Perl techniques. If there is anything I should know for doing the job, please let me know.
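
    One possible way to treat quoted text separately, as a rough sketch only. The `name` and `postbody` classes come from the script in this thread; the sample HTML and the `quote` class on the table cell are assumptions about how the board marks up quotes, so check the real page source first.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: one post whose body is interrupted by a quote box.
my $html = <<'HTML';
<span class="name">sail</span>
<span class="postbody">I agree with this:</span>
<table><tr><td class="quote">It's called glance_mod.</td></tr></table>
<span class="postbody">Thanks for the pointer.</span>
HTML

# Walk the span and td start tags and keep quotes apart from normal text.
sub parse_post {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot parse document";
    my ($name, @parts);
    while (my $tag = $p->get_tag('span', 'td')) {
        my $class = $tag->[1]{class} or next;    # skip tags without a class
        if ($tag->[0] eq 'span' && $class eq 'name') {
            $name = $p->get_trimmed_text('/span');
        }
        elsif ($tag->[0] eq 'span' && $class eq 'postbody') {
            push @parts, { type => 'text', text => $p->get_trimmed_text('/span') };
        }
        elsif ($tag->[0] eq 'td' && $class eq 'quote') {
            push @parts, { type => 'quote', text => $p->get_trimmed_text('/td') };
        }
    }
    return { name => $name, parts => \@parts };
}

my $post = parse_post($html);
print "$post->{name}: [$_->{type}] $_->{text}\n" for @{ $post->{parts} };
```

    Keeping the pieces in order with a type label means the text and the quotes can be stored or joined again later, whichever the database layout needs.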


    Looking forward to it ;-)


    Greetings,
    metabo.



    Again, many thanks for any and all help, for the advice already given and for all you have done so far.
  2. #17
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    364
    Rep Power
    0

    recursively running this code against the bulletin board


    Hello Keath, super reply.

    This is obviously a great idea.

    Now my question is: can I apply the code to part of the board, in order to get a "copy" of the board with category 17 and category 3?

    http://www.nukeforums.com/fo...forum.php?f=17

    http://www.nukeforums.com/fo...wforum.php?f=3

    Keath (and all the other readers here), I look forward to hearing from you.


    Originally Posted by keath
    I don't use python. I can easily imagine that the parser is quite good.

    Probably no different than what is available in perl. Let's talk about perl since this is a perl forum:
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use HTML::TokeParser;
    use URI::URL;
    
    use Data::Dumper; # for show and troubleshooting
    
    my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
    my $ua = LWP::RobotUA->new;
    my $lp = HTML::LinkExtor->new(\&wanted_links);
    
    my @links;
    get_threads($url);
    
    foreach my $page (@links) { # this loops over each link collected from the index
    	my $r = $ua->get($page);
    	if ($r->is_success) {
    		my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
    		# just printing what was collected
    		print Dumper get_thread($stream);
    		# would instead have database insert statement at this point
    	 } else {
    		warn $r->status_line;
    	 }
    }
    
    sub get_thread {
    	my $p = shift;
    	my ($title, $name, @thread);
    	while (my $tag = $p->get_tag('a','span')) {
    		if (exists $tag->[1]{'class'}) {
    			if ($tag->[0] eq 'span') {
    				if ($tag->[1]{'class'} eq 'name') {
    					$name = $p->get_trimmed_text('/span');
    				} elsif ($tag->[1]{'class'} eq 'postbody') {
    					my $post = $p->get_trimmed_text('/span');
    					push @thread, {'name'=>$name, 'post'=>$post};
    				}
    			} else {
    				if ($tag->[1]{'class'} eq 'maintitle') {
    					$title = $p->get_trimmed_text('/a');
    				}
    			}
    		}
    	}
    	return {'title'=>$title, 'thread'=>\@thread};
    }
    
    sub get_threads {
    	my $page = shift;
    	my $r = $ua->request(HTTP::Request->new(GET => $url), sub {$lp->parse($_[0])});
    	# Expand URLs to absolute ones
    	my $base = $r->base;
    	return [map { $_ = url($_, $base)->abs; } @links];
    }
    
    sub wanted_links {
    	my($tag, %attr) = @_;
    	return unless exists $attr{'href'};
    	return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    	push @links, values %attr;
    }
    If you have the necessary modules installed, and run it from the command line you'll see output such as the following:
    Code:
    $VAR1 = {
              'thread' => [
                            {
                              'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
                              'name' => 'mopho'
                            },
                            {
                              'post' => 'hi there',
                              'name' => 'sail'
                            },
                            {
                              'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
                              'name' => 'sail'
                            },
                            {
                              'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
                              'name' => 'mopho'
                            },
                            {
                              'post' => 'hi there thx',
                              'name' => 'sail'
                            },
                            {
                              'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
                              'name' => 'sail'
                            }
                          ],
              'title' => 'Recent Forum Posts Module'
            };
    This is really preliminary. It just grabs the basic text from the threads and doesn't handle the quoted text right yet. I don't think that would be hard to fix. There are many parsing approaches that can be taken in perl, I just don't have more time tonight.

    You obviously also have to set up a database to capture information you want to store.

    Additionally, I just looped over the first index page, I didn't set up a loop to grab each of the index pages but I consider that trivial.

    Continue with perl, or use some other language. There will not be a ready made product to take exactly what you want from the web. You will have to make a little effort no matter what method you use.
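
    The loop over the index pages that keath calls trivial might look roughly like this. It is only a sketch: it assumes the board pages its topic list with a `start` offset, as phpBB-style boards usually do, and the page size of 50 and the page count of 3 are made-up values that have to be checked against the real board.

```perl
use strict;
use warnings;

# Build the URLs of the first $pages index pages of one forum.
# Assumption: the board accepts "start" as a topic offset.
sub index_page_urls {
    my ($base, $forum_id, $pages, $per_page) = @_;
    my @urls;
    for my $page (0 .. $pages - 1) {
        push @urls, sprintf '%s?f=%d&start=%d', $base, $forum_id, $page * $per_page;
    }
    return @urls;
}

my $base = 'http://www.nukeforums.com/forums/viewforum.php';
for my $forum_id (17, 3) {
    print "$_\n" for index_page_urls($base, $forum_id, 3, 50);
}
```

    Each of these URLs would then be passed to get_threads in place of the single index URL.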


    Again Keath, this is a super reply, and obviously a great idea. Now my question is: can I apply the code to part of the board, in order to get a "copy" of the board with category 17 and category 3?



    http://www.nukeforums.com/fo...forum.php?f=17

    http://www.nukeforums.com/fo...wforum.php?f=3

    Can this be done with the code written above?






    Well, I am very happy.

    The demonstration is very impressive, and it makes me think that Perl is very, very powerful.
    I will try to harvest these two categories of the forum (only these two are of
    interest to me, nothing more):

    http://www.nukeforums.com/fo...wforum.php?f=3

    http://www.nukeforums.com/fo...forum.php?f=17



    I want to discuss a small change here. The minimal change consists of changing



    Code:
    my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
    my $ua = LWP::RobotUA->new;
    my $lp = HTML::LinkExtor->new(\&wanted_links);
    
    my @links;
    get_threads($url);
    
    foreach my $page (@links) {
        ...
    }
    to


    Code:
    my $ua = LWP::RobotUA->new;
    my $lp = HTML::LinkExtor->new(\&wanted_links);
    
    my @links;
    
    foreach my $forum_id (17, 3) {
        my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";
        @links = ();  # yuck!
        my $links = get_threads($url);
        foreach my $page (@$links) {
            ...
        }
    }


    As this shows, the problem is the use of the global variable @links.
    We're forced to provide and reset a variable that should be local to get_threads. Here's the fix:


    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use HTML::TokeParser;
    use URI::URL;
    
    use Data::Dumper; # for show and troubleshooting
    
    my $ua = LWP::RobotUA->new();
    
    foreach my $forum_id (17, 3) {
        my $url = "http://www.nukeforums.com/forums/viewforum.php?f=$forum_id";
        my $links = get_threads($url);
        foreach my $page (@$links) {
            ...
        }
    }
    
    sub get_thread {
        ...
    }
    
    sub get_threads {
        my $page = shift;
    
        # Collect links in a lexical array so each call starts with an empty list
        my @links;
        my $lp = HTML::LinkExtor->new(sub {
            my($tag, %attr) = @_;
            return unless exists $attr{'href'};
            return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
            push @links, values %attr;
        });
    
        # Fetch the index page that was passed in; $url is not visible inside this sub
        my $request = HTTP::Request->new(GET => $page);
        my $response = $ua->request($request, sub {$lp->parse($_[0])});
    
        # Expand URLs to absolute ones
        my $base = $response->base;
        return [ map { url($_, $base)->abs } @links ];
    }
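
    The idea behind this change can be tried without any network access. The sketch below is not the thread's script: `make_link_collector` is a made-up helper and the hrefs are invented, but the callback takes the same `($tag, %attr)` arguments that HTML::LinkExtor passes to its handler. Each call gets its own private array, so nothing has to be reset between forums.

```perl
use strict;
use warnings;

# Return a link callback together with the private array it fills.
sub make_link_collector {
    my ($pattern) = @_;
    my @found;                       # lexical: a fresh array for every call
    my $callback = sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        return unless $attr{href} =~ $pattern;
        push @found, $attr{href};    # store the href explicitly
    };
    return ($callback, \@found);
}

my ($cb_a, $links_a) = make_link_collector(qr/^viewtopic\.php\?t=/);
my ($cb_b, $links_b) = make_link_collector(qr/^viewtopic\.php\?t=/);

# Simulated parser events with hypothetical hrefs.
$cb_a->('a', href => 'viewtopic.php?t=101');
$cb_a->('a', href => 'index.php');
$cb_b->('a', href => 'viewtopic.php?t=202');

print "first forum:  @$links_a\n";
print "second forum: @$links_b\n";
```

    The two collectors never see each other's links, which is what makes the `@links = ();` reset unnecessary.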


    Discussion:

    With these changes I am able to run the code against both full categories.


    http://www.nukeforums.com/fo...wforum.php?f=3

    http://www.nukeforums.com/fo...forum.php?f=17


    Question: will I get the results for the forum categories mentioned above, and can I get
    the forum threads that are stored in those two forums?

    I would love to hear from you. Keath (and all the other readers here), I look forward to hearing from you.


    regards
    metabo
  4. #18
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    364
    Rep Power
    0
    Hello Keath,

    many thanks for the posting. You like parsing, I see, and you provided a great solution. I love it.


    Originally Posted by keath
    I don't use python. I can easily imagine that the parser is quite good. Probably no different than what is available in perl. Let's talk about perl since this is a perl forum:
    This is really preliminary. It just grabs the basic text from the threads and doesn't handle the quoted text right yet. I don't think that would be hard to fix. There are many parsing approaches that can be taken in perl, I just don't have more time tonight.
    Additionally, I just looped over the first index page, I didn't set up a loop to grab each of the index pages but I consider that trivial.
    I had a look, and I guess this will help us run it the way we want: if we want to loop over the URLs, we could either run the spider multiple times or put a 'foreach' loop around the main body of our program. What do you think about the following code?

    Code:
    my @urls = ("http://www.example.com/first.html",
    "http://www.example.com/second.html");
    
    foreach my $url (@urls) {
    # main code
    }
    What do you think about the code above? Does it fit the needs here? We would wrap your code (see posting 15) in that loop.
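
    A rough sketch of how such a wrapper could look if the main body is moved into a sub. `harvest_forum` is a made-up stand-in that only pretends to fetch and parse; it fails on purpose for one URL to show that an `eval` around each call lets the loop continue with the remaining URLs.

```perl
use strict;
use warnings;

# Stand-in for the real work (fetch the index, follow thread links, parse posts).
# Hypothetical: it fails for one URL so the error handling can be shown.
sub harvest_forum {
    my ($url) = @_;
    die "could not fetch $url\n" if $url =~ /second/;
    return "harvested $url";
}

my @urls = (
    "http://www.example.com/first.html",
    "http://www.example.com/second.html",
);

my (@done, @failed);
foreach my $url (@urls) {
    my $result = eval { harvest_forum($url) };
    if (defined $result) {
        push @done, $result;
    }
    else {
        warn "skipping: $@";
        push @failed, $url;
    }
}
printf "%d ok, %d failed\n", scalar @done, scalar @failed;
```

    Running the spider several times by hand would work too; the loop simply keeps everything in one run.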

    You obviously also have to set up a database to capture information you want to store.
    Yes, that is another story. That could be the next thing where I need to figure out how to get it solved.
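
    As a first step towards storing the data, the nested structure that get_thread returns could be flattened into one row per post. This is only a sketch: the column order and the `thread_rows` helper are assumptions, and the sample data is a shortened copy of the Dumper output shown earlier in the thread.

```perl
use strict;
use warnings;

# Same shape as the structure returned by get_thread, shortened.
my $thread = {
    title  => 'Recent Forum Posts Module',
    thread => [
        { name => 'sail', post => 'hi there' },
        { name => 'sail', post => 'hi there thx' },
    ],
};

# One row per post: forum id, thread title, position in thread, author, text.
sub thread_rows {
    my ($forum_id, $data) = @_;
    my $position = 0;
    my @rows;
    for my $entry (@{ $data->{thread} }) {
        push @rows, [ $forum_id, $data->{title}, ++$position, $entry->{name}, $entry->{post} ];
    }
    return @rows;
}

my @rows = thread_rows(17, $thread);
print join(' | ', @$_), "\n" for @rows;
```

    Each row could then be handed to whatever insert statement the chosen database layer provides.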


    cheers
    metabo
  6. #19
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    364
    Rep Power
    0



    Hello keath,

    close, but no cigar.

    I have tested the script here (with some of my friends) and we ran into serious trouble, since the user agent does not work the way we thought it would.

    Hmm, there are some confusing things. Keath, are you sure that the code runs as described?

    I look forward to hearing from you.

    cheers
    metabo

    Originally Posted by keath
    I don't use python. I can easily imagine that the parser is quite good.

    Probably no different than what is available in perl. Let's talk about perl since this is a perl forum:
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use LWP::RobotUA;
    use HTML::LinkExtor;
    use HTML::TokeParser;
    use URI::URL;
    
    use Data::Dumper; # for show and troubleshooting
    
    my $url = "http://www.nukeforums.com/forums/viewforum.php?f=17";
    my $ua = LWP::RobotUA->new;
    my $lp = HTML::LinkExtor->new(\&wanted_links);
    
    my @links;
    get_threads($url);
    
    foreach my $page (@links) { # this loops over each link collected from the index
    	my $r = $ua->get($page);
    	if ($r->is_success) {
    		my $stream = HTML::TokeParser->new(\$r->content) or die "Parse error in $page: $!";
    		# just printing what was collected
    		print Dumper get_thread($stream);
    		# would instead have database insert statement at this point
    	 } else {
    		warn $r->status_line;
    	 }
    }
    
    sub get_thread {
    	my $p = shift;
    	my ($title, $name, @thread);
    	while (my $tag = $p->get_tag('a','span')) {
    		if (exists $tag->[1]{'class'}) {
    			if ($tag->[0] eq 'span') {
    				if ($tag->[1]{'class'} eq 'name') {
    					$name = $p->get_trimmed_text('/span');
    				} elsif ($tag->[1]{'class'} eq 'postbody') {
    					my $post = $p->get_trimmed_text('/span');
    					push @thread, {'name'=>$name, 'post'=>$post};
    				}
    			} else {
    				if ($tag->[1]{'class'} eq 'maintitle') {
    					$title = $p->get_trimmed_text('/a');
    				}
    			}
    		}
    	}
    	return {'title'=>$title, 'thread'=>\@thread};
    }
    
    sub get_threads {
    	my $page = shift;
    	my $r = $ua->request(HTTP::Request->new(GET => $url), sub {$lp->parse($_[0])});
    	# Expand URLs to absolute ones
    	my $base = $r->base;
    	return [map { $_ = url($_, $base)->abs; } @links];
    }
    
    sub wanted_links {
    	my($tag, %attr) = @_;
    	return unless exists $attr{'href'};
    	return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    	push @links, values %attr;
    }

    Hmm, are you sure that this code runs?

    Well, I look forward to hearing from you.

    cheers
    metabo
  8. #20
  9. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,274
    Rep Power
    0
    It appears the script worked for you in your earlier posts, and now it doesn't. That's because you went from what I initially posted to what I edited it to later.

    Post 15 in this thread, at the bottom it says:
    Last edited by keath : August 22nd, 2006 at 03:28 AM. Reason: seemed to be using the wrong UA
    After posting the working script, I went into work and realized I had been using LWP::UserAgent instead of the RobotUA I had intended. I edited the entry but got it wrong. When using RobotUA, it is necessary to identify your robot. The agent and from identifiers are required.
    Code:
    my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');
    $ua->delay(20/60);
    As in the RobotUA documentation.

    The default robot delay is 1 minute, so you'll want to set it to less (I used 20 seconds here, as the site's robots.txt file appears to suggest).
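
    A small check of both points, as a sketch that makes no network requests. It assumes LWP::RobotUA is installed; the agent name and address are the placeholders from the snippet above and should be replaced with real values.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Without an agent name and a contact address the constructor refuses to run.
my $ok = eval { LWP::RobotUA->new; 1 };
print "bare new() failed: $@" unless $ok;

# With both identifiers the robot user agent is created.
my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');

# delay() is measured in minutes, so 20/60 means 20 seconds between requests.
$ua->delay(20 / 60);
printf "delay is %.0f seconds\n", $ua->delay() * 60;
```

    Because the delay is given in minutes, passing 20 directly would make the robot wait 20 minutes between requests to the same server.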
  10. #21
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2004
    Posts
    364
    Rep Power
    0



    Hello Keath, many thanks. I am happy to hear from you.

    First, a big thank you for the corrections, and
    second, one question about a loop:


    Originally Posted by keath
    It appears the script worked for you in your earlier posts, and now it doesn't. That's because you went from what I initially posted to what I edited it to later.
    Right, I remember that you told me something about corrections and some fixes. But I had forgotten the details, and we (a friend of mine and I) had some issues with the original code of posting 15.


    Post 15 in this thread, at the bottom it says: After posting the working script, I went into work and realized I had been using LWP::UserAgent instead of the RobotUA I had intended. I edited the entry but got it wrong. When using RobotUA, it is necessary to identify your robot. The agent and from identifiers are required.

    Code:
    my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');
    $ua->delay(20/60);
    As in the RobotUA documentation.

    The default robot delay is 1 minute, so you'll want to set it to less (I used 20 seconds here, as the site robots.txt file appears to suggest).
    Many thanks, that is a great posting. It helps me a lot and shows me how to work with RobotUA.

    Thanks for replying and for the code here:

    Code:
    my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');
    This works for us here.


    Second part: a question

    I had a look, and I guess this will help us run it the way we want: if we want to loop over the URLs, we could either run the spider multiple times or put a 'foreach' loop around the main body of our program. What do you think about the following code?

    Code:
    my @urls = ("http://www.example.com/first.html",
    "http://www.example.com/second.html");
    
    foreach my $url (@urls) {
    # main code
    }
    What do you think about this, and can I use this code in addition?

    Thanks for a short answer!


    metabo