#1
Does anyone know how I can use Perl to get the HTML source for a given URL?
For example, I want to get the HTML line by line from, say, http://www.yahoo.com and print it out to the screen.
#2
Check out the LWP module; it works great for this type of application. Once you have gathered the data from the remote web site, it's then a matter of extracting the data or strings you want out of the HTML with a regex.
Cheers, Mickalo
__________________
Thunder Rain Internet Publishing Custom Programming & Database development Providing Personal/Business Internet Solutions that work!
#3
Yeah, I had luck doing it using LWP, but the main box I am using does not have the module and I don't have root permission.
Any other ideas?
#4
Download the module from cpan.org (you can search for it at http://search.cpan.org) and just upload it to your directory. Then in your script, add this:
Code:
push(@INC, '.');
use LWP::Simple;
Make sure to keep the LWP directory structure intact. Also notice that the push() comes before the use/require.
#5
Err, thank you guys for the help. I had tried using push(@INC, '.'), but I get an error listing what @INC contains, and my directory is not in it. I even tried typing in my complete directory path. Any ideas?
#6
Try using:
Code:
BEGIN {
    unshift(@INC, "/path/to/folder");
}
use LWP::Simple;
Then put the module in the path specified above.
Mickalo
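A note on why the BEGIN block helps: `use` is processed at compile time, so an ordinary push(@INC, ...) statement runs too late even when it is written earlier in the file. Here is a minimal sketch of that ordering, assuming the placeholder path from the post above. The `use lib` pragma is the standard shorthand for doing the same unshift at compile time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record the order in which things actually run.
our @order;

# An ordinary statement: runs at run time, after the whole file has compiled.
push @order, 'runtime statement';

# A BEGIN block: runs as soon as the parser reaches it, at compile time.
BEGIN { push @order, 'BEGIN block' }

# 'use lib' prepends a directory to @INC at compile time, like
# BEGIN { unshift @INC, ... }. The path is the placeholder from the thread.
use lib '/path/to/folder';

print "$_\n" for @order;
print "search path starts with: $INC[0]\n";
```

Although the push is written first, the BEGIN block is recorded first, which is why a run-time push(@INC, ...) cannot affect a `use` line that has already been compiled.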
#7
You shouldn't need root in order to get the source for the page. Using LWP::UserAgent together with HTTP::Request and HTTP::Response will get the job done.
Pretty much what happens here is that you write an HTTP request that makes the remote server think you are a browser making a valid GET request, so it returns the source. Take a look:
Code:
#!/usr/bin/perl
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use URI::Heuristic;
use strict;
my $raw_url = "http://www.oreilly.com"; # URL of site you want source for
#my $url = URI::Heuristic::uf_urlstr($raw_url); # expands partial URLs
# only needed for urls
# like 'www.oreilly.com'
# not needed here
my $ua = LWP::UserAgent->new(); # Creates a virtual browser
$ua->agent("Mozilla/4.0"); # the browser type
my $req = HTTP::Request->new( GET => $raw_url ); # creates GET request but does not send
$req->referer("http://hi.there.com"); # obviously not a real URL - part of the HTTP headers
my $res = $ua->request($req); # request is sent
my $src = $res->content(); # get the source and assign it to a variable; normally you'd
# check for errors first with the 'is_error' function, like so:
# '$res->is_error()'
# ($res is your response object)
print $src, "\n"; # print the source code
This worked on the dev server at work, and I don't have root anywhere, much less on http://www.oreilly.com. Hope that helps.
__________________
- dsb - Perl Guy
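The script above mentions checking for errors but leaves the check out. Here is a minimal sketch of one way to add it, assuming LWP is installed. The name fetch_source is a hypothetical helper for illustration and not part of LWP. The calls is_success, status_line and decoded_content are standard response methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_source is a hypothetical helper name, not part of LWP.
# It returns the page body on success and dies with the HTTP
# status line (for example "404 Not Found") on failure.
sub fetch_source {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new( timeout => 10 );
    $ua->agent("Mozilla/4.0");    # same agent string as in the post above
    my $res = $ua->get($url);     # builds and sends the GET request
    die "Could not fetch $url: " . $res->status_line . "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# Fetch only when a URL is given on the command line.
if (@ARGV) {
    print fetch_source( $ARGV[0] ), "\n";
}
```

Run it with a URL as the only argument, for example the http://www.oreilly.com address used in the post above. A failed request then stops the script with the status line, so it never prints an empty page.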
#8
Have you been paying attention to what we've been saying :P? I'm guessing not, since we're trying to get the LWP module working for him.
I'll be more specific with my answer from above (since it was right in the first place).
Code:
push(@INC, '.');
use LWP::Simple;
@data = get('http://www.whatever.com');
print @data;
Now to make this work, you need to download the module from cpan.org, un-tar/gz it, and then upload it to the directory your script is in. OK -- this is the important part -- your directory structure should now look something like this:
./     <-- main directory, with the script trying to run.
./LWP  <-- directory containing ALL the LWP stuff.
If you still get the @INC error, that means that your structure is still not right. Mess around with it.
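If the @INC error persists, it can help to check where Perl would actually find the module. Here is a minimal sketch using only core modules. The name find_module is a hypothetical helper for illustration, and the check assumes the layout described above, with LWP/Simple.pm sitting under one of the searched directories.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# find_module is a hypothetical helper: it turns a module name such as
# "LWP::Simple" into a relative path ("LWP/Simple.pm") and returns every
# directory from the given list that contains that file.
sub find_module {
    my ($module, @dirs) = @_;
    my $rel = File::Spec->catfile( split /::/, $module ) . '.pm';
    # Skip code references, which @INC may hold as import hooks.
    return grep { !ref($_) && -e File::Spec->catfile( $_, $rel ) } @dirs;
}

# Look in the script's current directory first, then in the normal path.
my @hits = find_module( 'LWP::Simple', '.', @INC );
my $message = @hits
    ? "LWP::Simple found in: @hits"
    : "LWP::Simple not found; check that ./LWP/Simple.pm exists";
print "$message\n";
```

An empty result means the uploaded directory structure does not match what `use LWP::Simple` expects.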
#9
Settle down there, tiger... and apparently your way isn't working for him. Besides... TMTOWTDI...
BTW, in the docs for LWP::Simple, it suggests using UserAgent for more control over what's going on.
[Edited by dsb on 03-01-2001 at 01:48 PM]
#10
Thanks for the help, all. I just got the sysadmin to install LWP. Thank you all for the posts.
#11
If my method wasn't working, it's because he wasn't doing it right. _Your_ method used the exact same module (set) as mine... which is what I was trying to show him how to install.
And yes, ::UserAgent will provide more control, but if a person doesn't know how to get ::Simple working, then why would they know how to use the UserAgent method? Besides, I didn't mean what I said before in a negative way, so you don't need to tell me to settle down, tiger.
#12
JonLed,
There is no doubt you know what you are talking about. I didn't realize he didn't have the library installed. I was merely suggesting a different method, one by which he might be able to understand how the HTTP protocol functions a little better. But, of course, if he doesn't have the library installed, nothing is going to work. Didn't mean to step on anyone's toes. Humble apologies.
Viewing: Dev Shed Forums > Programming Languages > Perl Programming > get html source