ColdFusion Development
#1 | samb1 | December 22nd, 2004, 09:45 AM
Recursive GET request?

Guys,
My program performs a GET request on a web page and retrieves the href links to jobs, ignoring all other links. What I now need to do is perform a recursive call on each of the links I have retrieved, like a web crawler/spider. Is this possible? At the moment I'm at a bit of a dead end and need some help. Any responses much appreciated.

#2 | kiteless | December 22nd, 2004, 09:57 AM
Sure it's possible; this is what recursive functions are for. In pseudocode:

function recurseCrawlLinks( urlToCrawl ) {
    execute http call to arguments.urlToCrawl;
    parse links found in resulting http content;
    for each link found in the http content, call this function again {
        recurseCrawlLinks( new link found in http content );
    }
}
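
The pseudocode above could be sketched in tag-based CFML roughly as follows. This is only a sketch under some assumptions: the helper `extractLinks`, the `visited` struct, and the `depth`/`maxDepth` arguments are not from the thread and are added here so the recursion has a stopping condition. It also assumes a ColdFusion version that supports the `result` attribute of `cfhttp`, and it only follows absolute `http`/`https` links.

```cfml
<!--- Hypothetical helper: returns the absolute http(s) URLs found in href attributes --->
<cffunction name="extractLinks" returntype="array" output="false">
    <cfargument name="html" type="string" required="true">
    <cfset var links = arrayNew(1)>
    <cfset var pos = 1>
    <cfset var match = "">
    <cfloop condition="true">
        <!--- Matches href="..." or href='...'; the subexpression captures the URL --->
        <cfset match = reFindNoCase("href\s*=\s*[""'](https?://[^""']+)[""']", arguments.html, pos, true)>
        <cfif match.pos[1] EQ 0><cfbreak></cfif>
        <cfset arrayAppend(links, mid(arguments.html, match.pos[2], match.len[2]))>
        <cfset pos = match.pos[1] + match.len[1]>
    </cfloop>
    <cfreturn links>
</cffunction>

<cffunction name="recurseCrawlLinks" returntype="void" output="false">
    <cfargument name="urlToCrawl" type="string" required="true">
    <!--- Structs are passed by reference, so every recursive call shares this one --->
    <cfargument name="visited" type="struct" required="true">
    <cfargument name="depth" type="numeric" default="0">
    <cfargument name="maxDepth" type="numeric" default="2">
    <cfset var links = "">
    <cfset var i = 0>
    <cfset var response = "">

    <!--- Only crawl pages not seen before and not deeper than maxDepth --->
    <cfif NOT structKeyExists(arguments.visited, arguments.urlToCrawl) AND arguments.depth LTE arguments.maxDepth>
        <cfset arguments.visited[arguments.urlToCrawl] = true>
        <cfhttp url="#arguments.urlToCrawl#" method="get" timeout="10" result="response">
        <cfset links = extractLinks(response.fileContent)>
        <cfloop from="1" to="#arrayLen(links)#" index="i">
            <cfset recurseCrawlLinks(links[i], arguments.visited, arguments.depth + 1, arguments.maxDepth)>
        </cfloop>
    </cfif>
</cffunction>

<!--- Usage: crawl from one index page and report how many pages were visited --->
<cfset visited = structNew()>
<cfset recurseCrawlLinks("http://www.rspb.org.uk/vacancies/index.asp", visited)>
<cfoutput>#structCount(visited)# pages crawled</cfoutput>
```

Without the `visited` check, two pages that link to each other would make the function call itself until the request times out. Struct keys are case-insensitive, so URLs differing only in case are treated as the same page here.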

#3 | samb1 | December 23rd, 2004, 03:17 PM
Kiteless, thanks for the reply, but I have changed my approach slightly. I wonder if you can help?
I have now decided to create a CF template for each site to scrape. The template consists of a list of parameters, such as the URL to scrape, the text that a link to a job always contains, and the HTML that surrounds each element we are looking for. The parameter values are different for every site.
I then want to pass these values to my main code (which is at the start of the thread), where they can be used as parameters for my web crawling.
How can I pass these values across one at a time?
Here is my template for one of the sites:
<!---
Filename: RSPBTemplate.cfm
Purpose: A template for scraping the RSPB.org
--->

<!--- Post the parameter data for crawler --->
<cfhttp method="post" url="http://localhost:8500/Project/RSPB.cfm">

<!--- Template Parameters --->
<cfhttpparam name="URLToScrape" type="URL" value="http://www.rspb.org.uk/vacancies/index.asp">
<cfhttpparam name="LinkText" type="URL" value="http://www.rspb.org.uk/vacancies/index.asp/?id=">
<cfparam name="Title" type="string" default="<h2 id=vol-title>$Title$</h2>">
<cfparam name="Location" type="string" default="<p id=location>">
<cfparam name="Salary" type="string" default="<h3>Salary</h3>">
<cfparam name="HoursContactInfo" type="string" default="<h3>Hours & contract information</h3>">
<cfparam name="ClosingDate" type="string" default="<h3>Closing date & interviews</h3>Closing date:$ClosingDate$<br>Interview date:">
</cfhttp>
<cfoutput>
#cfhttp.FileContent#
</cfoutput>

#4 | kiteless | December 23rd, 2004, 03:42 PM
I'm still not really sure what you're trying to do. You could pass an array of links to your recursion function. The function would loop over the links in the array. Is that what you are asking about?

However you do it, it doesn't really change the fact that you'll want to write a recursive function that calls itself with new information (a URL to crawl for example). What happens when the http content is retrieved or what elements you want to parse from the resulting http content is a separate issue.

#5 | samb1 | December 27th, 2004, 01:41 PM
Sorry for the confusion, Kiteless. I'll try to explain.
I have three separate CF templates for three separate websites to scrape (example code is in my 2nd post). All the templates have a list of cfparam values that are relevant to that website.

I would then like to post the cfparam values to a separate CF file. That file then reads the values and performs a series of GET requests based on them, e.g. GET http://www.rspb.org.uk/vacancies/index.asp, then moves to the next cfparam and GETs http://www.rspb.org.uk/vacancies/index.asp/?id=

My question is: is it possible to send the template cfparam values to another CF file, and use the values to perform a sort of web crawl? If it is, where the hell do I start?

Sorry if I am frustrating you.

#6 | kiteless | December 27th, 2004, 01:54 PM
Yes, you can send variables when you call the CF template...IF the param values are for COOKIE, URL, or FORM variables. If they are in some other scope then you can't "send" them to the template you want to call with CFHTTP...you'll have to run a request that sets the values and then runs the template so that the variables are there for the template.
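
As an illustration of the point about scopes, here is a minimal sketch in which every template value is sent as a FORM field with `cfhttpparam type="formfield"`, rather than declared with `cfparam` inside the `cfhttp` tag (a `cfparam` there only sets a local variable in the calling page and sends nothing). The parameter names and values are taken from the template posted earlier; the assumption is that RSPB.cfm is the receiving page and that the `result` attribute of `cfhttp` is available.

```cfml
<!--- Calling page: each value travels in the POST body as a FORM field --->
<cfhttp method="post" url="http://localhost:8500/Project/RSPB.cfm" result="response">
    <cfhttpparam type="formfield" name="URLToScrape" value="http://www.rspb.org.uk/vacancies/index.asp">
    <cfhttpparam type="formfield" name="LinkText" value="http://www.rspb.org.uk/vacancies/index.asp/?id=">
    <cfhttpparam type="formfield" name="Salary" value="<h3>Salary</h3>">
</cfhttp>
<!--- Escape the response so any HTML it contains is shown as text --->
<cfoutput>#htmlEditFormat(response.fileContent)#</cfoutput>
```

```cfml
<!--- Receiving page (RSPB.cfm): the posted values arrive in the FORM scope --->
<!--- A cfparam with no default throws an error if the field was not sent --->
<cfparam name="form.URLToScrape" type="string">
<cfparam name="form.LinkText" type="string" default="">
<cfparam name="form.Salary" type="string" default="">

<cfhttp url="#form.URLToScrape#" method="get" timeout="10" result="indexPage">
<cfoutput>Fetched #len(indexPage.fileContent)# characters from #htmlEditFormat(form.URLToScrape)#</cfoutput>
```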

You're not frustrating me but I am still confused about what you are trying to do and whether or not there is simply an easier way to do it. Maybe if you step away from the code and just explain what you are trying to do in general terms it may shed some light. I mean I know you are trying to crawl some pages. How many pages? Just 3? Or is it a variable number of pages? Do you know the URLs of the pages you want to crawl in advance? Are they always the same URLs? Why do you want to crawl them? What part do these cfparam values play? etc...

#7 | samb1 | December 27th, 2004, 02:21 PM
OK, I'll start from the beginning. I am aiming to automatically take content (in this case job advertisement details) from websites and add it to a database. The database can then be free-text searched for job details.

I need to be able to provide a facility that will read the information required to scrape a site from a template. That way, for every new site I need to scrape, all I do is create a new template for the site.

At the moment I have three URLs to scrape, but this will increase. These are:
http://www.ccw.gov.uk/vacancies/index.cfm?Action=Vacancies&lang=en
http://www.english-nature.org.uk/news/jobs.asp
http://www.rspb.org.uk/vacancies/index.asp

Each of the above URLs is a job index page containing a list of job links, not the actual job info I need. The info I need can be accessed via the individual job links on the index pages (hence the need for a crawler).

My thinking was that I could pass the parameters in each template to a crawler. The crawler then looks for each of the parameters until it finds a match, then moves onto the next one.

#8 | kiteless | December 27th, 2004, 02:47 PM
OK that makes some sense. What "parameters" do you want to pass into the crawler? Is this how you are telling it what kind of link to look for?

#9 | samb1 | December 27th, 2004, 04:10 PM
Yes, I hoped to pass the URLToScrape parameter first, as this will find the job index page, e.g. http://www.rspb.org.uk/vacancies/index.asp.
Then, once the index page is found, the LinkText parameter will be used to look for URLs that point to the specific jobs, e.g. http://www.rspb.org.uk/vacancies/index.asp/?id=
From there, the HTML that surrounds the job title, salary, location etc. will be passed.
Am I still making sense?
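
The two matching steps described above (keep only links that contain the LinkText value, then pull out the text that sits between known pieces of HTML) might look something like this in CFML. The function names `filterLinks` and `extractBetween` are made up for this sketch, and the sample HTML and job title at the bottom are invented purely to show a call; the real RSPB markup may differ.

```cfml
<!--- Hypothetical helper: keeps only the links that contain linkText --->
<cffunction name="filterLinks" returntype="array" output="false">
    <cfargument name="links" type="array" required="true">
    <cfargument name="linkText" type="string" required="true">
    <cfset var kept = arrayNew(1)>
    <cfset var i = 0>
    <cfloop from="1" to="#arrayLen(arguments.links)#" index="i">
        <cfif findNoCase(arguments.linkText, arguments.links[i]) GT 0>
            <cfset arrayAppend(kept, arguments.links[i])>
        </cfif>
    </cfloop>
    <cfreturn kept>
</cffunction>

<!--- Hypothetical helper: returns the text between startMarker and endMarker, or "" --->
<cffunction name="extractBetween" returntype="string" output="false">
    <cfargument name="html" type="string" required="true">
    <cfargument name="startMarker" type="string" required="true">
    <cfargument name="endMarker" type="string" required="true">
    <cfset var startPos = findNoCase(arguments.startMarker, arguments.html)>
    <cfset var endPos = 0>
    <cfif startPos EQ 0><cfreturn ""></cfif>
    <!--- Move past the start marker to the first character of the value --->
    <cfset startPos = startPos + len(arguments.startMarker)>
    <cfset endPos = findNoCase(arguments.endMarker, arguments.html, startPos)>
    <!--- Covers both "end marker not found" and "nothing between the markers" --->
    <cfif endPos LTE startPos><cfreturn ""></cfif>
    <cfreturn trim(mid(arguments.html, startPos, endPos - startPos))>
</cffunction>

<!--- Usage with invented sample markup --->
<cfset sampleHtml = "<h2 id=vol-title>Reserve Warden</h2><p id=location>Sandy</p>">
<cfoutput>#extractBetween(sampleHtml, "<h2 id=vol-title>", "</h2>")#</cfoutput>
```

A template value such as `<h2 id=vol-title>$Title$</h2>` would first need to be split at the `$Title$` placeholder to obtain the two markers.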

#10 | kiteless | December 27th, 2004, 04:46 PM
It makes sense but it seems really awkward. Is there no way you can contact the job sites and have them create a text version of the job postings, or possibly a web service or XML/RSS feed to give you the postings?

The method you are considering to scrape the jobs is extremely brittle...any change to the link structure or HTML markup at any of the sites will break your application.

#11 | samb1 | December 29th, 2004, 09:57 AM
Yes, I know it's a bit brittle, but this is the way I need to do it. They don't have XML/RSS feeds, and I'll have to change the templates to look for different HTML if it changes.
Without XML/RSS or text versions, how would you approach this, taking into account how I have approached it so far?

#12 | kiteless | December 29th, 2004, 10:12 AM
First I would really push them to give you something, if nothing else just a text version of the content. This would only take them a few minutes to create if they are pulling the jobs from a database. If they are interested in doing business with you or having their jobs filled, I would think they would consider this. Failing that, ask them to embed something in the source code that would make it easier for you to identify the jobs (a special HTML comment, or even a tilde or something, maybe).

Other than that you don't have any choice but to try and scrape the pages as you have already noted. I'd create an array of structures that holds the root site URL as well as the format of the links being searched for each root site URL. Loop over that array and for each iteration of the loop execute your recursive HTTP request and build up an array of job postings as each site is recursed.
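
A rough sketch of that array-of-structures idea follows. The key names `rootUrl` and `linkFormat` and the variable names are assumptions made for illustration. Only the RSPB entry is filled in, because the thread gives no link format for the other two sites. To stay short, the sketch collects matching job links from each index page rather than recursing into them, and it assumes a ColdFusion version that has `reMatchNoCase` and the `result` attribute of `cfhttp`.

```cfml
<!--- One struct per site to crawl --->
<cfset sites = arrayNew(1)>

<cfset site = structNew()>
<cfset site.rootUrl = "http://www.rspb.org.uk/vacancies/index.asp">
<cfset site.linkFormat = "http://www.rspb.org.uk/vacancies/index.asp/?id=">
<cfset arrayAppend(sites, site)>
<!--- Further sites are appended the same way, each starting with its own structNew() --->

<cfset jobLinks = arrayNew(1)>
<cfloop from="1" to="#arrayLen(sites)#" index="s">
    <cfhttp url="#sites[s].rootUrl#" method="get" timeout="10" result="indexPage">

    <!--- reMatchNoCase returns every href="..." attribute found in the page --->
    <cfset hrefs = reMatchNoCase("href\s*=\s*[""'][^""']+[""']", indexPage.fileContent)>
    <cfloop from="1" to="#arrayLen(hrefs)#" index="h">
        <!--- Strip the leading href= and both quotes, leaving the bare URL --->
        <cfset link = reReplaceNoCase(hrefs[h], "^href\s*=\s*[""']|[""']$", "", "all")>
        <cfif findNoCase(sites[s].linkFormat, link) GT 0>
            <cfset job = structNew()>
            <cfset job.site = sites[s].rootUrl>
            <cfset job.link = link>
            <cfset arrayAppend(jobLinks, job)>
        </cfif>
    </cfloop>
</cfloop>

<cfoutput>#arrayLen(jobLinks)# job links found</cfoutput>
```

If an index page uses relative hrefs, a full-URL `linkFormat` like the one above will not match, so the format may need to be shortened or the links resolved against `rootUrl` first.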

#13 | samb1 | December 30th, 2004, 06:52 AM
Kiteless, could I incorporate the use of the templates into this?

#14 | kiteless | December 30th, 2004, 07:55 AM
I don't understand your last question...can you explain?

#15 | samb1 | December 30th, 2004, 08:04 AM
Sorry, I explained earlier that I'd be using an individual template for each site to scrape. Would the array of structures replace the use of the templates for each site?
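
One possible way to combine the two ideas, offered only as a sketch and not as an answer given in the thread: each per-site template becomes a small file that fills in a struct, and the main page includes every template in turn and appends its struct to the array. The variable names `siteTemplate`, `templateFiles`, and `sites` are assumptions; the keys and values come from the RSPB template posted earlier.

```cfml
<!--- RSPBTemplate.cfm: holds nothing but this site's parameters --->
<cfset siteTemplate = structNew()>
<cfset siteTemplate.URLToScrape = "http://www.rspb.org.uk/vacancies/index.asp">
<cfset siteTemplate.LinkText = "http://www.rspb.org.uk/vacancies/index.asp/?id=">
<cfset siteTemplate.Title = "<h2 id=vol-title>$Title$</h2>">
<cfset siteTemplate.Location = "<p id=location>">
<cfset siteTemplate.Salary = "<h3>Salary</h3>">
```

```cfml
<!--- Main page: include each template and collect its struct --->
<!--- Add one file name per site to this comma-separated list --->
<cfset templateFiles = "RSPBTemplate.cfm">
<cfset sites = arrayNew(1)>
<cfloop list="#templateFiles#" index="templateFile">
    <!--- The included file sets siteTemplate in this page's variables scope --->
    <cfinclude template="#templateFile#">
    <!--- duplicate() stores a copy, so the next include cannot alter this entry --->
    <cfset arrayAppend(sites, duplicate(siteTemplate))>
</cfloop>
<cfoutput>#arrayLen(sites)# site templates loaded</cfoutput>
```

With this layout, adding a new site still means writing one new template file, and the crawler only ever sees the `sites` array.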
