#1
Recursive GET request?
Guys,
My program performs a GET request on a web page, retrieves the href links to jobs, and ignores all other links. What I now need to do is perform a recursive call on each of the links I have retrieved, like a web crawler/spider. Is this possible? At the moment I'm at a bit of a dead end and need a bit of help. Any responses much appreciated.
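For illustration only, here is a minimal sketch of the "keep job links, ignore the rest" step described above. It is written in Python rather than CFML so it can be run standalone, and it is not the poster's actual code. The `JobLinkParser` class, the `extract_job_links` helper, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collects href values from <a> tags that contain a given substring."""

    def __init__(self, link_text):
        super().__init__()
        self.link_text = link_text
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep only hrefs that contain the job-link marker text.
            if name == "href" and value and self.link_text in value:
                self.links.append(value)


def extract_job_links(html, link_text):
    """Return every href in the HTML that contains link_text."""
    parser = JobLinkParser(link_text)
    parser.feed(html)
    return parser.links


# Made-up sample page: one ordinary link and two job links.
sample_html = """
<a href="/about.asp">About us</a>
<a href="/vacancies/index.asp/?id=101">Warden</a>
<a href="/vacancies/index.asp/?id=102">Ecologist</a>
"""
job_links = extract_job_links(sample_html, "index.asp/?id=")
print(job_links)
```

The marker text passed in plays the same role as the "text that a link to a job always contains" idea that comes up later in the thread.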
#2
Sure it's possible, this is what recursive functions are for. In pseudocode:
function recurseCrawlLinks( urlToCrawl ) {
    execute http call to arguments.urlToCrawl;
    parse links found in resulting http content;
    for each link found in the http content, call this function recursively {
        recurseCrawlLinks( new link found in http content );
    }
}
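A runnable version of that pseudocode might look like the sketch below. It is in Python rather than CFML so it can be tried without a ColdFusion server, and several things in it are assumptions that are not in the pseudocode: the in-memory `PAGES` dictionary standing in for real HTTP calls, the `visited` list that stops the recursion from looping between pages that link to each other, and the `max_depth` cut-off.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.test/index": '<a href="/job1">Job 1</a> <a href="/job2">Job 2</a>',
    "http://example.test/job1": '<a href="/index">Back</a>',
    "http://example.test/job2": '<a href="/job1">Related job</a>',
}


def fetch(url):
    """Stand-in for the real HTTP GET; returns "" for unknown pages."""
    return PAGES.get(url, "")


def find_links(base_url, html):
    """Return absolute URLs for every double-quoted href in the HTML."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]


def crawl(url, visited=None, depth=0, max_depth=3):
    """Recursively visit url and every link reachable from it."""
    if visited is None:
        visited = []
    # Stop on pages already seen (avoids infinite loops) or when too deep.
    if url in visited or depth > max_depth:
        return visited
    visited.append(url)
    for link in find_links(url, fetch(url)):
        crawl(link, visited, depth + 1, max_depth)
    return visited


crawled = crawl("http://example.test/index")
print(crawled)
```

Without some kind of visited check, two pages that link to each other would make the function recurse forever, so that guard is worth keeping however the HTTP call itself is done.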
#3
Kiteless, thanks for the reply, but I have changed my approach slightly. I wonder if you can help?
I have now decided to create a CF template for each site to scrape. The template consists of a list of parameters, such as the URL to scrape, the text that a link to a job always contains, and the HTML that surrounds each element we are looking for. The parameter values are different for every site. I then want to pass these values to my main code (which is at the start of the thread), where they can be used as parameters for my web crawling. How can I pass these values across one at a time? Here is my template for one of the sites:

<!---
Filename: RSPBTemplate.cfm
Purpose: A template for scraping the RSPB.org
--->

<!--- Post the parameter data for crawler --->
<cfhttp method="post" url="http://localhost:8500/Project/RSPB.cfm">
    <!--- Template Parameters --->
    <cfhttpparam name="URLToScrape" type="URL" value="http://www.rspb.org.uk/vacancies/index.asp">
    <cfhttpparam name="LinkText" type="URL" value="http://www.rspb.org.uk/vacancies/index.asp/?id=">
    <cfparam name="Title" type="string" default="<h2 id=vol-title>$Title$</h2>">
    <cfparam name="Location" type="string" default="<p id=location>">
    <cfparam name="Salary" type="string" default="<h3>Salary</h3>">
    <cfparam name="HoursContactInfo" type="string" default="<h3>Hours & contract information</h3>">
    <cfparam name="ClosingDate" type="string" default="<h3>Closing date & interviews</h3>Closing date:$ClosingDate$<br>Interview date:">
</cfhttp>

<cfoutput>
#cfhttp.FileContent#
</cfoutput>
#4
I'm still not really sure what you're trying to do. You could pass an array of links to your recursion function. The function would loop over the links in the array. Is that what you are asking about?
However you do it, it doesn't really change the fact that you'll want to write a recursive function that calls itself with new information (a URL to crawl, for example). What happens when the HTTP content is retrieved, and which elements you want to parse from it, is a separate issue.
#5
Sorry for the confusion, Kiteless. I'll try to explain.
I have three separate CF templates for three separate websites to scrape (example code is in my 2nd post). All the templates have a list of cfparam values that are relevant to that website. I would then like to post the cfparam values to a separate CF file. That file then reads the values and performs a series of GET requests based on them, e.g. GET http://www.rspb.org.uk/vacancies/index.asp, then go to the next cfparam and GET http://www.rspb.org.uk/vacancies/index.asp/?id=
My question is: is it possible to send the template cfparam values to another CF file, and use the values to perform a sort of web crawl? If it is, where the hell do I start? Sorry if I am frustrating you.
#6
Yes, you can send variables when you call the CF template... IF the param values are for COOKIE, URL, or FORM variables. If they are in some other scope then you can't "send" them to the template you want to call with CFHTTP... you'll have to run a request that sets the values and then runs the template, so that the variables are there for the template.
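To make the URL versus FORM distinction concrete, here is a small language-neutral sketch (in Python, not CFML) of what actually travels with an HTTP request. It only builds the request object and never sends it. The target address mirrors the localhost one used earlier in the thread, and which values go in which scope is purely an assumption for the example. In CFHTTP terms this roughly corresponds to cfhttpparam with type="URL" versus type="formfield".

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical target; mirrors the localhost address used earlier in the thread.
target = "http://localhost:8500/Project/RSPB.cfm"

# Assumed split for illustration: one value per scope.
url_params = {"URLToScrape": "http://www.rspb.org.uk/vacancies/index.asp"}
form_params = {"Title": "<h2 id=vol-title>$Title$</h2>"}

# URL-scope values ride in the query string, FORM-scope values in the POST body.
request = Request(
    target + "?" + urlencode(url_params),
    data=urlencode(form_params).encode("ascii"),
    method="POST",
)

# Nothing is sent here; we only inspect what the request would carry.
print(request.get_method())
print(request.full_url)
print(request.data)
```

A value that is only set locally in the calling page appears in neither place, which is why it never reaches the called template.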
You're not frustrating me, but I am still confused about what you are trying to do and whether or not there is simply an easier way to do it. Maybe if you step away from the code and just explain what you are trying to do in general terms, it may shed some light. I mean, I know you are trying to crawl some pages. How many pages? Just 3? Or is it a variable number of pages? Do you know the URLs of the pages you want to crawl in advance? Are they always the same URLs? Why do you want to crawl them? What part do these cfparam values play? etc...
#7
OK, I'll start from the beginning. I am aiming to automatically take content (in this case job advertisement details) from websites and add it to a database. Then the database can be free-text searched for job details.
I need to provide a facility that reads the information required to scrape a site from a template. That way, for every new site I need to scrape, all I do is create a new template for that site. At the moment I have three URLs to scrape, but this will increase. These are:
http://www.ccw.gov.uk/vacancies/index.cfm?Action=Vacancies&lang=en
http://www.english-nature.org.uk/news/jobs.asp
http://www.rspb.org.uk/vacancies/index.asp
Each of the above URLs is a job index page containing a list of job links, not the actual job info I need. The info I need can be accessed via the individual job links on the index pages (hence the need for a crawler). My thinking was that I could pass the parameters in each template to a crawler. The crawler then looks for each of the parameters until it finds a match, then moves on to the next one.
#8
OK that makes some sense. What "parameters" do you want to pass into the crawler? Is this how you are telling it what kind of link to look for?
#9
Yes, I hoped to pass the URLToScrape parameter first, as this will find the job index page, e.g. http://www.rspb.org.uk/vacancies/index.asp.
Then, once the index page is found, the LinkText parameter will be used to look for URLs that point to the specific jobs, e.g. http://www.rspb.org.uk/vacancies/index.asp/?id= From there, the HTML that surrounds the job title, salary, location, etc. will be passed. Am I still making sense?
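As a rough sketch of that last step (pulling a value out from between the HTML that surrounds it), something like the following could work. It is in Python rather than CFML so it runs standalone. It assumes the $Name$ placeholder convention from the template posted earlier, and the helper names and the sample page content are made up for the example.

```python
import re


def template_to_regex(template):
    """Turn 'before$Name$after' into a regex with a named capture group."""
    # Splitting on $Name$ yields literal text at even indexes, names at odd ones.
    parts = re.split(r"\$(\w+)\$", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal HTML around the value
        else:
            pattern += "(?P<%s>.*?)" % part    # the value we want to capture
    return re.compile(pattern, re.DOTALL)


def extract_field(html, template):
    """Return the captured fields as a dict, or None if the markup is absent."""
    match = template_to_regex(template).search(html)
    return match.groupdict() if match else None


# Made-up job page for demonstration.
page = "<html><h2 id=vol-title>Reserve Warden</h2><p id=location>Sandy</p></html>"
title = extract_field(page, "<h2 id=vol-title>$Title$</h2>")
missing = extract_field(page, "<h3>Salary</h3>$Salary$<br>")
print(title, missing)
```

Returning None when the surrounding markup is not found gives the caller a place to log that a site's HTML has changed, instead of silently storing an empty value.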
#10
It makes sense but it seems really awkward. Is there no way you can contact the job sites and have them create a text version of the job postings, or possibly a web service or XML/RSS feed to give you the postings?
The method you are considering to scrape the jobs is extremely brittle... any change to the link structure or HTML markup at any of the sites will break your application.
#11
Yes, I know it's a bit brittle, but this is the way I need to do it. They don't have XML/RSS feeds, and I'll have to change the templates to look for different HTML if it changes.
Without XML/RSS or text versions, how would you approach this, taking into account how I have approached it so far?
#12
First I would really push them to give you something. If nothing else, just a text version of the content; this would only take them a few minutes to create if they are pulling it from a database. If they are interested in doing business with you or having their jobs filled, I would think they would consider this. If not that, then ask them to embed something into the source code that would make it easier for you to identify the jobs (a special HTML comment, or even a tilde or something, maybe).
Other than that, you don't have any choice but to try and scrape the pages, as you have already noted. I'd create an array of structures that holds the root site URL as well as the format of the links being searched for under each root site URL. Loop over that array, and for each iteration of the loop execute your recursive HTTP request and build up an array of job postings as each site is recursed.
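The array-of-structures idea above could be sketched roughly like this. It is Python rather than CFML so it runs standalone, with a list of dicts standing in for the array of structures. The site entries, the key names `root_url` and `link_pattern`, and the canned `PAGES` that replace real HTTP calls are all hypothetical.

```python
import re

# Hypothetical site list: one dict per site, mirroring the "array of structures" idea.
SITES = [
    {"root_url": "http://site-a.test/jobs", "link_pattern": "/jobs?id="},
    {"root_url": "http://site-b.test/vacancies", "link_pattern": "/vacancy/"},
]

# Canned pages standing in for real HTTP responses.
PAGES = {
    "http://site-a.test/jobs": (
        '<a href="http://site-a.test/jobs?id=1">A1</a> '
        '<a href="http://site-a.test/contact">Contact</a>'
    ),
    "http://site-a.test/jobs?id=1": "<h2>Ranger</h2>",
    "http://site-b.test/vacancies": '<a href="http://site-b.test/vacancy/7">B7</a>',
    "http://site-b.test/vacancy/7": "<h2>Surveyor</h2>",
}


def fetch(url):
    """Stand-in for the real HTTP GET; returns "" for unknown pages."""
    return PAGES.get(url, "")


def collect_postings(site, url, seen, postings):
    """Recurse from url, following only links that match the site's pattern."""
    if url in seen:
        return
    seen.add(url)
    html = fetch(url)
    # A page whose URL matches the pattern is treated as a job posting.
    if site["link_pattern"] in url:
        postings.append({"url": url, "html": html})
    for href in re.findall(r'href="([^"]+)"', html):
        if site["link_pattern"] in href:
            collect_postings(site, href, seen, postings)


postings = []
for site in SITES:
    collect_postings(site, site["root_url"], set(), postings)

print([p["url"] for p in postings])
```

Adding a new site then means appending one more entry to the list, with no change to the crawling function.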
#13
Kiteless, could I incorporate the use of the templates into this?
#14
I don't understand your last question...can you explain?
#15
Sorry, I explained to you that I'd be using individual templates for each site to scrape. Would the array of structures replace the use of the templates for each site?
Viewing: Dev Shed Forums > Programming Languages - More > ColdFusion Development > Recursive GET request?