September 28th, 2012, 09:10 PM
PHP website scraping, crawling, etc.
I have a question about what I want to do before I put a lot of time into learning about it and implementing it.
Let's say that I want to extract data from a public website that has many pages. Each page I want to extract from has a very similar format; the only difference is the number of rows in the table I hope to extract from.
How easy would it be to write a script to search, crawl, or whatever you want to call it through all these pages and extract the data from the table cells on each page? I already tried Inspyder and it looks promising, but I don't know whether just using a script would be easier.
Any thoughts? Would I have to use curl or another language?
Thanks for the help
September 28th, 2012, 10:13 PM
It depends on the HTML, really. But first have you checked for an API? What kind of data from what kind of site?
September 29th, 2012, 12:05 AM
Well, I can't say I'm absolutely sure what you mean by an API. Is that something that would make accessing all the data easier, something released by the company? If so, no, there isn't an API for this website. It's just simple tables with names and marks for high jump results. There are links in the tables, but nothing very special about them; there are just a lot of them.
September 29th, 2012, 06:01 PM
Anyone have any suggestions?
September 29th, 2012, 07:23 PM
An API is an interface provided by a website so that third-party software can use the site's data directly. For instance, Facebook has an API that lets you implement its Like functionality on your own website.
My first guess in your case would be to use cURL. By the way: in my country, and if you are anywhere in Europe, you might want to think about the legal aspects of copying/using another website's data. It may very well be protected, and you can get into quite a bit of financial trouble (assuming they put time and effort into assembling the data in that form).
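Since cURL came up, here is a minimal sketch of fetching one page with PHP's cURL extension. The option values (timeout, redirect handling) are illustrative choices, not requirements, and it assumes ext-curl is enabled:

```php
<?php
// Fetch a single page and return its HTML as a string
// (or false if the request fails). Assumes ext-curl is enabled.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // give up after 30 seconds
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;  // string on success, false on failure
}
```

You would then call `fetch_page()` once per URL in your list and feed each result to whatever parsing step you use.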
September 29th, 2012, 09:38 PM
Well, the exact data is on two completely different websites, and it is user input. Anyway, I've been looking around; what about file_get_contents()? If I have the list of URLs I need to search, is there any way I can loop through them with a while statement, search each string returned by file_get_contents(), and output what I want?
Also, how can I parse the string to show only what I want? Thanks.
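A sketch of both halves of that question, assuming PHP's DOM extension is available: fetch each URL with file_get_contents() and parse the table cells with DOMXPath. The sample HTML and the URL list are made-up placeholders; the XPath expression would need to match the real table markup:

```php
<?php
// Extract every table row from an HTML string as an array of
// arrays of cell text. Assumes ext-dom (bundled with PHP).
function extract_rows($html)
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // @ suppresses warnings from sloppy real-world HTML
    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('.//td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells) {                // header rows with only <th> cells are skipped
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Placeholder demo with inline HTML; with a real URL list you would
// loop and call file_get_contents($url) instead.
$sample = '<table><tr><th>Name</th><th>Mark</th></tr>'
        . '<tr><td>J. Smith</td><td>2.10m</td></tr></table>';
print_r(extract_rows($sample));
```

In your loop, each `$html = file_get_contents($url)` result would be passed through `extract_rows()` and the returned arrays written wherever you collect the data (CSV, database, etc.).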
September 30th, 2012, 01:12 AM
Rather than coding something yourself, have a look at OutWit Hub; it may do what you want.
September 30th, 2012, 09:17 PM
I bought it and it's exactly what I needed, thanks! It's going to save me hundreds of hours of tedious work, so know that I appreciate it.
October 1st, 2012, 12:09 AM
Glad it worked out for you.