#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    40
    Rep Power
    3

    PHP Website Scraping, crawling.... etc.


    I have a question about what I want to do before I put a lot of time into learning about it and implementing it.

    Let's say that I want to extract data from a public website that has many pages. Each page that I want to extract from has a very similar format, the only difference being the number of rows in the table that I hope to extract from.

    How easily would I be able to write a script to search, crawl, or whatever you want to call it through all these pages and extract the data from the cells in the table in each page? I already tried using Inspyder and it looks promising but I don't know if just using a script would be easier.

    Any thoughts? Would I have to use curl or another language?
    Thanks for the help
  2. #2
  3. Transforming Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,245
    Rep Power
    9400
    It depends on the HTML, really. But first have you checked for an API? What kind of data from what kind of site?
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    40
    Rep Power
    3
    Originally Posted by requinix
    It depends on the HTML, really. But first have you checked for an API? What kind of data from what kind of site?
    Well I can't say I'm absolutely sure what you mean by an API. Is that something that would make accessing all the data easier? Something that was released by the company? Because no, there isn't an API for this website. It is simple tables with names and marks for high jump results. There are links in the tables. Nothing very special about them, just a lot of them.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    40
    Rep Power
    3
    Anyone have any suggestions?
  8. #5
  9. For POny!
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Location
    Amsterdam
    Posts
    416
    Rep Power
    115
    Originally Posted by jel5363
    Anyone have any suggestions?
    An API is an interface that a website provides so that third-party websites and programs can use it. For instance, Facebook has an API that allows you to implement the Like button on your own website.

    My first guess in your case would be to use cURL. By the way, in my country, and if you are in Europe, you might want to think about the legal aspects of copying or using another website's data. It can very well be protected, and you can get into quite some financial trouble (assuming they put time and effort into assembling the data in that manner).
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    40
    Rep Power
    3
    Originally Posted by aeternus
    An API is an interface that a website provides so that third-party websites and programs can use it. For instance, Facebook has an API that allows you to implement the Like button on your own website.

    My first guess in your case would be to use cURL. By the way, in my country, and if you are in Europe, you might want to think about the legal aspects of copying or using another website's data. It can very well be protected, and you can get into quite some financial trouble (assuming they put time and effort into assembling the data in that manner).
    Well, the exact data is on two completely different websites, and it is user input. Anyway, I've been looking around, and what about "file_get_contents"? If I have the list of URLs I need to search, is there any way I can loop through them with a while statement, search each $string from file_get_contents, and output what I want?

    Also, how can I parse the string to only show what I want? Thanks.
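For what it's worth, the approach asked about above (loop over a list of URLs, fetch each page with file_get_contents, then pull the cells out of the table) could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function name `extractTableRows`, the `$urls` list, and the example URL are hypothetical, and it assumes the data sits in ordinary `<td>` cells and that PHP's DOM extension is enabled. It uses DOMDocument and DOMXPath rather than string searching, since those tend to cope better with a varying number of rows.

```php
<?php
// Hypothetical helper: returns each table row of a page as an array of
// trimmed cell texts. Rows that contain no <td> (e.g. header rows made
// only of <th>) are skipped.
function extractTableRows(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly
    // instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // textContent also flattens links inside the cell to plain text.
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Hypothetical list of pages to visit; fill in the real URLs.
$urls = [
    // 'http://example.com/results/page1.html',
];

foreach ($urls as $url) {
    $html = @file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }
    foreach (extractTableRows($html) as $cells) {
        echo implode("\t", $cells), PHP_EOL; // one tab-separated line per row
    }
}
```

A foreach loop is used here instead of a while statement simply because the URLs are already in an array. Fetching with file_get_contents only works if allow_url_fopen is enabled on the server; otherwise cURL would be the usual substitute for that one line.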
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2012
    Posts
    3
    Rep Power
    0
    Rather than coding something yourself, have a look at OutWit Hub; it may do what you want.

    Comments on this post

    • jel5363 agrees : Woohoo! It works!
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2012
    Posts
    40
    Rep Power
    3
    Originally Posted by Valkrider
    Rather than coding something yourself, have a look at OutWit Hub; it may do what you want.
    I bought it and it's exactly what I needed, thanks! Gonna save me hundreds of hours of tedious work... so you know I appreciate it.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2012
    Posts
    3
    Rep Power
    0
    Originally Posted by jel5363
    I bought it and it's exactly what I needed, thanks! Gonna save me hundreds of hours of tedious work... so you know I appreciate it.
    Glad it worked out for you.
