    #1
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0

    Question: How To Open More Than One Webpage Simultaneously In The Background With Many Threads


    PHP Folks,

    When we used to build bots (.exe) with Ubot, we could open many threads in the background that downloaded many pages simultaneously.
    I'm guessing our RAM was enough to open 100 pages in the background.
    Can't PHP do the same? Open 100 threads and get cURL to load 100 pages in the background simultaneously, so the user only sees one page loading on his screen while the other 99 are out of sight.
    When you submit your URL to my SE (search engine), it will first crawl the link you submitted, and when it finds more links on the page, say 50, I want it to open 50 threads to load all 50 pages simultaneously in the background (to save time) and scrape their contents (meta tags, links, etc.).
    Open many pages simultaneously, or visit many links simultaneously (as you would put it), for spidering purposes.

    So, would you fine folks mind showing us newbies one code snippet on how to open many threads, and another snippet showing how to load many pages in the background, out of the user's sight, to scrape them?
    And finally, a last snippet that does the two things mentioned above?

    Thanks!
    Last edited by UniqueIdeaMan; May 26th, 2018 at 12:57 PM.
    #2
    Code Monkey V. 0.9
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2005
    Location
    A Land Down Under
    Posts
    2,472
    Rep Power
    2105
    Originally Posted by UniqueIdeaMan
    Can't PHP do the same? Open 100 threads and get cURL to load 100 pages in the background simultaneously, so the user only sees one page loading on his screen while the other 99 are out of sight.
    When you submit your URL to my SE (search engine), it will first crawl the link you submitted, and when it finds more links on the page, say 50, I want it to open 50 threads to load all 50 pages simultaneously in the background (to save time) and scrape their contents (meta tags, links, etc.).
    Open many pages simultaneously, or visit many links simultaneously (as you would put it), for spidering purposes.
    Honestly - you wouldn't. Use the right tool for the right job. PHP is not the right language to do things like this. It doesn't deal with threads much at all, and isn't built to do anything like that. If you're doing that, you'd want a separate system running to do the indexing and parsing, completely separate from your PHP front end.
    #3
    No Profile Picture
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jul 2003
    Posts
    4,472
    Rep Power
    653
    Yes, this is a job better suited to Python or maybe Perl, but Python is probably the easier tool for multi-threading. In any case, this is far beyond the abilities of this particular OP.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
    #4
    Banned (not really)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 1999
    Location
    Caro, Michigan
    Posts
    14,961
    Rep Power
    4575
    Originally Posted by gw1500se
    In any case, this is far beyond the abilities of this particular OP.
    Truth.
    -- Cigars, whiskey and wild, wild women. --
    #5
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by Catacaustic
    Honestly - you wouldn't. Use the right tool for the right job. PHP is not the right language to do things like this. It doesn't deal with threads much at all, and isn't built to do anything like that. If you're doing that, you'd want a separate system running to do the indexing and parsing, completely separate from your PHP front end.
    Are you sure about that? So when Google gets a list of a million links, they get their crawler to crawl each page one by one? No threads involved?
    I can always build a .exe crawler; I know how to do that, and I can get that crawler to open 100 threads to load 100 pages simultaneously to do the indexing. But I don't want to have to keep my home computer on for the crawler to crawl millions or billions of pages.
    Therefore, I need to build a PHP one so it runs on my VPS 24/7.
    Understand?
    #6
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by gw1500se
    Yes, this is a job better suited to Python or maybe Perl, but Python is probably the easier tool for multi-threading. In any case, this is far beyond the abilities of this particular OP.
    I read that Google's crawler was built with Python.
    #7
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by Sepodati
    Truth.
    I will see if I can come up with a workaround. But I'd like to see you build one and claim it is not beyond your capacity. Can you prove that?
    I did not think so.
    #8
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    The workaround is to get the main script to open similar crawling scripts in many popups and get those popups (scripts) to scrape the other pages.
    So, if the main crawler is on the main homepage and it scrapes 50 links, it will open 50 small popups (each opening one of those 50 links), all with the same/similar crawler script, and scrape the pages for more links.
    That way, the 50 tiny popups can act as the 50 threads. But this looks messy.
    Let us try a less messy solution.
    How about one popup that opens 50 iframes, where each iframe opens one of the 50 proxied links scraped on the homepage? (The found links get proxified before being opened for crawling.) That way, popups don't fill up the screen.
    The 50 proxied pages will have the crawler script loading in their headers, which will scrape the links found on the pages opened in the iframes. Good idea!
    What do you think, Kicken?
    Cato and Sepo, can you find any flaws in this one?
    Mmm. Let me try. The 50 iframes won't all open simultaneously; they will open one by one, so it's best to just crawl the 50 pages one by one without opening any iframes or popups. Plus, proxy pages take time to load, so this solution is the worst of all. Right?
    Oh well. I tried. So, what is your solution? I tried two solutions on the spot, be they crap or good. Now you try at least one, Cat & Sepo.
    I wonder what Kicken and Barand would come up with. Mmm!
    Last edited by UniqueIdeaMan; May 26th, 2018 at 04:16 PM.
    #9
    No Profile Picture
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jul 2003
    Posts
    4,472
    Rep Power
    653
    When your only tool is a hammer, the world looks like a nail.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
    #10
    Wiser? Not exactly.
    Devshed God 2nd Plane (6000 - 6499 posts)

    Join Date
    May 2001
    Location
    Bonita Springs, FL
    Posts
    6,274
    Rep Power
    4193
    If you want to build a crawler that indexes stuff, you should be building a CLI-based script, not something browser-based.

    cURL does support downloading multiple things more or less simultaneously, but the processing of the pages would be single-threaded and would slow things down.
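
    For illustration, here is a minimal sketch of that curl_multi approach. The URL list, the cURL options, and the echo at the end are just placeholders, not anyone's actual crawler code:

    PHP Code:
    <?php
    // Fetch several URLs concurrently with curl_multi, then process the
    // responses one at a time (the processing itself stays single-threaded).
    $urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ];

    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch); // downloaded body, or null on failure
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        // ... scrape $html for links, meta tags, etc. ...
        echo $url . ': ' . strlen((string) $html) . " bytes\n";
    }

    curl_multi_close($mh);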

    You could run several instances of your script in parallel to increase processing speed. For example, if you have a 4-core processor, run 4 instances of the script, each one downloading and processing different URLs. You can have another script monitor those processes and relaunch them if they exit.
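
    A rough sketch of that monitor-and-relaunch idea is below. The worker.php name, the worker count, and the polling interval are assumptions; each worker would pull its own URLs from some shared queue (a database table, Redis, etc.):

    PHP Code:
    <?php
    // Keep $workerCount copies of a worker script running and relaunch
    // any copy that has exited or crashed.
    $workerScript = __DIR__ . '/worker.php'; // hypothetical worker that crawls URLs from a shared queue
    $workerCount  = 4;                       // e.g. one worker per CPU core
    $procs = [];

    while (true) {
        for ($i = 0; $i < $workerCount; $i++) {
            if (isset($procs[$i])) {
                $status = proc_get_status($procs[$i]);
                if ($status['running']) {
                    continue; // still working, leave it alone
                }
                proc_close($procs[$i]); // reap the finished process
            }
            // (Re)start worker $i; it inherits this script's stdout/stderr.
            $cmd = 'php ' . escapeshellarg($workerScript) . ' ' . $i;
            $procs[$i] = proc_open($cmd, [], $pipes);
        }
        sleep(5); // check on the workers again in a few seconds
    }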

    I've used this approach before when scraping data from an API. It works OK, and with some proper design it can scale reasonably well.
    Recycle your old CD's



    If I helped you out, show some love with some reputation, or tip with Bitcoins to 1N645HfYf63UbcvxajLKiSKpYHAq2Zxud
    #11
    Banned (not really)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 1999
    Location
    Caro, Michigan
    Posts
    14,961
    Rep Power
    4575
    Originally Posted by UniqueIdeaMan
    I will see if I can come up with a workaround. But I'd like to see you build one and claim it is not beyond your capacity. Can you prove that?
    I did not think so.
    Can you **** off? I do not think so.
    -- Cigars, whiskey and wild, wild women. --
    #12
    Banned (not really)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 1999
    Location
    Caro, Michigan
    Posts
    14,961
    Rep Power
    4575
    can you find any flaws in this one?
    Just the author.
    -- Cigars, whiskey and wild, wild women. --
    #13
    Code Monkey V. 0.9
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2005
    Location
    A Land Down Under
    Posts
    2,472
    Rep Power
    2105
    Originally Posted by UniqueIdeaMan
    Are you sure about that? So when Google gets a list of a million links, they get their crawler to crawl each page one by one? No threads involved?
    I can always build a .exe crawler; I know how to do that, and I can get that crawler to open 100 threads to load 100 pages simultaneously to do the indexing. But I don't want to have to keep my home computer on for the crawler to crawl millions or billions of pages.
    Therefore, I need to build a PHP one so it runs on my VPS 24/7.
    Understand?
    I do understand. You don't.

    Google didn't write their crawlers in PHP. They used a programming language that was suitable. And as has been said, while you can do it in PHP, it's a really bad idea and will not scale well.

    I know this because I've done data scraping before, and it takes time, both for writing and processing. A single VPS will not handle what you want to do with it.

    And... if you can write an .exe that can do this, then go and do it. It will work better. Forget PHP for this one. Be smart about it (for once).
    #14
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by gw1500se
    When your only tool is a hammer, the world looks like a nail.
    Care to explain a little bit more what you meant? I hope you did not mean that when I get my hammer out, everyone's heads look like something to hammer at! Lol!
    Is your head bald? It would be a nice target! No offense. Just a joke!
    Why not provide some worthy input, old man, rather than sarcasm every single time? Look at Sepodati. He used to be like you at first, but nowadays his sarcasm or criticism always comes with a hint to solve my PHP problems.
    You're welcome to do likewise rather than make silly remarks, before I start imagining your head with no hair, shining like a snooker ball, when I get my hammer out. Ouch! Lol!

    Take care, silly billy!
    #15
    No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2017
    Posts
    845
    Rep Power
    0
    Originally Posted by kicken
    If you want to build a crawler that indexes stuff, you should be building a CLI-based script, not something browser-based.

    cURL does support downloading multiple things more or less simultaneously, but the processing of the pages would be single-threaded and would slow things down.

    You could run several instances of your script in parallel to increase processing speed. For example, if you have a 4-core processor, run 4 instances of the script, each one downloading and processing different URLs. You can have another script monitor those processes and relaunch them if they exit.

    I've used this approach before when scraping data from an API. It works OK, and with some proper design it can scale reasonably well.
    Mmm. Thanks.
    I don't know what CLI is, so I'm googling it.
    I was thinking of running many instances of the same script too, but I thought it wouldn't match the effectiveness of threading; now you confirm it can. Good.
    Anyway, the other day I thought of half a workaround. The other half of the idea is starting to form in this idea-generating head of mine. Once it has finished downloading into my head from some unknown source, I might keep you all updated.
    I hope it's a good enough idea to get Catacaustic all worked up! Lol! And better yet, Sepo! Lol