  #16
    gw1500se
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date: Jul 2003
    Posts: 4,451
    Rep Power: 652
    Originally Posted by UniqueIdeaMan
    Care to explain a little bit more what you meant?
    You just don't get it. PHP is your hammer, so every task looks like a nail to you. PHP is the wrong tool for that job. How much more explicit can we get?
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  #17
    kicken
    Wiser? Not exactly.
    Devshed God 2nd Plane (6000 - 6499 posts)

    Join Date: May 2001
    Location: Bonita Springs, FL
    Posts: 6,265
    Rep Power: 4193
    Originally Posted by UniqueIdeaMan
    I was thinking of running many instances of the same script too, but I thought it wouldn't match the effectiveness of threading
    When it comes to executing code, there's not really a difference between a thread and a process; a process is essentially just a single thread. Processes have slightly more overhead to start and stop, but unless you're constantly starting and stopping them, that doesn't really matter.

    If you have, for example, a Ryzen 5 1600, which is listed as a 6-core/12-thread processor, then you could run either one process with 12 concurrent threads or 12 concurrent single-threaded processes.
    Recycle your old CDs



    If I helped you out, show some love with some reputation, or tip with Bitcoins to 1N645HfYf63UbcvxajLKiSKpYHAq2Zxud
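
    A minimal sketch of the "12 concurrent single-threaded processes" approach kicken describes, assuming the pcntl extension (PHP CLI only) is available; crawl_url() and the URL list are hypothetical placeholders, not code from this thread:
    [code]
    <?php
    // Fork one single-threaded worker process per hardware thread.
    // Requires the pcntl extension (PHP CLI). crawl_url() is a
    // hypothetical stand-in for "fetch and process one page".

    $urls    = ['https://example.com/a', 'https://example.com/b'];
    $workers = 12; // e.g. one per hardware thread on a 6-core/12-thread CPU

    for ($i = 0; $i < $workers; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("fork failed\n");
        } elseif ($pid === 0) {
            // Child: handle every $workers-th URL, then exit.
            for ($j = $i; $j < count($urls); $j += $workers) {
                crawl_url($urls[$j]); // hypothetical helper
            }
            exit(0);
        }
        // Parent falls through and forks the next worker.
    }

    // Parent: wait for all children to finish.
    while (pcntl_wait($status) > 0);
    [/code]
    Whether 12 workers is the right number depends on how I/O-bound the crawl is; since most of the time is spent waiting on the network, even more workers than hardware threads can pay off.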
  #18
    UniqueIdeaMan
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date: Jan 2017
    Posts: 830
    Rep Power: 0
    Originally Posted by gw1500se
    You just don't get it. PHP is your hammer, so every task looks like a nail to you. PHP is the wrong tool for that job. How much more explicit can we get?
    I understand now. Because I only know a little PHP, I assume it's best for all my jobs, when it is not.
  #19
    UniqueIdeaMan
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date: Jan 2017
    Posts: 830
    Rep Power: 0
    Originally Posted by kicken
    When it comes to executing code, there's not really a difference between a thread and a process; a process is essentially just a single thread. Processes have slightly more overhead to start and stop, but unless you're constantly starting and stopping them, that doesn't really matter.

    If you have, for example, a Ryzen 5 1600, which is listed as a 6-core/12-thread processor, then you could run either one process with 12 concurrent threads or 12 concurrent single-threaded processes.
    Mmm, processes. I have come across processes when I kill apps running in the background via Task Manager.
    Processes, threads... getting a little technical here. But they're all good basics of computing to know.
  #20
    UniqueIdeaMan
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date: Jan 2017
    Posts: 830
    Rep Power: 0
    Originally Posted by kicken
    If you want to build a crawler that indexes stuff, you should be building a CLI-based script, not something browser-based.

    cURL does support downloading multiple things more or less simultaneously, but the processing of the pages would be single-threaded and would slow things down.

    You could run several instances of your script in parallel to increase processing speed. For example, if you have a 4-core processor, run 4 instances of the script, each one downloading and processing different URLs. You can have another script monitor those processes and relaunch them if they exit.

    I've used this approach before when scraping data from an API. It works OK, and with proper design it can scale reasonably well.
    https://en.wikipedia.org/wiki/Command-line_interface

    So, I should be MS-DOS-ing and the like? Mmm. Since I'm gonna be hiring a VPS, I might as well use the Linux CLI. I'm not used to Linux, or to the CLI, be it Linux or Windows.
    But thanks for the suggestion.
    One question though: even if I use the Linux CLI on my VPS or dedicated server, I can just get the process running and then shut down my home computer, and it won't affect the processes running on the webhost's VPS/dedicated server. Right?
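
    A minimal sketch of the curl_multi approach kicken's quote refers to, downloading several pages concurrently in one PHP process; parse_page() and the URLs are hypothetical placeholders:
    [code]
    <?php
    // Download several pages "more or less simultaneously" with
    // curl_multi; the processing loop afterwards is still
    // single-threaded, as kicken notes.

    $urls = ['https://example.com/1', 'https://example.com/2'];

    $mh      = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every download has finished.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // block briefly instead of busy-waiting
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        parse_page($url, $html); // hypothetical: extract links, keywords, etc.
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    [/code]
    As to the shutdown question: yes, a script started on the VPS itself (for example under nohup, screen, or tmux) keeps running there after you disconnect and switch off your home computer.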
  #21
    UniqueIdeaMan
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date: Jan 2017
    Posts: 830
    Rep Power: 0
    Originally Posted by Catacaustic
    I do understand. You don't.

    Google didn't write their crawlers in PHP. They used a programming language that was suitable. And as has been said, while you can do it in PHP, it's a really bad idea and will not scale well.

    I know this because I've done data scraping before, and it takes time, both for writing and processing. A single VPS will not handle what you want to do with it.

    And... if you can write a .exe that can do this, then go and do it. It will work better. Forget PHP for this one. Be smart about it (for once).
    Ok. I was just gonna tell you guys (to get in your good books) that I will use a CLI one or a .exe one to crawl the web. But in reality, I was gonna continue crawling the web with my PHP crawler by running it on my rented VPS, as that way I don't have to keep my home computer on 24/7. That is the real issue here: keeping the flaming computer on 24/7.
    And all this because I do not want to have my home computer on 24/7 to run the .exe or CLI crawler.
    But now I'm really looking into your suggestion to use the .exe crawler, because I'm struggling to build even a basic crawler in PHP, so how on earth am I gonna build one as smart as Googlebot that deals with keywords, key phrases, synonyms, page ranks, etc.?
    As for building a .exe crawler: since I'm gonna use Ubot Studio (a GUI bot-programming tool), it should not be much of a problem, as I have experience with it from 2011-2016.
    And I don't have to keep my home computer on 24/7, because I can now run the CLI or .exe crawler from the rented VPS. When things get very trafficky, I can rent a dedicated server afterwards. Good idea. I think you will like this plan.
    Kicken suggested the CLI. If I remain in this profession (web crawling), then I can look into building a Linux CLI one. A Windows one would most likely be slow, just like MS Windows.
    As of now, I'm not gonna look into CLI programming, as I'm already struggling with one language (PHP) as it is.
    One day I will get into Python. And so I might pass on the CLI idea and move on to the Python crawler idea.
    On the other hand, I'm not planning on jumping into Python anytime soon, so you never know: I might learn a thing or two about CLI programming (I already came across PuTTY while learning how to install the OS on the rented VPS) and may need to get more into terminal usage. And so, bit by bit, you never know, I might just acquire enough knowledge to build my own CLI one before I migrate to Python. So I'm not dismissing Kicken's suggestion.
    I had a look at Python code the other day and it did my head in. It seemed more complicated than PHP. Nah! Best not to jump into Python now. When I was struggling with PHP, I was advised a few months ago to quit and migrate to Python, but I felt bad that I wouldn't be able to capitalize on the $_GET by quitting PHP, and so I'm still holding on. Best to crawl forward rather than quit and turn back. Plus, I've got you guys helping me out more often now to learn a thing or two about PHP.
    Even those old duffers Sepodati and Gw1500se would feel bad if I suddenly quit. I know you, Kicken, Barand, and Dsmabismad would be very disappointed if I suddenly said bye-bye halfway.
    Last edited by UniqueIdeaMan; May 29th, 2018 at 01:45 PM.
  #22
    UniqueIdeaMan
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date: Jan 2017
    Posts: 830
    Rep Power: 0
    Kicken,

    You remember, a few posts above, I said that a few days ago HALF of a PHP crawler idea got generated in this idea-generating brain of mine, and the other HALF I left for some other day.
    Well, today, while answering some other threads of mine, I was subconsciously multi-tasking (threading!), and half of the other HALF's idea came into my head. There's still a QUARTER to go before the idea is 100% complete, and so I have not gone there quite yet.
    Gonna mention my three-QUARTER idea here to see if you guys can improve on it and complete the remaining QUARTER.

    Ok, the other day I was wondering about building a CLIENT crawler. Now, web crawlers are usually server based, not client based.
    Software, audio, video, etc. downloading websites are server based too, not client based. But look at KaZaA and Napster. They are decentralized. They don't have a central base or headquarters. No downloading website.
    Instead, every user is a contributor, and their own computers (clients) become the uploading/downloading site (so to speak), and so the uploading/downloading bandwidth of the network is never throttled.
    I got thinking: Google has a massive datacentre to do its crawling tasks and SERP-serving tasks. I can't be babysitting a datacentre like that.
    And so, why not make the users' home computers the datacentre network (so to speak)? Why not use their bandwidth to do the crawling?
    Maybe I give them a .php crawler they can run locally (WAMP/XAMPP/etc.), and once the crawling is complete (using their bandwidth and time), the data can be forwarded or submitted by their clients (PHP crawlers) to my site's MySQL? (A sketch of this submission step appears after this post.)
    Now, there is a security issue here, as the users can manipulate the crawled data and do keyword stuffing. And so, this is one issue I have to ponder.
    The other day, regarding another issue, I was gonna ask how some web-script vendors encrypt their source code. Maybe I can do that, so any Tom fool having a sneak peek at my PHP crawler's source code will not understand a single line to do any messing about?

    A few years back, half a dozen or a dozen years back, I thought I'd build a search engine where my users would have my crawled index on their side. Client side. That way, they can search my search engine not on my website (the central headquarters) but on their client side. That way, my website does not get clogged, and to prevent traffic clogging I won't have to rent clusters all across the globe like Google.
    I'll compete with Google alright, as their days are numbered. But I will be smarter than them. They will work harder, burn more electricity, burn more bandwidth, spend more money on hiring employees and renting premises (datacentres), etc., and I'll just work smarter by doing none of that. I will use the public's (my users') resources: their computers, their bandwidth, their money (internet connection costs), and get them to build my search engine by having their computers do all the crawling and even the search-result presenting. Their home computers (client terminals) will download my index and perform keyword searches on it.
    Yeah, I know. Catacaustic is rolling his eyes, just about to reply that it's gonna take ages for each of my users to download my index. But I'm smart and have already thought about this downside. I'm gonna use an algorithm to compress my index. My own algorithm. And then zip it for others to download.
    Now, I've got to build up on this idea. Remember, at the beginning I said that this idea is 75% complete? So keep that in mind. When it is 100% complete, even Sepodati will be pleased!

    But first things first. First, I need to build the PHP crawler, then turn it into a client-side crawler, and only then build the PHP search engine and turn it into a client-side search engine. I'm not too worried about how to build the search feature of my search engine.
    Another idea came into my blessed head. The search engine I was talking about, the search feature I mean: it was gonna be a webpage with a search box that queries my index residing on each of my users' HDDs. It was supposed to be a PHP search feature. Now I'm thinking, how about I build a .exe one too? That way, each user can have a copy of both the .php and the .exe search tool, which I call the search features or search engines.
    Just like they can have a .php crawler and a .exe crawler to crawl websites and submit their crawled data to my website's MySQL db,
    they can also have a .php and a .exe search engine that present copies of my index from my headquarters (my website).
    Good idea!
    Last edited by UniqueIdeaMan; May 29th, 2018 at 04:45 PM.
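
    A minimal sketch of the client-to-server submission step this post describes: the client crawler gzip-compresses its results and POSTs them to a central collection endpoint. The endpoint URL, the payload shape, and the compression choice are all assumptions for illustration, not an existing design:
    [code]
    <?php
    // Client side of the distributed-crawler idea: package the crawl
    // results and submit them to the central site for indexing.
    // submit.php and the payload format are hypothetical.

    $results = [
        ['url' => 'https://example.com/', 'title' => 'Example', 'keywords' => ['example']],
    ];

    $body = gzencode(json_encode($results)); // compress before upload

    $ch = curl_init('https://my-search-engine.example/submit.php');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/json',
                                   'Content-Encoding: gzip'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);

    // Caveat from the post itself: the client controls this code, so
    // obfuscating the PHP source would not stop keyword stuffing. The
    // server would have to re-fetch a random sample of the submitted
    // URLs and compare before trusting anything into the index.
    [/code]
    Since the client runs the code, no amount of source-code encryption makes the submitted data trustworthy; any real defence would have to be server-side verification like the sampling check sketched in the comments.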