#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8

    Question How Did You Structure Your Index Mysql Tbl ?


    Php Buddies,


    I'm going to jump into building my own searchengine now.
    Starting off with the search box, then the index and then finally the crawler.
    Not too worried about the crawler. Gonna make use of cURL and implode,explode, etc. php functions.
    It's the Index that I quite can't get my head around.

    To build the Index, should I have the structure of the mysql tbl like this .... ?

    Option 1

    Columns
    Url|Keywords

    Or, should I make the structure like this instead ..... ?

    Option 2

    Columns
    Keyword|Urls


    Option 1a Example

    Columns
    Url                         |             Keywords
    -----------------------------------------------------------------------
    devshed.com        |             forum, programming, php

    Option 1b Example

    Columns
    Keyword                                |         Urls
    --------------------------------------------------------------------------------------------------------------------------
    forum                                     |         devshed.com, blackhat.com, warriorforum.com
    ---------------------------------------------------------------------------------------------------------------------------
    php                                        |         devshed.com/forum/php.htm, sitepoint.com/forum/php.php



    Option 2a Example

    Columns
    Urls                               |             Keyword
    -----------------------------------------------------------------------
    devshed.com              |             forum
    -----------------------------------------------------------------------
    devshed.com              |             programming
    -----------------------------------------------------------------------
    warriorforum.com        |             money
    -----------------------------------------------------------------------
    warriorforum.com        |             forum



    Option 2b Example

    Columns
    Keywords                                |         Urls
    --------------------------------------------------------------------------------------------------------------------------
    forum                                     |         devshed.com/forum
    --------------------------------------------------------------------------------------------------------------------------
    forum                                     |         blackhat.com/forum
    ---------------------------------------------------------------------------------------------------------------------------
    php                                        |         devshed.com/forum/php.htm
    ---------------------------------------------------------------------------------------------------------------------------
    php                                        |         sitepoint.com/forum/php.php


    Question 1:
    I have a feeling you won't like Option 1a or 1b atall.
    But, let's assume you need to do it out of them 2 options. Which one would you choose ?

    Question 2:
    I have a feeling you will like Option 2a or 2b.
    Which one would you choose ?
    Or, if you don't like any of them 2. Then, let's assume you need to do it out of them 2 options. Which one would you choose ?

    Question 3:
    If you don't like the structure of any of the 4 options then which structure would you yourself use or have used ?
    Best to show an example like I did.


    Btw, I know that, if I structure my tbl around the way I showed in my examples then users would only be able to make queires for a single keyword and not a phrase. But, dealing with phrases get complicated and so for the time being, as a beginner, let's concentrate one thing at a time. Concentrate on the very first basic of indexing a url.

    Thanks
    Last edited by UniqueIdeaMan; January 9th, 2018 at 09:46 AM.
  2. #2
  3. Lord of the Dance
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Oct 2003
    Posts
    4,134
    Rep Power
    2011
    Have you considered the Apache Lucene and the projects based around that?
    Originally Posted by https://en.wikipedia.org/wiki/Apache_Lucene
    Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Originally Posted by MrFujin
    Have you considered the Apache Lucene and the projects based around that?
    Just googled and it seems it's in Java.
    I want to build my own php searchengine to make it better than google in terms of fetching results.
    I have to learn the basics of building a searchengine with php. Google's results have dropped 50% in quality for about 5yrs now.
    Anyway, how about answering my questions in my original post ?
    Last edited by UniqueIdeaMan; January 10th, 2018 at 01:11 AM.
  6. #4
  7. Code Monkey V. 0.9
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2005
    Location
    A Land Down Under
    Posts
    2,344
    Rep Power
    2063
    Originally Posted by UniqueIdeaMan
    I want to build my own php searchengine to make it better than google in terms of fetching results.
    The right tool for the right job.

    PHP/MySQL is not the right tool for a search engine back-end indexing and results retrieval system. It's way to slow, and just can't handle the data that way that you would need it to.

    That's why MrFujin was trying to point you towards a different solution that actually has a chance of working.

    If you really want to make the wrong decision and persist in PHP, try both types with large data sets, and see which one works faster. That's the only real way to figure it out. And remember that for something like this, unless you can scale your system to millions of individual data servers, it' will be uesless after indexing about 0.00000000000000000000000000000001% of the internet.
  8. #5
  9. Lord of the Dance
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Oct 2003
    Posts
    4,134
    Rep Power
    2011
    Talking about tools, one in that bucket is called pre-analysis.

    Just out of curiosity I found this interesting article/blog:
    https://www.linkedin.com/pulse/20140...ne-like-google
    (Don't forget to read the comments)

    Comments on this post

    • UniqueIdeaMan agrees : Rep for link.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Originally Posted by Catacaustic
    The right tool for the right job.

    PHP/MySQL is not the right tool for a search engine back-end indexing and results retrieval system. It's way to slow, and just can't handle the data that way that you would need it to.

    That's why MrFujin was trying to point you towards a different solution that actually has a chance of working.

    If you really want to make the wrong decision and persist in PHP, try both types with large data sets, and see which one works faster. That's the only real way to figure it out. And remember that for something like this, unless you can scale your system to millions of individual data servers, it' will be uesless after indexing about 0.00000000000000000000000000000001% of the internet.
    Thanks for explaining what Mr Fujin was driving at.
    But, last year, I dropped the idea to learn java when I saw I was struggling with one language (php). Wanted to learn java to build Android apps. Since I'm gonna try learning python for web development (because I think I won't become an expert in php for web development as it's too messy) then I thought I'll substitute python with java since with python you can also build desktop softwares (.exe) and Mobile Apps (Android apps) aswell as web apps or web scripts (cgi).
    In other words, you learn php and java for web development (but php better and so I got int php), you learn c/c++, c#, java, etc. to build desktop softwares and you learn java to build mobile apps. Seemed like with java you can build 3 types of apps and so was gonna get into it. But, I found a new and modern boy in town (python) and it is much easier to learn. And so, I quit the idea to learn java for python.
    Is not google written in python ? Python is slowly grabbing the market. I read in articles php would be exttinct in 10-15yrs and python would be the new master just like Usa and Soviet Union became new masters after ww2 and The British, French, German, Italian and the Spanish (European colonising powers) became extinct.
    Hail python! Lol!
    But, I have not heard that python is faster than java and so I'll give Mr Fujin and you some credits here thinking you guys know well than me that java is fastest of all.

    Yes, I know, a compiler as much faster than an Interpreter. Know the difference between them too.
    I once came across someone build a web search feature with c or c++ but that was nearly a dacade ago or so. I know, with compiling langs you need to write around 10 times more code than an Interpreter lang.
    I know different langs are built for different purposes. And, they say python is designed for all 3 purposes (web apps, desktop apps and mobile apps). I also know, they never sued to say "web apps" but "web scripts" because I used to use that terminology since around 2000. It's facebook that started using the word "web apps" and everyone started following suit around 3-4yrs now.
    See, I do know a little history of programming. Am new in learning programming but ain't new in fiddling with the codes (modify them to suit my purpose). Started all this fiddling with html first, then perl cgi around 2000 and then quit.
    I tried learning perl around 2000 or 2002 and quit as was too complicated. Tried learning basic then basic liberty too and quit them too as they were too much like the English language or human language and since you can say the same thing in the human English language more than one way I used to remember things in one human language way but few days later get confused which human language way I learnt it. Therefore, thought an algebraic type or "little less human lang" way would be better. Php suited me fine. It uses brackets "{" and "()" and so those parts were easy to learn.

    And so, this is heart-lightening news for you guys as you now understand I have a very little head start over a complete newbie.

    Btw, I'm getting the hint from you Catacaustic that, php is faster than java when the simultaneous queries are no more than a 100k. Otherwise, java is faster. How well did I do in guessing your hint ?
    Last edited by UniqueIdeaMan; January 12th, 2018 at 04:56 AM.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Mr Fujin & Catacaustic,

    You forgot to tell me which option I should stick to:

    1a
    1b
    2a
    2c.

    In another thread for another topic, I got the wind from Kicken that I should not do it the way 1a or 1b. I got this hint from others too outside this forum.
    So now, which one you 2 prefer, 2a or 2c ? I still have not got this answer from anywhere (in this world). Might have to ask the aliens in mars. Lol!
    Remember, I want to build my own searchengine. Whether it rocks and dominates the world or not is not important. What is important is that I should learn how to build one with php. A php project, it is.
  14. #8
  15. Banned (not really)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 1999
    Location
    Caro, Michigan
    Posts
    14,819
    Rep Power
    4554
    It makes no difference which one you choose as you've consistently demonstrated you have no idea what you're doing. Make a decision on your own for once.
    -- Cigars, whiskey and wild, wild women. --
  16. #9
  17. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Originally Posted by MrFujin
    Talking about tools, one in that bucket is called pre-analysis.

    Just out of curiosity I found this interesting article/blog:
    https://www.linkedin.com/pulse/20140...ne-like-google
    (Don't forget to read the comments)
    Thanks for the link. Checked it out now. Mmm seems like Google does not use Mysql or Sql atall.

    Also, googled:
    https://www.google.com/search?q=how+...hrome&ie=UTF-8

    Anyway, check this out:
    https://www.google.com/search?q=what...hrome&ie=UTF-8
    Last edited by UniqueIdeaMan; January 13th, 2018 at 08:12 AM.
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Originally Posted by Sepodati
    It makes no difference which one you choose as you've consistently demonstrated you have no idea what you're doing. Make a decision on your own for once.
    To me both 2a & 2b seems the same in terms of sql speed. Now, I have to decide which would be simpler for me in terms of coding.
    I forgot to mention before. I know google uses cluster servers.
    This img has been helpful. Look at the far right column. I thought at first, it lists how many times a keyword occurs on the webpage. I had forgotten to build such a column. Good thing I found the img. Anyway, enlargening the img I see this is not the case but it lists which document lists the said keywords. nevermind. A misunderstanding made me remembere to build a required column.
    Continuing my research. I'll keep this thread updated.
    Last edited by UniqueIdeaMan; January 13th, 2018 at 08:21 AM.
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2017
    Posts
    410
    Rep Power
    8
    Catacaustic & Sedopati,

    To prove that I am a genious thinker good enough to defeat google in terms of clever search features, consider this ....

    You search on google, you get presented links and page descriptions, you like the description of the 2nd link and click it. You now HAVE to wade through all that content to find that exact phrase that got you interested in the SERP.
    However, on my searchengine, the moment you click the link you'll get auto scrolled down to that particular spot on the page where your "interested phrase" occurs.
    That is just one service my searchengine will have an edge over my competitions. And, I still have not revealed all of them here. Nor have I finished thinking up all the hot features to embarasse all the existing searchengines.
    I hope by now you 2 will become my faithful audience.

    For the time being, have considered these columns:

    keyword, url, keyword count (on page),
    internal link count (on page), internal link count (on page) with keyword (sought keyword),
    external link count (on page),external link count (on page) with keyword (sought keyword),
    img count (on page), img count (on page) with keyword (sought keyword).

    The words in brackets for your explanation. They won't be part of the column name.
    My algorithm would use these data (counts) to determine the ranking.
    And no, it won't be like this where the more a keyword occurs on a page then the more score in ranking the page.
    I will teach the algorithm when to decide keyword stuffing has occured and how much bad ranking to give for that.
    If you have any better ideas then let me know. I might teach it to my crawler.

    Yeah, I know. From what I learnt from Kicken's suggestion is that, the less columns are related to each other (dependancy) the better. I forgot about that. Suddenly just remembered.
    Mmm.

    Thanks
    Last edited by UniqueIdeaMan; January 13th, 2018 at 09:00 AM.
  22. #12
  23. Lord of the Dance
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Oct 2003
    Posts
    4,134
    Rep Power
    2011
    To prove that I am a genious thinker good enough to defeat google in terms of clever search features, consider this ....
    Consider this ...
    name one advantage how MySQL helps your project, which can't be solved by other solutions?
    What are the disadvantages?
  24. #13
  25. Code Monkey V. 0.9
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2005
    Location
    A Land Down Under
    Posts
    2,344
    Rep Power
    2063
    Originally Posted by UniqueIdeaMan
    To prove that I am a genious thinker good enough to defeat google in terms of clever search features, consider this ....
    OK...

    Originally Posted by UniqueIdeaMan
    You search on google, you get presented links and page descriptions, you like the description of the 2nd link and click it. You now HAVE to wade through all that content to find that exact phrase that got you interested in the SERP.
    However, on my searchengine, the moment you click the link you'll get auto scrolled down to that particular spot on the page where your "interested phrase" occurs.
    So what happens when there's multiple occurances of that keyword/phrase? Which one do you scroll to? What if the first one is in the title, but the one that I actually want to see is 700 words down the page? What happens when the page doesn't contain the actual phrase that I searched for, but another phrase with the same meaning or intent (Google does this a lot)? What happens when the phrase is part of an image?

    Originally Posted by UniqueIdeaMan
    For the time being, have considered these columns:

    keyword, url, keyword count (on page),
    internal link count (on page), internal link count (on page) with keyword (sought keyword),
    external link count (on page),external link count (on page) with keyword (sought keyword),
    img count (on page), img count (on page) with keyword (sought keyword).
    So you're keeping a record for every keyword on every page? Wow. What about key phrases too? What about "like" phrases"? What about miss-spellings? Just imagine how much storage you'll need to index that much for 0.0000001% of the internet. (hint: you can't afford it!)

IMN logo majestic logo threadwatch logo seochat tools logo