#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0

    Post Best performance search query for huge number of records


    Hello,

    I need to know what's the best way to search tables with million records. I have a table called <people> and this table has the column: <status>

    Status might be more than one word like:

    "I am happy today and the weather is nice".

    Also, it might be in different languages.

    Records in table are represented by UTF-8.

    I need to search for a sub-word, word, or even some words out of the whole sentence like for example (according the status above):

    Search 1: keyword = "ppy"

    Search 2: keyword = "am weather"

    Search 3: keyword = "nice"

    Search 4: keyword = "day weath"

    I would greatly appreciate if you hint me to the best method to apply a robust search. As far as I know using "LIKE" is not practical for huge records. I heard about the full text index but never used that.

    Thank you very much!

    Cheers,
  2. #2
  3. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    Read this wiki for some more information.

    But basically you only have two options:
    1. Use FULLTEXT index
    or
    2. use a stand alone software search engine (there are a lot of them see wiki above for some examples).

    Just notice that you might have to pay for the performance of fulltext searches with a few disadvantages like: common words are basically ignored in searches, you might get false positives, etc.
    /Stefan
  4. #3
  5. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    Originally Posted by Moderns
    I heard about the full text index but never used that.
    you should try it
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    Originally Posted by sr
    Read this wiki for some more information.

    But basically you only have two options:
    1. Use FULLTEXT index
    or
    2. use a stand alone software search engine (there are a lot of them see wiki above for some examples).

    Just notice that you might have to pay for the performance of fulltext searches with a few disadvantages like: common words are basically ignored in searches, you might get false positives, etc.
    Thank you very much for your clarification. I will go for the full text index search. I would appreciate if you clear the right equivalent statement in full text search for the below:

    $sql = ' SELECT people.status, people.id FROM people WHERE
    people.status LIKE "%'.$keySearch.'%" AND people.id != "'.$this->myId.'" ORDER BY people.time ASC LIMIT '.$limit;

    Also, what shall I do to the table in mysql? I never worked on this so your detailed answer is much appreciated.

    Thanks a lot.
  8. #5
  9. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    first thing you have to do is create a FULLTEXT index on the column(s) you want to search

    then compose the query using MATCH syntax, which you will find explained in the manual
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    Originally Posted by r937
    first thing you have to do is create a FULLTEXT index on the column(s) you want to search

    then compose the query using MATCH syntax, which you will find explained in the manual
    Thank you! But shall I use the Boolean search? Could you please hint me about the correct statement?

    One more thing, will it search for partial segments of words?
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    I have changed the column "status" to full text index and the engine is MyISAM. I have tried the below sql but it is not working I am getting the error:
    "test" AND people.id != "'.$this->myId.'" ORDER BY people.time ASC LIMIT '.$limit;

    You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near "test"

    $keySearch="test";
    $limit=10;
    $this->myId=1;

    $sql = ' SELECT people.status, people.id FROM people WHERE MATCH (people.status) AGAINST "'.$keySearch.'" AND people.id != "'.$this->myId.'" ORDER BY people.time ASC LIMIT '.$limit;

    mysql_query($sql);

    echo mysql_error();


    Would you please advise?

    Thanks.
  14. #8
  15. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    please test it in mysql first, without php
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    I have something strange!

    In my table people, I have the status for each person as below:

    id status
    -- ------
    1 hello
    2 hello
    3 hello
    4 hello
    5 hello

    When I search using the SQL below:

    SELECT people.status FROM people WHERE

    MATCH (people.status) AGAINST ("hello*" IN BOOLEAN MODE)

    HAVING people.id!= 1

    ORDER BY status_time ASC LIMIT 10


    I get ZERO rows! I should get 4 rows!

    BUT:

    If the table people like below:

    id status
    -- ------
    1 sello
    2 sello
    3 sello
    4 sello
    5 sello

    And if I change the keyword for search to "sello", then I get 4 rows! Why!! Below is the SQL:

    SELECT people.status FROM people WHERE

    MATCH (people.status) AGAINST ("sello*" IN BOOLEAN MODE)

    HAVING people.id!= 1

    ORDER BY status_time ASC LIMIT 10


    Is there any restriction on the "hello" in mysql?

    Thanks.
  18. #10
  19. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    Originally Posted by r937
    Why when having all records set to "hello" in the table, we can't find the search result for "hello" WHILE:

    when having all records set to "sello" in the table, then we can get 4 hits when we search about "sello"

    I am looking for a logical clarification! Both are two words! Difference is one letter!
  22. #12
  23. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,280
    Rep Power
    4279
    Originally Posted by Moderns
    Difference is one letter!
    doesn't matter

    results are bizarre

    because there are so few rows in the table
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    Originally Posted by r937
    doesn't matter

    results are bizarre

    because there are so few rows in the table
    What do you mean by "bizarre"? Random? You mean some words will accepted and others not?
  26. #14
  27. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    Originally Posted by Moderns
    What do you mean by "bizarre"? Random? You mean some words will accepted and others not?
    Correct!
    Here's the ignore list of really common words.
    And for words _not_ on this list you still get odd results when there are only few rows in the table.

    Normal queries in databases is: "Give me all records that match this criteria"
    Fulltext queries in databases are more fuzzy like: "Give me the _best_ match for what we are looking for".

    Execution of fulltext search criterias actually returns a floating point value of how well a specific row matches the expression and the result set rows are sorted by this number.

    So the cost for getting good performance for large amounts of text is that they behave fuzzy and random for small amounts. Because statistics are greatly skewed when there is a very limited amount of values in a set.
    /Stefan
  28. #15
  29. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2010
    Posts
    95
    Rep Power
    0
    Originally Posted by sr
    Correct!
    Here's the ignore list of really common words.
    And for words _not_ on this list you still get odd results when there are only few rows in the table.

    Normal queries in databases is: "Give me all records that match this criteria"
    Fulltext queries in databases are more fuzzy like: "Give me the _best_ match for what we are looking for".

    Execution of fulltext search criterias actually returns a floating point value of how well a specific row matches the expression and the result set rows are sorted by this number.

    So the cost for getting good performance for large amounts of text is that they behave fuzzy and random for small amounts. Because statistics are greatly skewed when there is a very limited amount of values in a set.
    Brilliant! Thank you very much for the detailed clarification! Cheers.

IMN logo majestic logo threadwatch logo seochat tools logo