#1
  1. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352

    Paging Extremely large result sets


    WARNING: This thread is 7+ years old. I have had 4 other jobs since I asked this question and do not need the answer to it.

    Hello all,
    I am having a problem with the paging of enormous result sets in MySQL.

    I have made a search engine with more than 5,000,000 records in a MySQL database table with a FULLTEXT index. The user(s) search the index and then I display a paginated set of results. Unfortunately, some search terms return more than 1,000,000 results. The COUNT(*) on the table to get the total number of results takes upwards of 30 seconds, which is far too long.

    Are there any tricks that you guys know of that can help me with this? Any kind of estimation tools or secret counting functions that aren't slow? I know that even Google estimates their search results, maybe I'll have to do the same thing.
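    [Editor's note: if an estimate is acceptable, the display side is trivial. A hypothetical helper (not from the original thread) that rounds whatever count or estimate you do get into a Google-style label, so users never see a suspiciously exact 1,234,567:]

```php
<?php
// Hypothetical helper: round an estimated row count to two significant
// digits for a Google-style "About N results" label. Small counts are
// cheap to get exactly, so show those as-is.
function formatEstimatedCount(int $estimate): string
{
    if ($estimate < 1000) {
        return number_format($estimate) . " results";
    }
    // Keep two significant digits, zero out the rest.
    $magnitude = pow(10, floor(log10($estimate)) - 1);
    $rounded   = floor($estimate / $magnitude) * $magnitude;
    return "About " . number_format($rounded) . " results";
}

echo formatEstimatedCount(1234567) . "\n"; // About 1,200,000 results
echo formatEstimatedCount(523) . "\n";     // 523 results
```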

    My table structure is:
    Code:
    CREATE TABLE `SearchTableTerms` (
      `SearchTableTermsId` int(11) NOT NULL auto_increment,
      `ItemId` int(11) NOT NULL,
      `SearchTerms` text,
      PRIMARY KEY  (`SearchTableTermsId`),
      KEY `ItemId` (`ItemId`),
      FULLTEXT KEY `SearchTerms` (`SearchTerms`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8
    I have tried the following things ($db_conn is an ADOdb connection object):
    1)
    PHP Code:
    $rs = $db_conn->Execute("SELECT ItemId FROM SearchTableTerms WHERE MATCH(SearchTerms) AGAINST ('+user +entry' IN BOOLEAN MODE) LIMIT 25");
    $recordCount = $db_conn->GetOne("SELECT COUNT(*) FROM SearchTableTerms WHERE MATCH(SearchTerms) AGAINST ('+user +entry' IN BOOLEAN MODE)"); 
    2)
    PHP Code:
    $rs = $db_conn->Execute("SELECT SQL_CALC_FOUND_ROWS ItemId FROM SearchTableTerms WHERE MATCH(SearchTerms) AGAINST ('+user +terms' IN BOOLEAN MODE) LIMIT 25");
    $recordCount = $db_conn->GetOne("SELECT FOUND_ROWS()"); 
    3)
    PHP Code:
    $rs = $db_conn->Execute("SELECT ItemId FROM SearchTableTerms WHERE MATCH(SearchTerms) AGAINST ('+user +terms' IN BOOLEAN MODE)");
    $recordCount = $rs->RecordCount(); 

    All of those are too slow (they're organized in order of speed, fastest first). Any other tips?

    Thanks in advance,
    Dan
    Last edited by ManiacDan; March 25th, 2013 at 10:42 AM. Reason: Added disclaimer
  2. #2
  3. Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jun 2002
    Location
    close to the edge
    Posts
    1,035
    Rep Power
    69
    This is a MySQL question and should be moved to the appropriate forum!
  4. #3
  5. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    I have posted in the MySQL forum, but no one seems to know. I was hoping that maybe there was a way to do it quicker programmatically in PHP. Unfortunately, it seems that "you lose" is the only answer I'm going to get. I guess I'll start caching the search terms or doing some sort of crazy estimate system.
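    [Editor's note: for the "caching the search terms" idea, the key step is normalizing the raw query so equivalent searches share one cached count. A hypothetical sketch (names illustrative; the cache backend, a DB table or memcached, is up to you):]

```php
<?php
// Hypothetical sketch: normalize a raw search string into a stable
// cache key so "Entry  User" and "user entry" hit the same cached count.
function searchCacheKey(string $rawQuery): string
{
    // Lowercase, split on whitespace, drop empties, sort so term order
    // doesn't matter, then hash to a fixed-length key.
    $terms = preg_split('/\s+/', strtolower(trim($rawQuery)), -1, PREG_SPLIT_NO_EMPTY);
    sort($terms);
    return md5(implode(' ', $terms));
}

var_dump(searchCacheKey('Entry  User') === searchCacheKey('user entry')); // bool(true)
```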

    Thanks for taking the time to read it, at least.


    -Dan
  6. #4
  7. (retired)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2003
    Location
    The Laboratory
    Posts
    10,101
    Rep Power
    0
    Please don't cross-post.

    You're after SQL_CALC_FOUND_ROWS and FOUND_ROWS().

    --Simon
  8. #5
  9. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    Sorry Simon, I thought there might be a php workaround where I didn't have to use the built-in MySQL stuff.

    Unfortunately I am NOT looking for SQL_CALC_FOUND_ROWS, that is much too slow.

    I am actually doing a query with a join to a table that has the actual Item Data (which relates to ItemId in the original table).

    Code:
    SELECT SQL_CALC_FOUND_ROWS 
    ItemId 
    FROM 
    SearchTableTerms 
    JOIN SearchTableData USING (ItemId) 
    WHERE MATCH(SearchTerms) AGAINST ('+picasso' IN BOOLEAN MODE) LIMIT 25;
    --23 seconds

    Code:
    SELECT 
    ItemId 
    FROM 
    SearchTableTerms 
    JOIN SearchTableData USING (ItemId) 
    WHERE MATCH(SearchTerms) AGAINST ('+picasso' IN BOOLEAN MODE) LIMIT 25;
    --0.01 seconds

    The solution I have come up with is just to search the table, display the results, then do the counts and use JavaScript to display the paging information. Dirty, and not compatible with all browsers, but my company's policy is not to care about anything other than IE, so it will have to do for now.

    Also, my tables are too large for a MEMORY table type, and the queries are too unique to be cached. Goody.

    Thanks for your help, I'll try to get a SQL specialist to look at my my.cnf in case there's some setting in there that could improve my counting speed.


    -Dan
  10. #6
  11. Trapped on the forums...help
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2003
    Location
    /Users/edman007
    Posts
    4,617
    Rep Power
    906
    Use EXPLAIN on the query; that will tell you which indexes are being used. Then you can try fooling around with the query and table to see if you can make it use the right index.

    BTW, the SQL_CALC_FOUND_ROWS and FOUND_ROWS() suggested by Simon should be the fastest working solution: it counts the rows while running the same query, so the follow-up SELECT FOUND_ROWS() doesn't even make MySQL look at the table, whereas a COUNT(*) in a separate query effectively runs the same query twice, which is much slower.
  12. #7
  13. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    Thank you for the tip, edman007. Unfortunately I need to do it in two queries. Using SQL_CALC_FOUND_ROWS slows down the initial query too much. If I do it that way, the procedure is:
    1) User searches
    2) Search for results with SQL_CALC_FOUND_ROWS (25+ seconds)
    3) Display results along with counts

    That's too slow. My way:
    1) User searches
    2) Search for results
    3) Display results
    4) COUNT results
    5) Display counts (javascript)


    That way, the user can browse the results while the count is being run. It's messy and very sub-optimal, but there's not much else I can do.
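    [Editor's note: the deferred-count flow above still needs the usual pagination arithmetic once the count arrives. A minimal sketch, with hypothetical names, of the LIMIT/OFFSET and page-link math:]

```php
<?php
// Plain pagination arithmetic for the deferred-count flow: the result
// page renders immediately with LIMIT/OFFSET, and the page links are
// filled in once the slow COUNT(*) comes back. Names are illustrative.
function pagingInfo(int $totalRows, int $perPage, int $currentPage): array
{
    $totalPages  = max(1, (int) ceil($totalRows / $perPage));
    $currentPage = min(max(1, $currentPage), $totalPages); // clamp to range
    return [
        'offset'     => ($currentPage - 1) * $perPage, // for LIMIT ?, 25
        'totalPages' => $totalPages,
        'page'       => $currentPage,
    ];
}

print_r(pagingInfo(1000001, 25, 3)); // offset 50, 40001 total pages
```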

    I will include an EXPLAIN for a common search term:

    Code:
    mysql> EXPLAIN SELECT DISTINCT ItemId, FileName, Title, Creator, AuctionDate, SellPrice, Year, ShortDescription, Service, MintMark, `Condition`, Size FROM SearchTableData JOIN SearchTableTerms USING (ItemId) JOIN CategoryAncestor ON SearchTableData.CategoryId = CategoryAncestor.Category WHERE DeleteFlag = 0 AND (AuctionDate IS NULL OR (AuctionDate > '1970-01-01 00:00:00' AND AuctionDate < NOW())) AND MATCH(SearchTerms) AGAINST ('+franklin' IN BOOLEAN MODE) LIMIT 25;
    +----+-------------+------------------+----------+----------------------------------------------+-------------+---------+-------------------------------+------+------------------------------+
    | id | select_type | table            | type     | possible_keys                                | key         | key_len | ref                           | rows | Extra                        |
    +----+-------------+------------------+----------+----------------------------------------------+-------------+---------+-------------------------------+------+------------------------------+
    |  1 | SIMPLE      | SearchTableTerms | fulltext | ItemId,SearchTerms                           | SearchTerms | 0       | NULL                          |    1 | Using where; Using temporary |
    |  1 | SIMPLE      | SearchTableData  | ref      | IX_SearchTable,ItemId,DeleteFlag,AuctionDate | ItemId      | 4       | GP.SearchTableTerms.ItemId    |    1 | Using where                  |
    |  1 | SIMPLE      | CategoryAncestor | ref      | Category                                     | Category    | 4       | GP.SearchTableData.CategoryId |    1 | Using index; Distinct        |
    +----+-------------+------------------+----------+----------------------------------------------+-------------+---------+-------------------------------+------+------------------------------+
    3 rows in set (0.00 sec)

    I don't think there should be any speed problems with this table; my server is a beast (4GB RAM, 16 x 2.0GHz processors).

    Any ideas?

    -Dan
  14. #8
  15. No Profile Picture
    Distributing Loser
    Devshed Novice (500 - 999 posts)

    Join Date
    Jan 2006
    Posts
    886
    Rep Power
    201
    I'm no DB genius, but I'd have thought a quicker way to do it would be this: have a table with a row for each unique word used in the database. Generally speaking, that takes you down to 70,000 or so standard dictionary words, so even with another 70,000 slang or domain-specific words you're only searching 140,000 rows. Bonus. Normalize it with another table that correlates them, basically a couple of columns recording that the word reference id is N and the occurrence in the fulltext is row N. Perform a join across them, and you'll know how many times the word occurs in the text, so you can weight the output.

    Perhaps this doesn't scale well, though. I've never tried it with something a million rows large.
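    [Editor's note: the word table described above would be fed by tokenizing each record's text and counting occurrences, which is essentially the inverted structure MySQL's FULLTEXT index builds internally. A hypothetical tokenizer sketch:]

```php
<?php
// Rough sketch of the word-table idea: split a record's text into
// lowercase word tokens and count each one. These (word, count) pairs
// are what would populate the word/occurrence join tables.
function wordFrequencies(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);
}

print_r(wordFrequencies('Superman vs. Superman: the poster'));
// superman => 2, vs => 1, the => 1, poster => 1
```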
  16. #9
  17. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    That's basically how a fulltext index works, as far as I know.

    Unfortunately, I have more than 200,000 unique words in the database, and the average SearchTerms length is 50 words.

    50 words x 10,000,000 records = 500,000,000 lookup rows.

    Not to mention the fact that the user can search for more than one word at a time. Yikes.

    Just so everyone knows, the search in question is performed against an enormous database of collectible items. Art, coins, currency, comic books, movie posters, etc. So I have items ranging from Renoir's Chapeau Epingle to Superman to Bridge Over the River Kwai. Doesn't bode well for unique words.

    I think we're going to have to purchase a search engine from someone. I hate that idea.

    -Dan
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    1
    Rep Power
    0
    Originally Posted by ManiacDan
    I think we're going to have to purchase a search engine from someone. I hate that idea.

    -Dan
    Derp!
  20. #11
  21. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,908
    Rep Power
    6352
    Sneaky devil. Thread closed.
    Last edited by ManiacDan; April 9th, 2013 at 01:07 PM.
    HEY! YOU! Read the New User Guide and Forum Rules

    "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin

    "The greatest tragedy of this changing society is that people who never knew what it was like before will simply assume that this is the way things are supposed to be." -2600 Magazine, Fall 2002

    Think we're being rude? Maybe you asked a bad question or you're a Help Vampire. Trying to argue intelligently? Please read this.
