#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    3
    Rep Power
    0

    Arrow De-duplicate 'smart' random row query?


    Hi guys,

    following some research, I found a query suitable for my needs: it randomly returns IDs from the table. The ID field is an auto-increment, so there are no holes.

    Code:
    SELECT `mydb`.`myTable`.id
    FROM   (SELECT Floor (Rand() * (SELECT Count(*) 
    								FROM   `mydb`.`myTable`)) num, 
    			   @num := @num + 1 
    		FROM   (SELECT @num := 0) a, 
    			   `mydb`.`myTable` 
    		LIMIT  2000000) b, 
    	   `mydb`.`myTable` 
    WHERE  b.num = `mydb`.`myTable`.id
    The issue I experience is that the target table (myTable) contains 30-400M records, depending on the situation. In the LIMIT, I want to retrieve 2M randomly selected IDs, however I get a lot of duplicates (which is expected).

    Is it possible to de-duplicate the query and yet receive 2M records? I thought to create a table and let it manage the UNIQUE values, but again I will get less than expected.

    Any thoughts? Many thanks!
  2. #2
  3. No Profile Picture
    Problem Solver
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jan 2001
    Location
    Stockholm, Sweden
    Posts
    4,495
    Rep Power
    534
    If you want to do it this fast way by first selecting id's and then run a distinct on them the only suggestion I can give is to use two LIMIT.
    One for the inner SELECT with a slightly higher LIMIT to compensate for the "loss" of duplicates (depending on how many duplicates you usually have in the subset, which is entirely dependent on your data.

    And then the outer LIMIT part of the SELECT DISCTINCT chooses the exact amount of rows you want returned.
    /Stefan

IMN logo majestic logo threadwatch logo seochat tools logo