March 11th, 2013, 05:03 PM
De-duplicate 'smart' random row query?
following some research, I found a query suitable for my needs: it randomly returns IDs from the table. The ID field is an auto-increment, so there are no holes.
The issue I experience is that the target table (myTable) contains 30-400M records, depending on the situation. In the LIMIT, I want to retrieve 2M randomly selected IDs, however I get a lot of duplicates (which is expected).
FROM (SELECT Floor (Rand() * (SELECT Count(*)
FROM `mydb`.`myTable`)) num,
@num := @num + 1
FROM (SELECT @num := 0) a,
LIMIT 2000000) b,
WHERE b.num = `mydb`.`myTable`.id
Is it possible to de-duplicate the query and yet receive 2M records? I thought to create a table and let it manage the UNIQUE values, but again I will get less than expected.
Any thoughts? Many thanks!
March 13th, 2013, 05:33 AM
If you want to do it this fast way by first selecting id's and then run a distinct on them the only suggestion I can give is to use two LIMIT.
One for the inner SELECT with a slightly higher LIMIT to compensate for the "loss" of duplicates (depending on how many duplicates you usually have in the subset, which is entirely dependent on your data.
And then the outer LIMIT part of the SELECT DISCTINCT chooses the exact amount of rows you want returned.