#1
  1. 300lb Bench!
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Aug 2001
    Location
    New York
    Posts
    2,353
    Rep Power
    62

    Looking for an efficient way to delete rows based on another table


    So here's the deal. I have a script that runs every day that downloads a feed file and adds the new data into a table (innodb).

    table1
    ===========
    event_name
    date
    times
    (other fields)
    unique_id

    For every event, there will be multiple date/times combos. Let's say, for example

    2013-09-06 6pm
    2013-09-06 8:30pm
    2013-09-07 6pm
    2013-09-08 4pm

    So tomorrow, the feed will have entries for the 7th, 8th, 9th and maybe the 10th, but it will no longer have entries for the 6th, since the day has passed.

    The problem is, entries for the 6th will still be in table1. So one solution would be to create a temporary table every time the feed script runs that has the exact same schema as table1. After loading all of the data in table1, also load it into the temp table. table1_tmp will only have the newest data. Now, I can do something like

    delete from table1 where unique_id not in (select unique_id from table1_tmp);

    and all of the data not in the current feed will then be deleted from table1. The above approach works great when the tables have a few thousand rows in them. It's simply too slow for a few hundred thousand rows. In my case, table1 might have 500K rows and table1_tmp might have 300K rows. Using the above approach, I tried the delete statement and 22 hours later it was still running! And the command didn't hang. I looked at the processlist and it said it was still running just fine. Sure enough, when I do it with tables that only have a few thousand rows in it, it's much faster (maybe only takes an hour). And if I'm deleting rows where the unique_id *is* in the subquery

    delete from table1 where unique_id in (select unique_id from table1_tmp);

    it's *much* faster (maybe a few minutes), but that's not what I'm looking to do. I have a lot of rows and I'm looking to delete from the main table rows that are not in the temp table. Considering this will be running every day, I'd like the delete command to finish in a reasonable time (maybe an hour).

    One solution I see is to simply run everything in a transaction, truncate the table and just load everything in the feed file into it every day. Since there's referential integrity, I'd set all foreign key checks to 0 first, then set it back to 1 afterward. This would probably make the whole process take no more than 30 seconds, but it just seems like a drastic approach. If there's nothing else, I think I might take this approach. Anyway, if anyone has suggestions, please let me hear them.

    ******************
    Edit: Btw, I'm sure some of you will ask, why not just perform the simple
    delete from table1 where date < (today's date)

    There's only one problem with this. Sometimes, the feed file has errors in it. So you might have something like

    some event 2013-09-11 6pm
    some event 2013-09-11 7pm
    some event 2013-09-11 8:30pm

    The problem is, the vendor made a mistake and the second entry is really supposed to be 6:30pm. The very next day, they correct their mistake the new feed file looks like

    some event 2013-09-11 6pm
    some event 2013-09-11 6:30pm
    some event 2013-09-11 8:30pm

    If I just delete old entries, I have no mechanism to fix the second entry. If I dump everything into a temp table, however, when the new feed runs, table1 will look like

    some event 2013-09-11 6pm
    some event 2013-09-11 6:30pm
    some event 2013-09-11 7pm
    some event 2013-09-11 8:30pm

    and table1_tmp will look like

    some event 2013-09-11 6pm
    some event 2013-09-11 6:30pm
    some event 2013-09-11 8:30pm

    and when I delete from the first table rows that are not in the second table, since the 7pm listing is not in the temp table, it gets deleted.
    Last edited by colpaarm; September 6th, 2013 at 04:43 PM.
    Correspondence chess
    nothingbutchess.com

IMN logo majestic logo threadwatch logo seochat tools logo