#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2002
    Posts
    32
    Rep Power
    15

    ISAM table locking issues


    Hey-

    The situation:

    I have a table that handles confirmations. It usually has 300k+ rows. It serves two purposes: keeping track of user pickups/reads, and providing the raw data for aggregate statistics.

    On pickup, the appropriate record receives an UPDATE call.

    Every hour, a script compiles aggregate statistics on these confirmations, doing a series of reads. This data is then written to a cache table for end-user reads.

    The queries:
    Pickup: An update is run with MsgID as the key.

    Caching:
    query 1:
    select count(*) as total from [confirmations] where campaignID = [int] AND Received IS NOT NULL

    query 2:
    select distinct h.clientID from [Hits] h, [confirmations] c where h.clientID = c.clientID AND c.Received IS NOT NULL AND h.date > [int] AND h.date < [int] AND c.campaignID = [int]

    The [confirmations] table currently has msgid as primary key, and an index on clientID.
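
    Roughly, the layout looks like this (simplified sketch; the real column types may differ):

      -- hypothetical sketch of the table described above
      CREATE TABLE confirmations (
          MsgID      INT UNSIGNED NOT NULL,   -- primary key; pickup UPDATEs key on this
          ClientID   INT UNSIGNED NOT NULL,   -- indexed; joined against [Hits]
          CampaignID INT UNSIGNED NOT NULL,   -- filtered on by the caching queries
          Received   VARCHAR(20) NULL,        -- NULL until the message is picked up
          PRIMARY KEY (MsgID),
          INDEX idx_client (ClientID)
      ) TYPE = MyISAM;                        -- ENGINE = MyISAM on newer versions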


    Now, these queries are creating some serious problems. Basically, whenever the caching reads happen, they lock out updates to the table. The caching reads take a while to execute, and this unexpectedly blocks the pickup updates from running.

    My question is two-fold:
    1) Indexes. I assume creating two new indexes, one on campaignID alone and one on campaignID and Received, would speed up the selects. If so, in which order do I need to define the multi-column index: index_name(CampaignID, Received) or index_name(Received, CampaignID)? And since Received can and will hold NULL values, does this create a problem?

    2) Why is the table locking updates? As far as I can tell, the default isolation level is "Read Committed". Shouldn't you still be able to read rows that are not being updated at the time the read reaches them? Do you need a "dirty read" for this?

    I basically need the read to happen regardless of updates, and it must not block the updates while it runs; some inaccuracy is acceptable, since the caching is scheduled and not real-time-critical.


    What are some solutions to this? The following have come to mind:

    InnoDB table for the confirmations:
    With row-level locking, I should be good to go. The reads won't lock the table against updates.

    Separating each confirmation/campaignID into separate tables:
    This would reduce the largest table size to 200k entries or fewer, for now. Additionally, the caching script should only lock one table at a time, since it does one campaignID at a time. I am concerned about the scalability of this approach, since it only reduces the number of rows for now. If the rows double, then what?


    Any observations, suggestions, ideas?
  2. #2
  3. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Oct 2002
    Location
    Binghamton, NY
    Posts
    536
    Rep Power
    28

    Re: ISAM table locking issues


    Originally posted by badhaiku.dk
    Hey-

    The situation:

    I have a table that handles confirmations. It usually has 300k+ rows. It serves two purposes: keeping track of user pickups/reads, and providing the raw data for aggregate statistics.

    On pickup, the appropriate record receives an UPDATE call.

    Every hour, a script compiles aggregate statistics on these confirmations, doing a series of reads. This data is then written to a cache table for end-user reads.

    The queries:
    Pickup: An update is run with MsgID as the key.

    Caching:
    query 1:
    select count(*) as total from [confirmations] where campaignID = [int] AND Received IS NOT NULL

    query 2:
    select distinct h.clientID from [Hits] h, [confirmations] c where h.clientID = c.clientID AND c.Received IS NOT NULL AND h.date > [int] AND h.date < [int] AND c.campaignID = [int]

    The [confirmations] table currently has msgid as primary key, and an index on clientID.
    Assuming all of your indexes are proper, both of these queries should be very quick. Are you only running them once per hour (not once per user or anything similar)? Also, how many rows are they returning? I don't know the layout of your data, but assuming they aren't returning a lot of rows, they should be fast.

    Now, these queries are creating some serious problems. Basically, whenever the caching reads happen, they lock out updates to the table. The caching reads take a while to execute, and this unexpectedly blocks the pickup updates from running.

    My question is two-fold:
    1) Indexes. I assume creating two new indexes, one on campaignID alone and one on campaignID and Received, would speed up the selects. If so, in which order do I need to define the multi-column index: index_name(CampaignID, Received) or index_name(Received, CampaignID)? And since Received can and will hold NULL values, does this create a problem?
    For the first query, an index on (CampaignID, Received) should be good. There is no reason to create an index on CampaignID as well, since MySQL can use this two-column index for any query containing only CampaignID. NULLs do make the indexes and searches slower (and take up more disk space), but I doubt you would notice any difference with so few rows. I personally try to avoid NULLs whenever possible anyway, since they tend to be a pain to work with. From the name, Received sounds like a flag of some sort? Maybe try setting a default value instead of NULL? (This is just conjecture again, since I don't know your tables.)
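
    Something like this (a sketch; the index name and the placeholder value are made up):

      -- add the combined index; its leftmost prefix (CampaignID) also
      -- serves queries that filter on CampaignID alone
      ALTER TABLE confirmations ADD INDEX idx_campaign_received (CampaignID, Received);

      -- query 1 can then resolve its WHERE clause through the index:
      SELECT COUNT(*) AS total FROM confirmations
      WHERE CampaignID = 42 AND Received IS NOT NULL;

      -- and a query on CampaignID alone uses the same index's left prefix:
      SELECT COUNT(*) FROM confirmations WHERE CampaignID = 42;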

    2) Why is the table locking updates? As far as I can tell, the default isolation level is "Read Committed". Shouldn't you still be able to read rows that are not being updated at the time the read reaches them? Do you need a "dirty read" for this?
    MyISAM tables (or ISAM; I hope you are using MyISAM) do not have isolation levels like this. Instead they use table-level locks (advantageous in some circumstances, bad in others). When you run a SELECT statement, it applies a READ lock to the table. This means that anyone else can read from the table, but no one can update it until the read is finished. An UPDATE statement uses a WRITE lock, which blocks all other threads until it is done.
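
    To illustrate with two concurrent sessions (a sketch; the locks are taken implicitly, no LOCK TABLES involved, and the values are placeholders):

      -- session 1: while this runs, it holds a READ lock on the whole table
      SELECT COUNT(*) FROM confirmations
      WHERE CampaignID = 42 AND Received IS NOT NULL;

      -- session 2: this needs a WRITE lock on the same table, so it queues
      -- up and waits until session 1's read has finished
      UPDATE confirmations SET Received = NOW() WHERE MsgID = 12345;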

    I basically need the read to happen regardless of updates, and it must not block the updates while it runs; some inaccuracy is acceptable, since the caching is scheduled and not real-time-critical.

    What are some solutions to this? The following have come to mind:

    InnoDB table for the confirmations:
    With row-level locking, I should be good to go. The reads won't lock the table against updates.
    Yes, one solution could be to use InnoDB with its non-locking reads. This would be the easiest option, although most likely not the highest-performance one.
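
    The switch itself is a one-liner (sketch; older MySQL versions use TYPE instead of ENGINE):

      -- convert the table; plain SELECTs then read a consistent snapshot
      -- instead of taking a table-level READ lock
      ALTER TABLE confirmations ENGINE = InnoDB;
      -- on MySQL 4.0 and earlier: ALTER TABLE confirmations TYPE = InnoDB;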

    Separating each confirmation/campaignID into separate tables:
    This would reduce the largest table size to 200k entries or fewer, for now. Additionally, the caching script should only lock one table at a time, since it does one campaignID at a time. I am concerned about the scalability of this approach, since it only reduces the number of rows for now. If the rows double, then what?


    Any observations, suggestions, ideas?
    I would suggest that you optimise your select statements first. Make sure you have the appropriate indexes, and I do not see any problems other than that. Try to use EXPLAIN to see how they are executed and make them as fast as possible. Feel free to ask here if you can't get them to be that fast. This should solve the locking issues.
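
    For example (placeholder value; the columns to watch in the output are type, key, and rows):

      EXPLAIN SELECT COUNT(*) AS total FROM confirmations
      WHERE CampaignID = 42 AND Received IS NOT NULL;
      -- you want key to show your index and the rows estimate to be far
      -- below the table size; type = ALL means a full table scan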

    Otherwise you could try using InnoDB, which would most likely solve the problem easily. But this is more of a patch than a fix for the real problem of your queries not being indexed properly.
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2002
    Posts
    32
    Rep Power
    15
    Thanks for the response. It helps to know that I am on the right track at least =)

    Last night, I went through and added a campaignID index and a (campaignID, Received(15)) index, and the performance gain was (as expected) huge. I'd expect the indexes to slow down inserts a tad, but I don't expect this to be noticeable. We do run 165k looped inserts on the table on occasion, but since they're separate queries, this shouldn't tie things up to the point where other things get held up.
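
    Something along these lines (a sketch of what I ran; the index names here are made up):

      ALTER TABLE confirmations
          ADD INDEX idx_campaign (CampaignID),
          ADD INDEX idx_campaign_received (CampaignID, Received(15));
      -- Received(15) indexes only the first 15 bytes of the column,
      -- which keeps the index smaller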

    The two queries in question are run once per hour for each "project". We generally have about 10-20 projects at a time, so 20-40 executed queries in total. Speeding them up seemed to significantly reduce the length of time the table was locked. Once again, since they're looped queries, MySQL can sneak other updates in between, keeping the processlist flowing along.

    All in all, the script takes about 35 seconds to run at this time. The next holdup would be the joined queries, and optimizing the [hits] table.

    As for the indexing, does it matter in which order I reference campaignID and Received in the second query, relative to the order of the index columns? Or does it not matter? I know (and was reminded, thanks) that I can use parts of the index from the left (i.e., campaignID alone). But I am only assuming that the order of conditions in the WHERE clause is not significant as it pertains to the index order?
  6. #4
  7. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2002
    Posts
    32
    Rep Power
    15
    Random afterthought: Does mysqldump also put read locks on the tables when it dumps 'em?
  8. #5
  9. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Oct 2002
    Location
    Binghamton, NY
    Posts
    536
    Rep Power
    28
    Yes, mysqldump just uses SELECT in order to read the data. So with MyISAM that means it will use a read lock.

    No, the order doesn't matter when both columns are used in the WHERE clause of the same query. Where it does matter is the leftmost-prefix issue: a query that filters only on a non-leading column of the index cannot use it. It also matters if you are trying to use the index for ORDER BY or other non-WHERE purposes.
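
    For example, given an index on (CampaignID, Received) and a placeholder value:

      -- both of these can use the index; the order in the WHERE is irrelevant
      SELECT COUNT(*) FROM confirmations WHERE CampaignID = 42 AND Received IS NOT NULL;
      SELECT COUNT(*) FROM confirmations WHERE Received IS NOT NULL AND CampaignID = 42;

      -- this one cannot, because Received is not a leftmost prefix of the index
      SELECT COUNT(*) FROM confirmations WHERE Received IS NOT NULL;

      -- order also matters for sorting: this can be read in index order,
      -- whereas ORDER BY Received (without CampaignID) could not
      SELECT * FROM confirmations ORDER BY CampaignID, Received;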

    As a general rule, queries on a production system should never take more than 1 second, and I personally design all of mine to be less than .05 seconds. This allows for a high level of concurrent access and helps to minimize any access issues with table locks.
