September 10th, 2012, 09:37 PM
Master-Master replication, artificial queues, and concurrency
Background: we have two MySQL 5.5 servers running replication back at each other (where my issue is focused), plus a variety of read-only slaves where we try to push most of the SELECTing work. On a logarithmic scale of 1=none and 10=Facebook we probably max our traffic out around... 6 or 7? Feels like. Not incredibly high but enough to be worrisome. We do some ecommerce and have added/are adding a feature where someone will place an "order" for something and it will be recorded, but we'll wait to process that order until a few hours later.
To do this later processing we have an InnoDB table recording the orders and a cronjob to go through and find ones to process. To try to eliminate the threat of concurrency (I don't really care if this script runs 10 at a time) it uses a SELECT+UPDATE like so:
If that UPDATE returns one affected row then it considers it a success and processes the order. If not then apparently that order was "claimed" by another concurrent cronjob and it just tries again from the top. (And if there aren't any results then the script stops.)
SELECT id FROM table WHERE order is eligible to process AND status = "pending"
-- followed by
UPDATE table SET status = "processing" WHERE id = value from above
What I want to know is whether that's the best approach. The orders basically have to be stored in a database and not a real queuing system (I think we have a Rabbit somewhere) because we care about not losing data in case of Big Problems. So my questions include:
- Is that scheme safe?
- What about the master-master replication? Row or statement, I don't know. What if both servers get an identical UPDATE at the same time and don't push out/pull down the change fast enough?
- Should it be using a SELECT... FOR UPDATE? LOCK IN SHARE MODE? Does it need a LIMIT 1 to fit into some special criteria? Transactions?
- And of course: is there something better?
Last edited by requinix; September 10th, 2012 at 10:00 PM.
September 12th, 2012, 10:06 AM
Since you have the key ingredient:
Your solution should work fine as long as your workers only fetch possible jobs from one server.
Although if you have more than two states "processing","pending"
I would add:
to make sure that there wasn't any risk for race condition.
UPDATE table SET status = "processing"
id = value from above
AND status = "pending"
BUT if you are running worker processes against both servers then you are out of luck since there is no common lock and no guarantee when and how the replication will take place. Granted it is usually very fast but since MySQL replication is asynchronous there is no guarantee when the change is written to the other server so you might have one worker on either server start working on the same order.
I don't see any big difference between what you are using now when you try the UPDATE and then read the error code or use SELECT FOR UPDATE to lock the record before performing the UPDATE.
If you get a very high volume of queries it can have an impact on performance since you might get a lot of hit-and-miss with the first, but then again acquiring a lock and then performing the changes also takes some time so it's often down to testing to see which is better if you consider really high volumes.
September 12th, 2012, 02:33 PM
That's the thing: the query will go to a load balancer hiding the two servers. But maybe some setup where there's the primary master and secondary master would be nice... throw some queries (small volume) to the primary and the rest (that need a master) get distributed between them.
Actually I have that, it just didn't make it into my post. The whole "returns one affected row" thing kinda depends on it.
I figured as much, but I was hoping there was some super-secret MySQL solution.
Our solution thus far is to try to limit the cronjob to one instance at a time. I don't like it and it's not my first choice but it's the only solution that's been accepted.
September 14th, 2012, 10:16 AM
Which is a bit similar to how I usually advice scaling for websites.
Originally Posted by requinix
Run master->slave replication to keep the environment simple and avoid possible relational nightmare if your two masters get out of sync.
Write your application so that you have two database query functions, one that always queries the master (so that you in the application can control your writing (and/or sometimes reading) to one instance and the other function for load balanced reading from the servers where synchronicity isn't an issue.
Why do you have to run the cronjob process through the load balancer also? If you want to be able to run them in parallel you could run them all against the same server but still let the normal application traffic run through the load balancer.
Originally Posted by requinix
September 14th, 2012, 02:31 PM
So after more questions it sounds like I was misunderstanding a couple key things:
1. With AWS (where our databases live) there may be two masters but they're behaviorally identical to just one, like in terms of concurrency and such. So the master<->master replication apparently isn't "normal" replication and not an issue I need to worry about.
2. The main problem is two processes trying to update the same row at the same time. Under high load and bad conditions the two will both get back success on the same row; even updating with an ID and querying for that wouldn't work. As if the actual data in the table gets out of sync with what an individual process is doing: a worker caches the value that's been set even though the actual data hasn't been updated to match yet. Meanwhile the next worker sees (and caches) the old data. The result is that both processes think they were successful and only the latter one will actually be right.
I can't find anything to confirm #2 is true but I am finding people saying that in heavy-load situations UPDATEs and related should go through a queue to make sure there's only one process using the data at a time.
September 15th, 2012, 06:35 PM
Well I can't give any response to these vague descriptions.
Originally Posted by Requinox
But I must say that I have a hard time thinking that if you perform synchronous writes (like your updates) and check the return code. And that this would be some kind of cached information since that would go against a huge part of the strategy of an relational database.
But yes you can get a lot of things wrong and I don't know if this is what the other people are talking about.
Comments on this post