#1
  1. No Profile Picture
    Vote Libertarian
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Location
    N'wallins
    Posts
    276
    Rep Power
    15

    distributing database reads among db slaves


    [background]
    I have two identical Dell Poweredge Xeon boxes. Management wants to use one for admin/production and have the other available for redundancy/backup. I am not sure exactly what they have in mind, but I interpreted this to mean: keep the two boxes sync'd.
    Assuming a stable Linux/Apache configuration, this means keeping the PHP code sync'd and the MySQL database sync'd.
    To keep the PHP sync'd, I house a CVS repository on my workstation and use a build script on each box to pull the latest CVS code, move files around, and set the proper permisisons.
    To keep the database sync'd, I am using MySQL replication, whereby the master database (on the production box) logs all SQL queries and executes those same queries on the slave database (backup box).
    One side-benefit of replication is that you can distribute read operations among the slaves (or master and slave, in this case) to improve availability and performance. Writing to the slave will corrupt replication.
    [/background]

    There are several layers within which to to implement distributed reads. I am considering PHP right now for minimal complexity. There are also different granularities for distribution: optimally, each "SELECT" query coming down the pipe would go to different host than the previous query. I am not sure how to achieve this end result. Here is the best I could come up with:

    Every http request for an .html file spawns my php init script, which includes the necessary classes and instantiates the proper objects for my application. In my DbHandle constructor, I randomly select a db host to read from. Thus, if a single webpage has several queries, they will all be sent to the same host. The next webpage request may randomly select the same host again, or it may select a different (the only other) host.

    This is probably the least efficient way of distributing database reads because the granularity is at the webpage, rather than query level, and the selection is not sequential but random.

    Rather than selecting at random, is there some sort of variable that will persist between webpage requests (in memory.. forget file / database persistence for performance reasons)? IOW
    Code:
    if($just_used_slave)
    {
      $read_host = 'master';
      $just_used_slave = false;
    }
    else
      $read_host = 'slave';
      $just_used_slave = true;
    }
    Obviously, $just_used_slave would not be a regular PHP variable. What is available (and quick)?

    Anyone have any good ideas for distributing reads at the query level, without extra hardware (e.g. load balancer)?

    Thanks

    PS here is (basically) my DbHandle constructor:
    Code:
    function Init($user = 'default', $pass = 'default', $db = 'default')
    {
      $this->DbUser = $user;
      $this->DbPass = $pass;
      $this->Db = $db;
    
      $this->DbHost = 'master';
    
      if($this->DbUser == 'reader' and intval(rand()) % 2)  // sometimes use the slave
        $this->DbHost = 'slave';
    
      return ($this->Connection = mysql_connect($this->DbHost, 
                                                $this->DbUser, 
                                                $this->DbPassword,
                                                true));
    }
    Last edited by CokeJunkie; December 19th, 2003 at 04:49 PM.
    • "Write programs as if the most important communication they do is not to the computer that executes them but to the human beings who will read and maintain the source code in the future" - ESR, The Art of UNIX Programming
    • "Programs must be written for people to read, and only incidentally for machines to execute." - Abelson & Sussman, SICP, preface to the first edition
    • "Programs must be written for machines to execute or else you just have a boring book" - DaWei_M
    • Vote Libertarian
  2. #2
  3. PHP Wacko
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Sep 2003
    Location
    Washington DC
    Posts
    1,308
    Rep Power
    340
    IMHO, I think you may be better served to go with the random approach rather than attempt to track which database was used last. I suppose the reason I say this was because this sentence stuck out in my mind:
    the selection is not sequential but random.
    I must ask, does it really matter?

    In my mind, this is all about statistics. If you flip a coin 10 times, I doubt you will get 5 heads and 5 tails. More likely you will end up with a 7/3 or 6/4 split. Why? The data set isn't large enough. If, on the other hand, you flipped a coin 1 million times, I am willing to bet it will be very close to 500,000/500,000 split. As the number of times you flip the coin approaches infinity, the closer it comes to being exactly 50/50.

    What does any of this have to do with your problem? Will, say there are 10 people browsing your site. Does it really matter if all 10 people randomly select database database 1, instead of 5 people getting database 1 and 5 people getting database 2? Probably not, I'm sure your mysql database can easily handle 10 requests. However, what if 1000 people are hitting your page at once? Since our dataset has largely increased, chances are it will be much closer to a 50/50 split than if 10 people are hitting your page, which is exaclty what we want.
    "I'm working so I won't have to try so hard" - the strokes

    security articles I recommend:
    SQL Injection Attacks
    XSS
    CSRF
    Little Bobby Tables
  4. #3
  5. No Profile Picture
    Vote Libertarian
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Location
    N'wallins
    Posts
    276
    Rep Power
    15

    true, true


    the whole point of this exercise to make sure we can handle the slashdot effect gracefully (when/if it ever happens). The hardware we're running gives us a lot of headroom already, but more headroom can't hurt right?

    So you're spot on in looking at the 1,000 request case to see how this tweak will affect performance.

    The reason why I think sequential is better than random is that a random selection will eventually pound one server X times in a row. (choose X)

    Given a high load with little or no headroom, choosing the sequential case for that window of X requests will smooth the peak automatically, making it more probably that the server can handle the load.

    Then again, I may be thinking too (hard? / much? / ignorantly?)

    BTW, I need to do my homework regarding Apache/MySQL optimization to handle intermittent high loads. I'm not sure where to start. Apache bench?
    Both httpd.conf (1.3.28) and my.cnf (my-large) are nearly stock.
    Last edited by CokeJunkie; December 15th, 2003 at 06:50 PM.
    • "Write programs as if the most important communication they do is not to the computer that executes them but to the human beings who will read and maintain the source code in the future" - ESR, The Art of UNIX Programming
    • "Programs must be written for people to read, and only incidentally for machines to execute." - Abelson & Sussman, SICP, preface to the first edition
    • "Programs must be written for machines to execute or else you just have a boring book" - DaWei_M
    • Vote Libertarian
  6. #4
  7. No Profile Picture
    Vote Libertarian
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Location
    N'wallins
    Posts
    276
    Rep Power
    15

    Re: distributing database reads among db slaves


    Originally posted by CokeJunkie
    Rather than selecting at random, is there some sort of variable that will persist between webpage requests (in memory.. forget file / database persistence for performance reasons)? IOW
    Code:
    if($just_used_slave)
    {
      $read_host = 'master';
      $just_used_slave = false;
    }
    else
      $read_host = 'slave';
      $just_used_slave = true;
    }
    Obviously, $just_used_slave would not be a regular PHP variable. What is available (and quick)?
    apache_note() could be used for this purpose, no?
    • "Write programs as if the most important communication they do is not to the computer that executes them but to the human beings who will read and maintain the source code in the future" - ESR, The Art of UNIX Programming
    • "Programs must be written for people to read, and only incidentally for machines to execute." - Abelson & Sussman, SICP, preface to the first edition
    • "Programs must be written for machines to execute or else you just have a boring book" - DaWei_M
    • Vote Libertarian

IMN logo majestic logo threadwatch logo seochat tools logo