#1
  1. hiding my <b> from ur <strong>
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2004
    Posts
    959
    Rep Power
    181

    Optimizing Query that calculates, groups, and groups again


    Seeing if there's any bright optimization ideas out there on a particular query need.

    To simplify, imagine that we want to know how many kids in a class are present (P) between 0 and 50 percent of the time, and how many are present 50 to 100 percent of the time. Call the kids with under 50% attendance "bad" and the other ones "good".

    The data in the attendance table may look like:

    studentid | mark
    -------------------
    JACK | P
    JACK | A
    JACK | A
    JACK | A
    JACK | A
    JACK | P
    JILL | P
    JILL | P
    JILL | P
    JILL | A

    Now, Jack is a "bad" kid because he is absent over half the time, while Jill is a "good kid".

    The resulting output should simply look like

    array(
    'bad' => 1,
    'good' => 1,
    )

    1 bad kid, and one good kid.

    While a simple concept, the steps are expensive. The SQL first needs to calculate a percentage for each and every kid (not cheap in itself) and then aggregate the counts of those.

    For just the first part - calculating the percentages - I've found that:

    SELECT *, ((SELECT COUNT(*) FROM attendance WHERE mark = 'P' AND studentid=a.studentid)/(SELECT COUNT(*) FROM attendance WHERE studentid=a.studentid)) AS percentage FROM attendance AS a ... (join filters) ... GROUP BY studentid

    works, but is very expensive (and I have hundreds of thousands of marks). And of course, this would still require either PHP manipulation of the results or more CASE syntax for the second step of aggregation.

    Any thoughts on a different strategy? Of course, PHP manipulation from the start is also a possibility, but I'm again finding that solution is taking too long.

    Thanks for any input,
    Jeremy
    ****
    Enjoy my post? Drop some props by hitting the scales button up top. JBL

    Website Design in Los Angeles and Washington, DC by PoweredPages.com
  2. #2
  3. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,204
    Rep Power
    4279
    Code:
    SELECT CASE WHEN percentage >= 50
                THEN 'good'
                ELSE 'bad' END AS category
         , COUNT(*) AS total
      FROM ( SELECT studentid
                  , 100.0 * COUNT(CASE WHEN mark = 'P' 
                                       THEN mark END) / 
                            COUNT(*) AS percentage
               FROM attendance 
             GROUP 
                 BY studentid ) AS t
    GROUP
        BY category

    Comments on this post

    • daprezjer agrees
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  4. #3
  5. hiding my <b> from ur <strong>
    Devshed Novice (500 - 999 posts)

    Join Date
    Feb 2004
    Posts
    959
    Rep Power
    181
    Good stuff. Really simple follow up.

    There's a portion of the actual script that requires some calculation of db values to do multiple comparison values with the CASE statement. For simplicity sake, let's just say:

    CASE
    WHEN num/total > 5 OR num/total <=10 THEN 'top'
    WHEN num/total > 0 OR num/total <=5 THEN 'bottom'
    END

    Is there a way to write this script without using num/total twice for each WHEN statement? I've seen something like:
    WHEN num/total >5 <= 10 around as a shorthand, though I don't believe I ever got it to work. My (perhaps misplaced) worry is that it has to calculate num/total twice for just the one WHEN statement.
    ****
    Enjoy my post? Drop some props by hitting the scales button up top. JBL

    Website Design in Los Angeles and Washington, DC by PoweredPages.com
  6. #4
  7. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,204
    Rep Power
    4279
    Originally Posted by daprezjer
    My (perhaps misplaced) worry is that it has to calculate num/total twice for just the one WHEN statement.
    i would be very surprised if the optimizer failed to recognize the opportunity to do one calculation instead of two

    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL

IMN logo majestic logo threadwatch logo seochat tools logo