Page 1 of 2 12 Last
  • Jump to page:
    #1
  1. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220

    How can I update this query so not only I get unique ip, I also get unique email


    Hi;

    Basically, I want to remove duplicate emails from the results. How can I update this query so not only I get unique ip, I also get unique email?

    Thanks
    Code:
     SELECT   *
    FROM     analytics
    WHERE    email NOT IN (list of some emails)
    AND      plc_url = 'plc1'
    GROUP BY ip
    Code:
    CREATE TABLE `pum_sequence` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `email` varchar(255) DEFAULT NULL,
      `ip` varchar(255) DEFAULT NULL,
      `status` varchar(255) DEFAULT NULL,
      `date_added` date DEFAULT NULL,
      `passed_to_ontraport` int(11) DEFAULT '0',
      `ontraport_user_id` int(11) DEFAULT NULL,
      `opt_in_for_this_offer` int(11) DEFAULT '0',
      `plc_1_visited` int(11) DEFAULT '0',
      `plc_2_visited` int(11) DEFAULT '0',
      `plc_3_visited` int(11) DEFAULT NULL,
      `plc_1_number_of_clicks_from_email` int(11) DEFAULT '0',
      `plc_2_number_of_clicks_from_email` int(11) DEFAULT '0',
      `plc_3_number_of_clicks_from_email` int(11) DEFAULT NULL,
      `plc_1_tagged_active` int(11) DEFAULT '0',
      `plc_2_tagged_active` int(11) DEFAULT '0',
      `plc_3_tagged_active` int(11) DEFAULT NULL,
      `plc_1_number_of_clicks_from_top_links` int(11) DEFAULT NULL,
      `plc_2_number_of_clicks_from_top_links` int(11) DEFAULT NULL,
      `plc_3_number_of_clicks_from_top_links` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) EN
  2. #2
  3. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,435
    Rep Power
    9645
    You can't get unique IP or unique email without two queries. You can get unique IP and email pair in one query, with either a GROUP BY (if you want aggregate data) or a DISTINCT (if you want only the address and email).

    Note that SELECT * + GROUP BY doesn't make sense so you have to think about what you're trying to get from the query.
  4. #3
  5. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,732
    Rep Power
    4288
    but SELECT * with GROUP BY does make sense... in MySQL

    actually the rationale for it is quite reasonable, the only problem is, so few actual practitioners understand it

    Originally Posted by DaManual
    You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
    see also the ANY_VALUE() function
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  6. #4
  7. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220
    I might as well do this:

    PHP Code:
    $unique_visitors_plc_1 "SELECT * FROM analytics WHERE email NOT IN (some emails) AND plc_url = 'plc1' GROUP BY ip";
            
    $query $this->db->query($unique_visitors_plc_1);
            
    $this->data['plc_1_unique'] = $query->num_rows();
            
    $emails=array();
            foreach(
    $query->result_array() as $val=>$row)
                {
                    if(
    in_array($row['email'], $emails))
                        {
                            
    $this->data['plc_1_unique'] = $this->data['plc_1_unique']-1;
                        }
                    
    $emails[]=$row['email'];    
                } 
    Now $this->data['plc_1_unique'] has the right number of emails.

    Yes, no?
  8. #5
  9. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,732
    Rep Power
    4288
    Originally Posted by English Breakfast Tea
    Yes, no?
    i'm sorry, i don't speak php, perhaps someone else
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  10. #6
  11. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220
    Originally Posted by r937
    i'm sorry, i don't speak php, perhaps someone else
    Hey Rudy. I love you so much but that was for Requnix.
  12. #7
  13. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,435
    Rep Power
    9645
    Originally Posted by r937
    but SELECT * with GROUP BY does make sense... in MySQL
    To me it doesn't because of the gotcha:
    The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
    If one is fine with a "random" row per group then I feel like the query should explicitly state that rather than rely on a MySQL nuance. However I recognize the practicality of coding to a particular architecture, so I guess using it for performance reasons makes enough sense to me.

    Originally Posted by English Breakfast Tea
    I might as well do this:
    Not really. Not only are you selecting everything and then only using two values, you're doing a SUM that could just as well happen in the query itself... maybe. It's back to the gotcha and whether it's being used intentionally here.

    What would make most sense to me, someone who doesn't understand what the table and its rows are supposed to represent, would be to get a sum of the plc_1_unique per IP and email. You're almost doing the former except you are only getting one random row per IP address, and your code is effectively doing the latter except it's skipping the first row found for each email?
    The whole thing seems wrong to me.
  14. #8
  15. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,732
    Rep Power
    4288
    i agree, it seems wrong... there's way too much else going on in the grouping

    for example, suppose there are multiple emails for a given IP... couldn't those be different people using the same computer? or different people at a common location using a router?

    and if you ~did~ want to combine multiple emails for each IP, wouldn't you want to ~add~ stuff like number of clicks?
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  16. #9
  17. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220
    In the past 10 years I'd never seen you guys not completely agree with each other!

    I wonder what Kicken thinks.

    I think if we are dealing purely with MySQL, Rudy would be right. When it comes to the web, I think Requnix has better pointers.

    Requinix, when I count the results by hand, it looks correct. Can I see your solution, please? Or the psudocode maybe?

    Thanks
  18. #10
  19. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,435
    Rep Power
    9645
    Originally Posted by English Breakfast Tea
    Requinix, when I count the results by hand, it looks correct. Can I see your solution, please? Or the psudocode maybe?
    I can't do anything without understanding the nature of the table and its data.

    So I finally understand what the code is doing: counting the number of unique email addresses. The difficulty was that it does it in the totally backwards way of counting the total and then subtracting the duplicates, rather than just counting the unique values. But the whole thing still has the problem of implying a relationship between the email and IP addresses, which I don't think there is (or should be).

    What about
    Code:
    SELECT COUNT(DISTINCT email) FROM analytics WHERE email NOT IN (...) AND plc_url = 'plc1'
    That drops the IP address entirely.

    If that's not correct then you need to explain how the email and IP addresses are related to each other. Hopefully with some sample data showing the four possibilities for unique/shared values, such as
    Code:
    email       | ip
    ------------+------------
    foo@foo.com | 12.34.56.78 <- unique
    foo@foo.com | 23.45.67.89 <- same email different ip
    bar@bar.com | 12.34.56.78 <- different email same ip
    bar@bar.com | 23.45.67.89 <- unique
    With a GROUP BY ip there is no guarantee which of the two email addresses you get - it's not actually random, and it is somewhat predictable if you know how the table is indexed, but you can't know just by looking at the data.
    Code:
    SELECT email FROM analytics GROUP BY ip
    Code:
    email            email            email
    -----------  or  -----------  or  -----------
    foo@foo.com      foo@foo.com      bar@bar.com
    foo@foo.com      bar@bar.com      bar@bar.com
  20. #11
  21. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,732
    Rep Power
    4288
    Originally Posted by requinix
    To me it doesn't because of the gotcha:
    well, that's just the thing... if you do it when it's correct to do it, then it isn't a gotcha, because all values of non-aggregated columns are unique within the group. i.e. deterministic!!

    it's only when the grouping attempts to consolidate column values that are ~not~ functionally dependent on the grouping key that you get non-deterministic results, in which case you should/could use ANY_VALUE() to make it clear to others what your query is supposed to be doing
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  22. #12
  23. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220
    Oh! How come I never thought of this?
    Code:
    SELECT DISTINCT email FROM analytics WHERE email NOT IN (...) AND plc_url = 'plc1'
    How is this different results from
    Code:
    SELECT email FROM analytics WHERE email NOT IN (...) AND plc_url = 'plc1' group by email
    How come Rudy didn't mention DISTINCT?
  24. #13
  25. Lazy Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,435
    Rep Power
    9645
    Originally Posted by English Breakfast Tea
    How is this different results from
    Code:
    SELECT email FROM analytics WHERE email NOT IN (...) AND plc_url = 'plc1' group by email
    It's not, really. May very well work the same way behind the curtain.

    But don't you just need a count of the unique email addresses? Nothing you've posted so far suggests you actually needed each value, and getting them all when you just want a count is bad.
  26. #14
  27. SQL Consultant
    Devshed Supreme Being (6500+ posts)

    Join Date
    Feb 2003
    Location
    Toronto Canada
    Posts
    27,732
    Rep Power
    4288
    Originally Posted by English Breakfast Tea
    How come Rudy didn't mention DISTINCT?
    because nothing suggested that you wanted only the emails!!!

    using SELECT * and GROUP BY ip suggested that you wanted all the other columns as well, my friend

    we always start with the assumption that you know what you want
    rudy.ca | @rudydotca
    Buy my SitePoint book: Simply SQL
  28. #15
  29. A Change of Season
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Mar 2004
    Location
    Next Door
    Posts
    3,246
    Rep Power
    220
    Hey guys;

    sorry I wasn't 100% clear about what I want.

    The whole issue came up because:

    1 - Some users have multiple ip addresses.
    2 - Some visits don't have a value for email field.

    does it make sense?

    I want to get unique visitors per page.What's the best solution that gives the most accurate solution?

    Thanks
Page 1 of 2 12 Last
  • Jump to page:

IMN logo majestic logo threadwatch logo seochat tools logo