#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    16
    Rep Power
    0

    Remove duplicates from slowly changing dimension.


    This should be an easy one for you SQL heads, but it's puzzling me!

    I've got a slowly changing dimension table for products with some duplicate attributes - what's the best way to remove them, please?

    Code:
    UNIQUE_ID | DATE_FROM | DATE_TO   | PRODUCT_ID | ATTRIBUTE_1 | ATTRIBUTE_2
    1           01-JAN-13   02-JAN-13   423          MONKEY        5
    2           03-JAN-13   04-JAN-13   423          MONKEY        5
    3           05-JAN-13   08-JAN-13   423          MONKEY        4
    4           09-JAN-13   10-JAN-13   423          SUPERMONKEY   4
    5           01-JAN-13   08-JAN-13   378          BANANA        2
    6           09-JAN-13   10-JAN-13   378          BANANA        3
    The natural key should be PRODUCT_ID, ATTRIBUTE_1 and ATTRIBUTE_2. The table should therefore be recreated as follows:

    Code:
    UNIQUE_ID | DATE_FROM | DATE_TO   | PRODUCT_ID | ATTRIBUTE_1 | ATTRIBUTE_2
    1           01-JAN-13   04-JAN-13   423          MONKEY        5
    2           05-JAN-13   08-JAN-13   423          MONKEY        4
    3           09-JAN-13   10-JAN-13   423          SUPERMONKEY   4
    4           01-JAN-13   08-JAN-13   378          BANANA        2
    5           09-JAN-13   10-JAN-13   378          BANANA        3
    Thanks!
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Oct 2003
    Location
    Germany
    Posts
    2,777
    Rep Power
    348
    Code:
    select row_number() over (order by null) as unique_id,
           t.*
    from (
        select date_from, 
               date_to, 
               product_id, 
               attribute_1, 
               attribute_2, 
               row_number() over (partition by product_id, attribute_1, attribute_2 order by date_from) as rn
        from the_table
    ) t
    where rn = 1
    order by product_id, date_from
    I will not read nor answer questions where the SQL code is messy and not formatted properly using [code] tags.
    http://forums.devshed.com/misc.php?do=bbcode#code

    Tips on how to ask better questions:
    http://tkyte.blogspot.de/2005/06/how-to-ask-questions.html
    http://wiki.postgresql.org/wiki/SlowQueryQuestions
    http://catb.org/esr/faqs/smart-questions.html
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    16
    Rep Power
    0
    Hi Shammat - thanks so much for the pointers, but I also need to correct the DATE_TO field.

    In my example, PRODUCT_ID, ATTRIBUTE_1 and ATTRIBUTE_2 are identical for UNIQUE_IDs 1 & 2.

    DATE_FROM needs to be the earliest from all duplicate entries, whilst DATE_TO needs to be the latest from all duplicate entries. Note how UNIQUE_ID of the target has a DATE_FROM of 04-JAN-13, whereas your query does not change it.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    16
    Rep Power
    0
    Originally Posted by cs03dmj
    Hi Shammat - thanks so much for the pointers, but I also need to correct the DATE_TO field.

    In my example, PRODUCT_ID, ATTRIBUTE_1 and ATTRIBUTE_2 are identical for UNIQUE_IDs 1 & 2.

    DATE_FROM needs to be the earliest from all duplicate entries, whilst DATE_TO needs to be the latest from all duplicate entries. Note how UNIQUE_ID of the target has a DATE_FROM of 04-JAN-13, whereas your query does not change it.
    This appears to work, but I'm not sure it's as efficient as it could be:

    Code:
    select row_number() over (order by null) as unique_id,
           t.*
    from (
        select date_from, 
               max(date_to) over (partition by product_id, attribute_1, attribute_2 order by date_from) as date_to, 
               product_id, 
               attribute_1, 
               attribute_2, 
               row_number() over (partition by product_id, attribute_1, attribute_2 order by date_from) as rn
        from the_table
    ) t
    where rn = 1
    order by product_id, date_from
  8. #5
  9. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Sep 2006
    Posts
    834
    Rep Power
    387

    Cool


    Building on Shammat's excellent query, try this:
    Code:
    CREATE TABLE MyTable_New
    AS
    SELECT ROW_NUMBER () OVER (ORDER BY NULL) AS unique_id
         , t.date_from
         , t.date_to
         , t.product_id
         , t.attribute_1
         , t.attribute_2
      FROM (SELECT MIN (date_from) OVER (PARTITION BY product_id, attribute_1, attribute_2) AS date_from
                 , MAX (date_to) OVER (PARTITION BY product_id, attribute_1, attribute_2) AS date_to
                 , product_id
                 , attribute_1
                 , attribute_2
                 , ROW_NUMBER () OVER (PARTITION BY product_id, attribute_1, attribute_2 ORDER BY date_from) AS rn
              FROM MyTable) t
     WHERE rn = 1
     ORDER BY product_id, date_from
    /
    ALTER TABLE MyTable RENAME MyTable_Old;
    ALTER TABLE MyTable_New RENAME MyTable;
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    16
    Rep Power
    0
    Thanks both. Do you know if it's possible to achieve this with just one PARTITION BY clause? In my real example, there are over 100 attributes(!)
  12. #7
  13. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Sep 2006
    Posts
    834
    Rep Power
    387

    Cool


    Originally Posted by cs03dmj
    Thanks both. Do you know if it's possible to achieve this with just one PARTITION BY clause? In my real example, there are over 100 attributes(!)
    Code:
    CREATE TABLE mytable_new
    AS
        SELECT ROW_NUMBER () OVER (ORDER BY NULL) AS unique_id, t.date_from
             , t.date_to, t.product_id
             , t.attribute_1, t.attribute_2
          FROM (  SELECT MIN (date_from)  date_from
                       , MAX (date_to)  date_to
                       , product_id
                       , attribute_1
                       , attribute_2
                    FROM mytable
                GROUP BY product_id, attribute_1, attribute_2) t
      ORDER BY product_id, date_from
    /

IMN logo majestic logo threadwatch logo seochat tools logo