MS SQL Development
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me
Go Back   Dev Shed ForumsDatabasesMS SQL Development

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old October 7th, 2004, 09:27 PM
confuzed2xover confuzed2xover is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2004
Posts: 185 confuzed2xover User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 Day 5 h 45 m 53 sec
Reputation Power: 5
How can I delete all but one instance of a duplicate row?

I have several rows in a table that contain duplicate information. I want to trim them down to one instance per row, without other duplicates.

I have several columns, but only 2 of the columns are used to determine the "Identical Rows", however one of the other columns could also be used to determine which of the others are deleted.

Let's just say I have a table called DATA.
In DATA I have the following columns:
C1: ID/Key column (Unique numbers not associated with any other row)
C2: Specific Data column 1
C3: Specific Data Column 2
C3: Keeper if not null
C4: Misc Info
C5: Misc Info

Basically IF C2 & C3 are duplicated, then I want to keep one of these and ditch the others.

The row I want to keep is the one that has info in C4, but at times C4 does not have anything in it on any of them, at that point I just care about keeping one of the C2 & C3 Duplicate rows.

The other issue I face is C2 will always have something in it, but at times C3 will not and C4 will.

So I want to keep the one duplicate row that has the most info in it. At minimum, I want to always keep C2 & C3 if they both contain data.

For example, if I had the following duplicate rows:
(C2) red (C3) 123 (C4) XYZ
(C2) red (C3) 123 (C4) XYZ
(C2) red (C3) 123 (C4) XYZ
(C2) red (C3) (C4) XYZ
(C2) red (C3) 123 (C4)

I want to keep: (C2) red (C3) 123 (C4) XYZ

I have some rows that have three exact duplicates, some that have four, some that have two.

How can I tell the Query Manager to delete all but one instance of each duplicate row?

I did do a query to find out how many duplicates I have and found it to be around 232,000+. Basically I will be cutting this in half if I can get rid of the duplicates.

I have read 30 different ways on many sites and I am now officially confused

Thank you for your help.

Last edited by confuzed2xover : October 7th, 2004 at 11:10 PM. Reason: Realized another issue

Reply With Quote
  #2  
Old October 15th, 2004, 09:54 AM
confuzed2xover confuzed2xover is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2004
Posts: 185 confuzed2xover User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 Day 5 h 45 m 53 sec
Reputation Power: 5
Well I have gotten this far with the above question:
Code:
select C1,C2,C3, C4,C5,C6,C7,C8, C9 
from DATA
WHERE EXISTS (Select C2, C3, C4
from DATA
GROUP BY C2, C3,C4
HAVING count(*) > 1)
ORDER BY C2, C3 


Where I am having issues is turning this into a statement to delete all but one row of each unique row. I want to keep one of the rows and delete the other duplicates.

Any ideas?

Reply With Quote
  #3  
Old October 15th, 2004, 04:11 PM
r937's Avatar
r937 r937 is offline
SQL Consultant
Click here for more information.
 
Join Date: Feb 2003
Location: Toronto Canada
Posts: 17,700 r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level) 
Time spent in forums: 1 Month 3 Weeks 6 Days 17 h 5 m 9 sec
Reputation Power: 986
which one do you want to keep?

you have to have some rule, like pick the row with the largest C4 or something
__________________
r937.com | rudy.ca

Reply With Quote
  #4  
Old October 15th, 2004, 08:59 PM
confuzed2xover confuzed2xover is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2004
Posts: 185 confuzed2xover User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 Day 5 h 45 m 53 sec
Reputation Power: 5
Still confused

Where I am having the issue is.............

I want to keep one of the rows, but I want to keep the row that has the most information in it.

If I have a row that has C2, C3, & C4 populated, and that has three other exact matches, I want to keep one of those rows and delete the others.

If I have a row that has C2, C3, C4, & C5 that has several other matches with C2, C3, & C4, then I want to keep the row that has C2, C3, C4, & C5 populated.

I know this sounds kind of confusing, but I'll explain what I have and why.......

I have a table that is populated with several columns of information about parts.
C2 is the part number column
C3 is the alternate number
C4 is the description
C5 is the type of part
C6 is the manufacturer
C7 is the manufacturer code
C8 is the Revision or the Series/make/model.

Basically I am trying to keep the one row that has the most information in it, and delete the other rows.

All rows will have at least part number & manufacturer. From there is where it gets more detailed. Basically I want to keep one row with the most information in it.

The other problem I face is that though part number and manufactuer will be consistent, there are many times I will have the same part number with several different alternate part numbers. I need to keep one row of each that has the most information in it with the alternate part number as well.

Okay, that probably just got confusing which is why I am having problems figuring out how to do this.

Some part numbers do not have or will never have alternate part numbers, but some have many different alternate part numbers for the same part number, that is why I have a seperate column for the alternate part numbers. This way when I look up a part number, I can see all the alternate numbers that one part number has as well, or if I look up the alternate, I can see the industry common part number.

So with that in mind, I want to have only one reference of a particular part number, alternate, and description at minimum. Though there are several duplicates that are just part number & manufacturer & type.

What I am faced with is multiple scenerios of duplicates that are a wide range of exact matches. It could be C2, C3, C4, & C5 that match several, it could be C2, C4, C5, & C8 that have exact matches, etc.

My main objective is to keep one row that has the most information of a particular part number & or alternate part number & manufacturer.

Reply With Quote
  #5  
Old October 15th, 2004, 09:06 PM
r937's Avatar
r937 r937 is offline
SQL Consultant
Click here for more information.
 
Join Date: Feb 2003
Location: Toronto Canada
Posts: 17,700 r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level) 
Time spent in forums: 1 Month 3 Weeks 6 Days 17 h 5 m 9 sec
Reputation Power: 986
okay, i have an idea

do an ordinary SELECT, pull out every row, and assign each row a "score"

this score will be based on how many columns have information

save the result into a table

then you can delete all rows from the main table where they match a row in the saved table that has a lower than maximum score for its group

make sense?

Reply With Quote
  #6  
Old October 16th, 2004, 06:02 PM
confuzed2xover confuzed2xover is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2004
Posts: 185 confuzed2xover User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 1 Day 5 h 45 m 53 sec
Reputation Power: 5
Not being a SQL guru or anything, I will have to research the "Score" thing, but that starts me in the right direction.

I think I can see where you are going with this and it sort of makes sense, though I will have to research how to get a score for each, I think I can see how this would work.

I'll hammer on it......................

Thank you so much for your input, it is invaluable. I really appreciate your suggestions, you have helped me more than once and I am very grateful.

Reply With Quote
  #7  
Old October 16th, 2004, 07:23 PM
r937's Avatar
r937 r937 is offline
SQL Consultant
Click here for more information.
 
Join Date: Feb 2003
Location: Toronto Canada
Posts: 17,700 r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level)r937 User rank is General (90000 - 100000 Reputation Level) 
Time spent in forums: 1 Month 3 Weeks 6 Days 17 h 5 m 9 sec
Reputation Power: 986
okay, good luck

remember to use CASE expressions, and you can nest them if necessary

e.g.
Code:
select case when C4 is not null 
            then case when C5 is null
                      then 3 
                      else 5 
                 end
            else 12 
       end            as score
  from yourtable

Reply With Quote
Reply

Viewing: Dev Shed ForumsDatabasesMS SQL Development > How can I delete all but one instance of a duplicate row?


Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump


Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 





© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 5 hosted by Hostway