#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2013
    Posts
    5
    Rep Power
    0

    Remove duplicates in flat file


    Hi all,

    I have a issues while loading a flat file to the DB. It is taking much time.
    When analyzed i found out that there are duplicates entry in the flat file.

    There are 2 type of Duplicate entry.

    1) is entire row is duplicate. ( i can use sort | uniq) to remove the duplicated entry.
    2) the PK which are forming the composite columns are same for 2 records , but the other columns are different which is also rejected and only one is getting loaded. PFB an example for the same.

    My Pk are 1 , 4, 6, 8 from the flat file which is going to be loaded into the DB.

    Column names : 1 2 3 4 5 6 7 8 9 10
    Records 1 a b c d e f g h i j
    records 2 a k l d m f n h o p

    So since my PK are alone same and the rest is also different the Loader is ommiting those records. Can you tell me a script by which i can omit the record 2.
    Please help.. We are in brink of issues to be fixed before tomorrow evening.

    Thanks in advance
    Sam
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2006
    Posts
    2,436
    Rep Power
    1688
    The obvious question is how do you know it is row 2 you wish to skip? I am guessing you want the first row inserted (regardless of values of any no-PK column) and any subsequent ones ignored?
    To be honest this may be simpler in mySQL by adding making the command INSERT IGNORE ...

    Going the other way, not sure what it would do, but you could try using something like:
    sort - u -k1,1 -k4,4 -k6,6 -k8,8 to see what happens
    The moon on the one hand, the dawn on the other:
    The moon is my sister, the dawn is my brother.
    The moon on my left and the dawn on my right.
    My brother, good morning: my sister, good night.
    -- Hilaire Belloc
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2013
    Posts
    5
    Rep Power
    0
    Thanks Simon...
    Will check on that....


    Other suggestions are welcomed.
  6. #4
  7. Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Sep 2006
    Posts
    840
    Rep Power
    387

    Cool


    Originally Posted by samjoshuab
    Hi all,

    I have a issues . . .
    . . .
    2) the PK which are forming the composite columns are same for 2 records , but the other columns are different which is also rejected and only one is getting loaded. PFB an example for the same.
    . . . E t c . . .
    Thanks in advance
    Sam
    If you know which record it is you want to ignore, why waste all this time and not just remove it manually?
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2013
    Posts
    5
    Rep Power
    0
    Originally Posted by LKBrwn_DBA
    If you know which record it is you want to ignore, why waste all this time and not just remove it manually?
    Hey...
    these files are daily getting loaded. Daily i cant sit on it to remove the duplicates right.. more over its a 20 million flat file..
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2006
    Posts
    2,436
    Rep Power
    1688
    I'd see how the INSERT IGNORE works for you to be honest, especially if the creating app/software can do that for you automagically.
    If you start 'monkeying' with the file after it's arrival you'll need to bear a few things in mind. Always make sure you have enough disk space (it may sound stupid, but finding out that you've been truncating the file for a week or two is NO fun!). Most *nix commands will not edit a file in place (I believe sed can do so) so you'll always need at least the same amount of space again. Plus the /tmp (and /var/tmp depending on version of *nix) file system needs to have a bit of elbow room.
    Check that each stage has worked (as best you can) and log things if/when they go wrong (even if that's 'just' to the syslog), and log the start/stop of the overall process.
    The moon on the one hand, the dawn on the other:
    The moon is my sister, the dawn is my brother.
    The moon on my left and the dawn on my right.
    My brother, good morning: my sister, good night.
    -- Hilaire Belloc

IMN logo majestic logo threadwatch logo seochat tools logo