#1
  1. Cast down
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2003
    Location
    Sweden
    Posts
    321
    Rep Power
    12

    How to kill duplicates in a file?


    In a text file, how do you go about killing duplicates?
  2. #2
  3. Contributing User
    Devshed Supreme Being (6500+ posts)

    Join Date
    Jan 2003
    Location
    USA
    Posts
    7,255
    Rep Power
    2222
    Bummer that you're doing it in DOS/Windows. In Linux, there's a command, uniq, which removes adjacent duplicate lines, so it is normally run on a sorted file. It does not change the input file; it writes the result to a second file or to stdout.

    You might want to see if you can find a DOS port of it. Or find the source code and create your own port.
  4. #3
  5. Cast down
    Even if I were on Linux I'd still want to know, I really have no use, I am just curious as to how to do it.
  6. #4
  7. Contributing User
    Originally posted by movEAX_444
    Even if I were on Linux I'd still want to know, I really have no use, I am just curious as to how to do it.
    Well, in Linux you would use the uniq command as a filter.

    To do it programmatically -- off the top of my head -- you would sort the file first, then read one line at a time while keeping the previous line. If two adjacent lines are the same, they are duplicates and only one copy should be output. If they differ, output the new line.

    Let's look at this step-wise:
    1. Declare two string buffers, sNew and sOld.
    2. Read the first line into sOld and output it.
    3. Read the next line into sNew.
    4. Compare sNew and sOld.
    5. If they are the same, then do nothing (thus discarding the duplicate in sNew).
    6. If they are different, then output sNew and copy sNew to sOld.
    7. Repeat Steps 3 - 6 until EOF is reached.
    8. Close the input and output files.
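
The steps above could be sketched roughly like this in Python. The function name drop_adjacent_duplicates and the in-memory StringIO "files" are my own choices for illustration, not part of the original post, and the sketch assumes every line (including the last) ends with a newline.

```python
import io

def drop_adjacent_duplicates(infile, outfile):
    """Copy lines from infile to outfile, skipping any line equal to the one before it."""
    s_old = infile.readline()      # Step 2: read the first line into s_old
    if not s_old:                  # empty input: nothing to output
        return
    outfile.write(s_old)           # Step 2: output the first line
    for s_new in infile:           # Steps 3 and 7: read lines until EOF
        if s_new != s_old:         # Steps 4-6: output only when the line changes
            outfile.write(s_new)
            s_old = s_new

# Assumption: the input is already sorted, as the post requires.
src = io.StringIO("apple\napple\nbanana\ncherry\ncherry\n")
dst = io.StringIO()
drop_adjacent_duplicates(src, dst)
print(dst.getvalue(), end="")
```

With real files you would pass objects returned by open() instead of StringIO, ideally inside a with block so Step 8 (closing both files) happens automatically.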

    Of course, the input file must be sorted to make that work. Both DOS and Linux have the command-line filter, sort.
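
To see why sorting matters, here is a small Python sketch (the helper name drop_adjacent and the sample list are made up for illustration). Comparing only neighbouring lines misses duplicates that are not next to each other; sorting first brings them together. Note that Python's sorted() orders strings by code point, which may not match what a particular sort command does.

```python
def drop_adjacent(lines):
    """Return lines with each run of identical neighbours collapsed to one entry."""
    out = []
    for line in lines:
        if not out or line != out[-1]:   # keep the line only if it differs from the previous kept one
            out.append(line)
    return out

lines = ["pear", "apple", "pear", "fig", "apple"]

# Unsorted: no two equal lines are adjacent, so nothing is removed.
unsorted_result = drop_adjacent(lines)

# Sorted first: equal lines become neighbours and the duplicates are dropped.
sorted_result = drop_adjacent(sorted(lines))

print(unsorted_result)
print(sorted_result)
```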
