#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0

    Bash script to find and get rid of duplicate files in a folder


    I run a mail server which saves copies of both incoming and outgoing emails. Unfortunately, the way it is set up, if someone sends an outgoing email to more than one recipient, it saves more than one copy of the email. This gets a little silly if they are sending large emails to lots of different people.

    So I would like a bash script which will check for duplicate files in a folder and delete the duplicates, leaving only one copy of each.

    I know that the cmp command will check one file against another, but I don't know how to get it to delete one of them if there are no differences between the files.

    Also, this will only be useful if it can go through a whole folder checking for duplicates by itself; it isn't helpful if I have to tell it which files to compare.

    I'm sorry, I am new to bash scripts (I run them via Cygwin on Windows 2003, as they seem to have much more functionality than the Windows command prompt does), so I am unable to do this myself.

    Any help would be much appreciated.

    Thank you.

    Ben
#2
    Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jul 2004
    Location
    Middle Europa
    Posts
    1,200
    Rep Power
    14
    Not the best way to do this (reconfiguring the mail server would be better), but why not, just for fun. Not tested, and it will be very slow on large directories; Perl is faster.

    Delete the 'echo' before 'rm' to activate it.
    Code:
    #!/usr/bin/sh
    ALL=`find . -type f`
    
    for ONE in $ALL
    do [ -f $ONE ] || continue  # skip files already deleted on an earlier pass
        for TWO in $ALL
        do [ -f $TWO ] || continue # probably not needed, but harmless
             case $ONE in $TWO) continue;; esac # skip comparing a file with itself
             cmp -s $ONE $TWO && echo rm -f $TWO
        done
    done
#3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0
    Thanks very much for the script.

    Unfortunately it seems to delete not only the duplicates but all copies of them, and in fact all the files in the directory!
#4
    Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jul 2004
    Location
    Middle Europa
    Posts
    1,200
    Rep Power
    14
    I don't believe it; maybe a typo?
#5
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0
    I tested it again, and this time it didn't delete all the files, but it did delete all copies of the duplicated file.

    I have a folder with:
    1.txt
    2.txt
    3.txt
    4.txt
    4a.txt
    4b.txt

    1,2,3 are unique
    4,4a,4b are identical.

    With echo on, this is the output:

    $ bash "f:/cygwin/bin/duplicates.sh"
    rm -f ./4a.txt
    rm -f ./4b.txt
    rm -f ./4.txt
    rm -f ./4b.txt
    rm -f ./4.txt
    rm -f ./4a.txt

    This appears to want to remove all three identical files.

    Thanks
    Ben
#6
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0
    In fact, just after posting, I tried it with some of the saved emails, and it is trying to delete unique files as well.
    My emails are saved in the format [date][time][from][to].txt
    The two emails here are unique, yet they are marked for deletion by the script. Files 1, 2 and 3 are not marked for deletion.

    $ bash "f:/cygwin/bin/duplicates.sh"
    rm -f ./4a.txt
    rm -f ./4b.txt
    rm -f ./4.txt
    rm -f ./4b.txt
    rm -f ./4.txt
    rm -f ./4a.txt
    rm -f ./[2004-09-20][165111][user@hermes.cam.ac.uk][mydad@ourdomain.com].txt
    rm -f ./[2004-09-20][165426][user@gmail.com][me@ourdomain.com].txt
#7
    Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jul 2004
    Location
    Middle Europa
    Posts
    1,200
    Rep Power
    14
    Don't you know what [] is for in shells?

    In fact, if you like trouble, generate filenames containing special characters and metacharacters, like

    aaa;bbb
    aaa[bbb]
    aaa&bbb
    aaa`bbb

    and so on, including tabs, vertical tabs, backspaces, spaces,
    newlines and all the rest (control and escape characters).

    99% of all shells, Perl, awk ... will either give up (which is still livable)
    or cause chaos.

    Personally, I know two languages that handle this correctly: C++ and its (good, old) predecessor C.
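    That is most likely what bit you here: when the unquoted $TWO is used as a case pattern, a name full of [...] no longer matches even its own literal spelling, so the self-comparison is never skipped and cmp then reports each file identical to itself. Below is a minimal sketch, assuming bash and GNU find (an illustration only, not tested on Cygwin), of the same brute-force loop with the names passed NUL-separated and every expansion quoted:
    Code:
    #!/bin/bash
    # Sketch only: read NUL-separated names from find so that spaces, newlines
    # and glob characters such as [ and ] are never re-interpreted by the shell.
    find . -type f -print0 |
    while IFS= read -r -d '' ONE; do
        [ -f "$ONE" ] || continue                      # skip files already removed
        while IFS= read -r -d '' TWO; do
            [ -f "$TWO" ] || continue
            [ "$ONE" = "$TWO" ] && continue            # plain string compare, no globbing
            cmp -s "$ONE" "$TWO" && echo rm -f "$TWO"  # delete the echo to really remove
        done < <(find . -type f -print0)
    done

    The quoting is the whole point: inside double quotes, the brackets, spaces and the rest are just ordinary characters.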
    Last edited by guggach; September 20th, 2004 at 11:38 AM. Reason: typo
#8
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0
    I'm afraid I am completely clueless.

    Are these file names going to cause a problem?

    Is there any way to avoid this?

    I tried it without the echo, and you are correct that it did leave the one copy, 4.txt; the echo output was confusing me, as it seemed to want to delete that one as well.

    Thanks
    Ben
#9
    Contributing User
    Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jul 2004
    Location
    Middle Europa
    Posts
    1,200
    Rep Power
    14
    Ben, may I kindly suggest that you read a book? Try the following:

    mkdir /tmp/aaa
    cd /tmp/aaa
    touch 1 2 3 4 5 6 7 8 9 0 # these are filenames
    Then enter:
    ls [0-9]
    Look at the output, and imagine it had been 'rm' rather than 'ls'.
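    The way out is quoting: a quoted pattern is handed to the command untouched. A small extension of the same example (the NAME variable is only for illustration):
    Code:
    # still in /tmp/aaa
    ls [0-9]        # unquoted: the shell expands the pattern to 0 1 2 ... 9 before ls runs
    ls '[0-9]'      # quoted: ls is asked for one file literally named [0-9]
    NAME='[0-9]'
    ls $NAME        # unquoted variable: expanded again, same as the first line
    ls "$NAME"      # double-quoted variable: the brackets stay literal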

    One last point: think Unix, not the OS you happen to be running.
    bash is good, but it is neither standard nor native everywhere.

    Write code that runs on all *nix; for scripts, use the easy to
    write/read/maintain old Bourne sh.

    If it is a performance issue, try Perl, or better, C or C++.
#10
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2004
    Posts
    24
    Rep Power
    0
    Thank you very much for your help and advice.

    I don't really have any programming experience, but I have been trying to write scripts which will make it simpler and less time-consuming to manage our server.

    I tried writing batch files for the Windows command prompt, but found that it doesn't support the commands I was trying to use.

    So I was advised to use bash scripts via Cygwin, which is what I have been trying to do.

    I have looked up lots of websites which tell you about the various commands you can use when writing shell scripts, but I hadn't cottoned on to the problems with [ and ] in file names.

    Thank you very much for your help, and I shall see if someone in the Perl forum can point me in the right direction.

    Thanks again.
    Ben
#11
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    2
    Rep Power
    0

    Need to add a check of the file name


    You need to make sure that you skip the comparison of ONE against TWO when they are the same file, by checking the name.
    Code:
    #!/bin/bash
    ALL=`find . -type f`
    
    for ONE in $ALL;do
      for TWO in $ALL;do
        if [ "$ONE" == "$TWO" ];then
          echo "Do not delete myself"
        else
          #case $ONE in $TWO) continue;; esac
          #cmp -s "$ONE" "$TWO" && rm -f "$TWO"
          cmp -s "$ONE" "$TWO" && echo "Delete file $TWO"
        fi
      done
    done
#12
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    2
    Rep Power
    0

    Faster version


    A faster version. If you know how to run the md5sum section in child processes, let me know.

    Code:
    #!/bin/bash
    declare -a list=( ./* )
    declare -a sums
    cnt=${#list[@]}
    
    echo "creating md5sum list"
    for ((x = 0; x < cnt; x++))       # every file needs a sum, including the last one
    do
        sums[$x]=`md5sum "${list[$x]}" | cut -d ' ' -f 1`
        progress=$(echo "scale=2;($x/$cnt)*100" | bc)
        echo -ne "progress $progress %\r"
    done
    
    echo "doing compare"
    for ((x = 0; x < cnt - 1; x++))
    do
      for ((y = x + 1; y < cnt; y++))
      do
        if [ "${sums[$x]}" == "${sums[$y]}" ];then
          if [ "${list[$x]}" != "${list[$y]}" ];then
            #remove '#' in next line to enable
            echo "Delete file ${list[$y]}" # && rm -f "${list[$y]}"
          fi
        fi
      done
    done
#13
    Contributing User
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Mar 2006
    Posts
    2,448
    Rep Power
    1751
    You do realise you are responding to a six-and-a-half-year-old thread here?

    Anyway, have a look at dropping the md5sums into a nohup-ed background batch and, outside the loop (after the done), putting a wait. But that screws up the progress display, so ... up to you!
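    A rough sketch of that idea, assuming bash 4 and the GNU tools (untested, and as said, you lose the progress display): compute all the sums in one background job, wait for it, then keep only the first file seen for each sum.
    Code:
    #!/bin/bash
    # Sketch only: hash every file in a single background job, wait for it,
    # then delete every file after the first one seen with a given checksum.
    # Assumes bash 4 (associative arrays) and filenames without newlines.
    tmp=$(mktemp)
    find . -type f -print0 | xargs -0 md5sum > "$tmp" &   # hashing runs in the background
    wait                                                  # block until the sums are ready
    
    declare -A seen
    while read -r sum file; do
        if [ -n "${seen[$sum]}" ]; then
            echo rm -f "$file"            # remove the echo to really delete
        else
            seen[$sum]=1
        fi
    done < "$tmp"
    rm -f "$tmp"

    nohup itself is only needed if the job has to survive the terminal closing; inside a script a plain & plus wait does the same duty here.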
    The moon on the one hand, the dawn on the other:
    The moon is my sister, the dawn is my brother.
    The moon on my left and the dawn on my right.
    My brother, good morning: my sister, good night.
    -- Hilaire Belloc
