SunQuest
           UNIX Help
 
Dev Shed Forums > Operating Systems > UNIX Help

  #1  
September 20th, 2004, 06:43 AM
benwylie
Bash script to find and get rid of duplicate files in a folder

I run a mail server which saves copies of both incoming and outgoing emails. Unfortunately, the way it is set up, if someone sends an outgoing email to more than one recipient, it saves more than one copy of the email. This gets a little silly if they are sending large emails to lots of different people.

So I would like to have a bash script which will check for duplicate files in a folder and delete the duplicates, leaving only one copy of each.

I know that the cmp command will check one file against another, but I don't know how to get it to delete one of them if the files do not differ.

Also, this will only be useful if it can go through a whole folder checking for duplicates in it. It isn't helpful if I have to tell it which files to compare.

I'm sorry, I am new to bash scripts (I run them via cygwin on windows 2003 as they seem to have much more functionality than the windows command prompt does) and so I am unable to do this myself.

Any help would be much appreciated.

Thank you.

Ben

  #2  
September 20th, 2004, 08:08 AM
guggach
Reconfiguring the mail server would be the better way to do that,
but why not, just for fun.
Not tested, very slow on large dirs; perl is faster.

Delete 'echo' before 'rm' to activate.
Code:
#!/usr/bin/sh
ALL=`find . -type f`

for ONE in $ALL
do [ -f $ONE ] || continue  # already deleted
    for TWO in $ALL
    do [ -f $TWO ] || continue # NOT sure it's needed
         case $ONE in $TWO) continue;; esac # really the same file
         cmp -s $ONE $TWO && echo rm -f $TWO
    done
done
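
On the "very slow on large dirs" remark: the nested loops run cmp once for every pair of files. One common alternative, which is an assumption here and not something proposed in the thread, is to checksum each file once, sort by checksum, and confirm with cmp only when neighbouring checksums agree. A minimal sketch, where the name dups_by_cksum is made up and file names are assumed to contain no newlines or trailing blanks:

```shell
#!/bin/sh
# Sketch: print the duplicates in a directory using one checksum per file.
# dups_by_cksum is a hypothetical name. Assumes file names contain no
# newlines and no trailing blanks.
dups_by_cksum() {
    prev_key='' keep=''
    # cksum prints "<crc> <size> <name>"; sorting puts equal crc/size together.
    find "$1" -type f -exec cksum {} + | LC_ALL=C sort |
    while read -r sum size name; do
        if [ "$sum $size" = "$prev_key" ] && cmp -s "$keep" "$name"; then
            printf '%s\n' "$name"      # byte-for-byte identical to $keep
        else
            prev_key="$sum $size"      # start of a new group
            keep=$name
        fi
    done
}
```

It only prints names, so the output can be checked before anything is removed. In the rare case of two different files sharing a checksum, a duplicate sorted after the colliding file may be missed.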

  #3  
September 20th, 2004, 08:26 AM
benwylie
Thanks very much for the script.

Unfortunately it seems to delete not only the duplicates but all copies of them, and in fact all files in the dir!

  #4  
September 20th, 2004, 09:47 AM
guggach
I don't believe it. Maybe a typo?

  #5  
September 20th, 2004, 10:52 AM
benwylie
I tested it again, and this time it didn't delete all files, but it did delete all copies of the duplicated file.

I have a folder with:
1.txt
2.txt
3.txt
4.txt
4a.txt
4b.txt

1,2,3 are unique
4,4a,4b are identical.

With echo on, this is the output:

$ bash "f:/cygwin/bin/duplicates.sh"
rm -f ./4a.txt
rm -f ./4b.txt
rm -f ./4.txt
rm -f ./4b.txt
rm -f ./4.txt
rm -f ./4a.txt

This appears to want to remove all three identical files.

Thanks
Ben
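
A note on reading that output: while 'echo' stands in front of 'rm', nothing is actually removed, so the [ -f ... ] checks never skip anything and every identical pair is printed in both directions. A minimal sketch of a dry run that remembers what it has already marked, so each duplicate is listed once and the first copy is kept. The name list_dups, the use of mktemp, and the assumption that no file name contains a newline are illustrative, not from the thread:

```shell
#!/bin/sh
# Dry-run sketch: print one "rm -f" line per duplicate, keeping the first copy.
# list_dups is a hypothetical name; assumes mktemp exists and that no
# file name contains a newline.
list_dups() {
    list=$(mktemp) || return 1
    marked=$(mktemp) || return 1
    find "$1" -type f | LC_ALL=C sort > "$list"

    while IFS= read -r one; do
        # Skip files already marked as a duplicate of an earlier file.
        grep -Fxq -e "$one" "$marked" && continue
        while IFS= read -r two; do
            [ "$one" = "$two" ] && continue          # the same file
            grep -Fxq -e "$two" "$marked" && continue
            if cmp -s "$one" "$two"; then
                printf '%s\n' "$two" >> "$marked"    # remember it
                printf 'rm -f %s\n' "$two"
            fi
        done < "$list"
    done < "$list"
    rm -f "$list" "$marked"
}
```

For the six-file folder above, list_dups . would be expected to print only the lines for ./4a.txt and ./4b.txt.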

  #6  
September 20th, 2004, 11:00 AM
benwylie
In fact, just after posting I tried it with some of the saved emails, and it is trying to delete unique files as well.
My emails are saved in the format [date][time][from][to].txt
The two emails here are unique and are marked for deletion by the script. Files 1, 2 and 3 are not marked for deletion.

$ bash "f:/cygwin/bin/duplicates.sh"
rm -f ./4a.txt
rm -f ./4b.txt
rm -f ./4.txt
rm -f ./4b.txt
rm -f ./4.txt
rm -f ./4a.txt
rm -f ./[2004-09-20][165111][user@hermes.cam.ac.uk][mydad@ourdomain.com].txt
rm -f ./[2004-09-20][165426][user@gmail.com][me@ourdomain.com].txt
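
A likely reason the unique emails are listed: in 'case $ONE in $TWO)' the unquoted $TWO is used as a pattern, so a part like [date] is read as a bracket expression matching a single character instead of literal text. Such a name does not match itself, the "really the same file" check is skipped, and cmp then compares the file with itself. A small sketch of the difference, where the function names and the sample file name are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Unquoted pattern: [ ] in $2 is treated as a bracket expression.
same_unquoted() {
    case $1 in $2) return 0;; esac
    return 1
}

# Quoted pattern: $2 is compared as literal text.
same_quoted() {
    case $1 in "$2") return 0;; esac
    return 1
}

name='./[mail][one].txt'
same_unquoted "$name" "$name" && echo "unquoted: same" || echo "unquoted: different"
same_quoted   "$name" "$name" && echo "quoted: same"   || echo "quoted: different"
# prints:
#   unquoted: different
#   quoted: same
```

A plain name such as ./4.txt contains no pattern characters, which would explain why files 1, 2 and 3 were handled correctly.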

  #7  
September 20th, 2004, 11:27 AM
guggach
Don't you know what [] is for in shells?

In fact, if you love trouble, generate
filenames containing special characters and metacharacters, like

aaa;bbb
aaa[bbb]
aaa&bbb
aaa`bbb

and so on, including tabs, vertical tabs, backspaces, spaces,
newlines and all the rest (ctrl, esc).

99% of all shells, perls, awk ... will give up (which is still survivable)
or cause chaos.

Personally I know two languages that handle this correctly: 'c++' and its (good, old) predecessor 'c'.
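
For what it is worth, a shell script can cope with such names as long as every expansion is quoted and the names are never passed through a word-split list. A minimal sketch under those assumptions, handing the names from find straight to an inner sh as arguments. The name remove_dups and its "delete" argument are made up for illustration, and which copy survives depends on the order find happens to list the files in:

```shell
#!/bin/sh
# Sketch only. Usage: remove_dups DIR [delete]
# Without "delete" it only prints what it would remove.
# remove_dups and the "delete" switch are hypothetical names.
remove_dups() {
    find "$1" -type f -exec sh -c '
        mode=$1; shift
        while [ "$#" -gt 1 ]; do
            one=$1; shift
            [ -f "$one" ] || continue        # removed earlier as a duplicate
            for two do                       # only files listed after $one
                [ -f "$two" ] || continue
                if cmp -s "$one" "$two"; then
                    if [ "$mode" = delete ]; then
                        rm -f "$two"
                    else
                        printf "would remove: %s\n" "$two"
                    fi
                fi
            done
        done
    ' sh "${2:-print}" {} +
}
```

Try it on a scratch copy of the folder first. For a very large directory find may split the names over several sh invocations, in which case duplicates that land in different batches are not compared.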

Last edited by guggach : September 20th, 2004 at 11:38 AM. Reason: typo

  #8  
September 20th, 2004, 11:34 AM
benwylie
I'm afraid I am completely clueless.

Are these file names going to cause a problem?

Is there any way to avoid this?

I tried it without the echo, and you are correct that it did leave one copy, 4.txt. The echo output was confusing me, as it seemed to want to delete this one as well.

Thanks
Ben

  #9  
September 20th, 2004, 11:52 AM
guggach
Ben, may I kindly suggest that you read a book?
Try the following:

mkdir /tmp/aaa
cd /tmp/aaa
touch 1 2 3 4 5 6 7 8 9 0 # these are filenames
Then enter:
ls [0-9]
Look at the output and suppose it was 'rm', not 'ls'.

A last point: think unix, not the OS you happen to be running.
bash is good, but neither standard nor native.

Write code that runs on all *nix; for scripts, use the old Bourne sh,
which is easy to write, read and maintain.

If it's a performance issue, try 'perl' or, better, 'c' and 'c++'.
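
The same experiment with the file name held in a variable, which is the situation the script is in. A minimal sketch, assuming mktemp -d is available for a scratch directory; glob_demo is a made-up name:

```shell
#!/bin/sh
# Sketch: an unquoted variable holding [0-9] expands as a pattern,
# a quoted one stays literal. Runs in a subshell inside a scratch directory.
glob_demo() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch 1 2 3 '[0-9]'      # the last file is literally named [0-9]
    name='[0-9]'
    printf 'unquoted: %s\n' "$(echo $name)"   # the shell expands the pattern
    printf 'quoted: %s\n' "$name"             # passed through unchanged
    cd / && rm -rf "$dir"
)
glob_demo
# prints:
#   unquoted: 1 2 3
#   quoted: [0-9]
```

With 'rm' in place of 'echo', the unquoted form would remove 1, 2 and 3 and leave the file that was meant.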

  #10  
September 20th, 2004, 12:24 PM
benwylie
Thank you very much for your help and advice.

I don't really have any programming experience, but have been trying to write scripts which will make it simpler and less time-consuming to manage our server.

I tried writing batch files for the windows command prompt, but found that it doesn't support the commands which I was trying to use.

So I was advised to use bash scripts via cygwin, which is what I have been trying to do.

I have looked up lots of websites which tell you about the various commands you can use when writing shell scripts, but hadn't cottoned on to the problems with [ and ] in file names.

Thank you very much for your help, and I shall see if someone in the perl forum can point me in the right direction.

Thanks again.
Ben

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 3 hosted by Hostway