Software Design
 
Dev Shed Forums > Programming Languages - More > Software Design

#1  September 13th, 2004, 06:12 PM
booga
Advice needed on approximate text matching

Hi all,

I find myself in a situation with a database of 30,000+ quotes, one-liners, MOTDs etc. which I need to do some housekeeping on. (It's up on http://www.motd.org.uk if you want to see exactly what I mean.)

The problem is that there are many 'similar' duplicate entries - not identical, but with differences in punctuation, grammar, similies etc.

I'd really prefer not to have to edit 30,000 rows by hand, so I've been trying to figure out a way to automate this - but I've never needed to do this kind of thing and am a little (well, very) unsure of the best way to go.

My best idea so far is:

Atomize all the quotes (split each one into individual words),
Remove irrelevant words (it, not, is, the, a, etc.), punctuation, formatting etc.,
Try to match the remaining core of words against the other entries, i.e. two quotes that share 90% of their words are probably near-duplicates.
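
The three steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the real database: the stop-word list, the function names and the use of Jaccard similarity (shared words divided by all distinct words) as the "percentage of similar words" are assumptions.

```python
import string

# Hypothetical stop-word list taken from the examples above; extend as needed.
STOP_WORDS = {"it", "not", "is", "the", "a"}

def core_words(quote):
    """Lower-case the quote, strip punctuation and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = quote.lower().translate(table).split()
    return {w for w in words if w not in STOP_WORDS}

def jaccard(a, b):
    """Shared words divided by all distinct words, from 0.0 to 1.0."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def near_duplicates(quotes, threshold=0.9):
    """Return (i, j, score) for every pair scoring at or above the threshold."""
    cores = [core_words(q) for q in quotes]
    pairs = []
    for i in range(len(cores)):
        for j in range(i + 1, len(cores)):
            score = jaccard(cores[i], cores[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Comparing every pair is quadratic, so 30,000 rows means roughly 450 million comparisons; that is probably acceptable for a one-off cleanup but slow.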

Can anyone give me any tips on this? Do you think I'm on a reasonable track here?

Ta muchly!

#2  September 14th, 2004, 10:04 AM
jim mcnamara
Are you on a Unix box? Regular expressions can do a lot to help, and so can sed or grep.

Basically, what you do is what you've started -
remove the articles - the, a, an
remove all prepositions
remove the pronouns - them, those etc.
remove all punctuation marks
remove all interjections (like Oh or Hey)
sort the remaining words in each line, removing duplicates
create a tag - like the original line number - and add it to each line
sort the entire tag file line by line, so that lines with the same words end up next to each other and the duplicates can be removed
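
The same recipe can be sketched in Python instead of sed and grep. This is a rough sketch under stated assumptions: the DROP list is a deliberately tiny, hypothetical stand-in for full lists of articles, prepositions, pronouns and interjections, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, incomplete list of words to discard.
DROP = {"the", "a", "an", "of", "in", "on", "to", "them", "those", "oh", "hey"}

def signature(line):
    """Reduce a line to its sorted, de-duplicated core words."""
    words = re.findall(r"[a-z0-9]+", line.lower())  # also strips punctuation
    return " ".join(sorted(set(words) - DROP))

def group_duplicates(lines):
    """Tag each line with its 1-based number and group lines by signature."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        groups[signature(line)].append(number)
    # Only signatures shared by more than one line indicate duplicates.
    return [numbers for numbers in groups.values() if len(numbers) > 1]
```

Grouping in a dictionary plays the role of sorting the tag file: lines with identical signatures land in the same bucket, and the line numbers say which original rows to review.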

The problem of semantics now arises. Some words are effectively equivalent:
is=be=am, was=were for example.
So you need to weed out that problem by treating each such group as one word.
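
One simple way to handle this is a lookup table that maps each variant to a single canonical form before the lines are compared. The table below holds only the two groups mentioned above; its name and the choice of "be" and "was" as the canonical forms are assumptions, and a real list would need to be much longer.

```python
# Hypothetical mapping built from the examples above: is=be=am, was=were.
CANONICAL = {"is": "be", "am": "be", "were": "was"}

def canonicalize(words):
    """Replace each word with its canonical form, if one is listed."""
    return [CANONICAL.get(word, word) for word in words]
```

For example, canonicalize("i am what i was".split()) gives the same word list as canonicalize("i is what i were".split()), so the two lines would compare as equal.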

You should now be left with a tag file holding the line numbers of the lines that are fairly unique.

#3  September 15th, 2004, 03:26 PM
imchi
If you have Perl, you can try the Soundex module:
http://search.cpan.org/dist/perl/lib/Text/Soundex.pm
It is designed to match words with similar pronunciation.
That doesn't address the issue of grammatical similarity, though.
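
To show what Soundex does, here is a small Python sketch of the commonly described American Soundex rules (keep the first letter, encode the following consonants as digits, collapse repeats, pad to four characters). It is not the Perl module's code, and the module may differ in edge cases.

```python
import string

# Standard Soundex digit groups; vowels, y, h and w carry no digit.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}

def soundex(word):
    """Return the 4-character Soundex code, or "" if word has no ASCII letters."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")
```

Words that sound alike, such as "Robert" and "Rupert", get the same code, so comparing codes instead of spellings would catch some spelling variants between otherwise identical quotes.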

#4  September 15th, 2004, 03:35 PM
booga
Ah, fantastic - thanks all! I'll be merrily coding my way into a bright dupe-free future now!

© 2003-2008 by Developer Shed. All rights reserved.