#1
Hi all,
I find myself in a situation with a database of 30,000+ quotes, one-liners, MOTDs etc. which I need to do some housekeeping on (it's up on http://www.motd.org.uk if you want to see exactly what I mean). The problem is that there are many 'similar' duplicate entries: not identical, but with differences in punctuation, grammar, similes etc. I'd really prefer not to have to edit 30,000 rows by hand, so I've been trying to figure out a way to automate this, but I've never needed to do this kind of thing and am a little (well, very) unsure of the best way to go.

My best idea so far is:
1. Atomize all the quotes.
2. Remove irrelevant words (it, not, is, the, a, etc.), punctuation, formatting etc.
3. Try to match the remaining core of words against the other entries, i.e. 90% of similar words might indicate a very similar quote.

Can anyone give me any tips on this? Do you think I'm on a reasonable track here? Ta muchly!
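The idea in this post could be sketched roughly as below. This is only an illustration under assumptions: the stopword list is the one from the post, the names `core_words`, `overlap`, `similarity` and `similar_pairs` are made up for the example, and "90% of similar words" is read here as shared words divided by all distinct words in the two quotes (Jaccard overlap), which is one possible interpretation.

```python
import re
from itertools import combinations

# Stopwords taken from the post; a real list would be longer.
STOPWORDS = {"it", "not", "is", "the", "a"}

def core_words(quote):
    """Lowercase the quote, drop punctuation, and return its non-stopword words as a set."""
    words = re.findall(r"[a-z0-9]+", quote.lower())
    return set(words) - STOPWORDS

def overlap(words_a, words_b):
    """Shared words divided by all distinct words (0.0 if either set is empty)."""
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def similarity(a, b):
    """Word overlap between two quotes, from 0.0 (nothing shared) to 1.0."""
    return overlap(core_words(a), core_words(b))

def similar_pairs(quotes, threshold=0.9):
    """Return index pairs (i, j) whose core words overlap at or above the threshold."""
    cores = [core_words(q) for q in quotes]
    return [
        (i, j)
        for i, j in combinations(range(len(cores)), 2)
        if overlap(cores[i], cores[j]) >= threshold
    ]
```

Comparing every pair is quadratic, so on 30,000 rows this would be slow without some pre-grouping. Dropping "not" also means a quote and its negation can come out as matches, so the flagged pairs probably still want a quick look by eye.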
#2
Are you on a Unix box? Regular expressions can do a lot to help, and so can sed or grep.
Basically, what you do is what you've started:
1. Remove the articles (the, a, an).
2. Remove all prepositions.
3. Remove all pronouns (them, those, etc.).
4. Remove all punctuation marks.
5. Remove all exclamations (like Oh or Hey).
6. Sort the remaining words in each line, removing duplicates.
7. Create a tag, like the original line number, and add it to each line.
8. Sort the entire tag file line by line to remove duplicates.

The problem of semantics now arises. Some words have identical meanings: is=be=am, was=were, for example. So you need to weed out that problem. You should now be left with a tag file with the line numbers of lines that are fairly unique.
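The steps above could look something like the following. It is a sketch, not a finished tool: the word lists in `DROP` and `CANONICAL` are small illustrative samples, the function names are invented for the example, and a dictionary keyed on the sorted-word signature stands in for the "sort the tag file" step.

```python
import re
from collections import defaultdict

# Illustrative samples only; real lists of articles, prepositions,
# pronouns and exclamations would be much longer.
DROP = {
    "the", "a", "an",                   # articles
    "on", "in", "of", "to", "at",       # prepositions
    "them", "those", "it", "they",      # pronouns
    "oh", "hey",                        # exclamations
}

# Words treated as identical in meaning (is=be=am, was=were).
CANONICAL = {"is": "be", "am": "be", "were": "was"}

def signature(line):
    """Strip punctuation and dropped words, then return the sorted unique words."""
    words = re.findall(r"[a-z0-9]+", line.lower())
    kept = {CANONICAL.get(w, w) for w in words if w not in DROP}
    return " ".join(sorted(kept))

def group_duplicates(lines):
    """Tag each line with its 1-based number and group numbers sharing a signature."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        groups[signature(line)].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]
```

For example, `group_duplicates(["Oh, the cat is on the mat!", "The cat on a mat is.", "Dogs bark."])` puts lines 1 and 2 in the same group, because both reduce to the signature `be cat mat`. Note that this only catches lines whose remaining words match exactly, so near misses still slip through.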
#3
If you have Perl, you can try the Soundex module:
http://search.cpan.org/dist/perl/lib/Text/Soundex.pm. It is supposed to be able to match words with similar pronunciation. That doesn't address the issue of grammatical similarity, though.
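To show what Soundex-style matching does, here is a small Python sketch of the classic American Soundex coding. It is not the Perl module's code or API, and the helper name `soundex_key` is made up for the example; it only illustrates how words that sound alike end up with the same four-character code.

```python
# Consonant groups of classic Soundex; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
    ("l", "4"), ("mn", "5"), ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code for a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Skip letters that repeat the previous code (e.g. the "ck" in "jack").
        if code and code != previous:
            result += code
        # h and w do not separate two letters with the same code; vowels do.
        if ch not in "hw":
            previous = code
    return (result + "000")[:4]

def soundex_key(quote):
    """Soundex code of every word in a quote, in order (hypothetical helper)."""
    return tuple(soundex(w) for w in quote.split() if soundex(w))
```

With this, `soundex("colour")` and `soundex("color")` give the same code, so spelling variants of a quote can produce the same `soundex_key`. Unrelated words can collide too (for instance "Robert" and "Rupert"), so it is probably best used as a first filter rather than as proof of a duplicate.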
#4
Ah, fantastic, thanks all! I'll be merrily coding my way into a bright dupe-free future now!
Viewing: Dev Shed Forums > Programming Languages - More > Software Design > Advice needed on approximate text matching