Forums: » Register « |  Free Tools |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support |

New Free Tools on Dev Shed!

#1
July 8th, 2012, 04:50 AM
 jepot
Registered User

Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Complex string matching based on list items

I have two lists like:

Code:
`list1 = ['hello', 'are', 'you', 'there'] #list of correct strings`

and

Code:
`list2 = ['There1', '1 are 1', 'helloooo', 'you'] #list of random strings that have similarity to list1's items but not in order`

list1 and list2 have the same number of items all the time.

I would like to fix the items in list2 by finding a closest match for each item in list2 from all the items in list1 and substituting that closest matching word to list2 instead, or adding it to a new list and returning that list. Since list1 and list2 have the same number of items all the time, I would also like to make it not possible to have two items from list2 match to a single item in list1 - ie each item in list2 corresponds to a unique item in list1. list1 will never have 2 or more identical words. A possible output (if item matches are added to a new list):

Code:
`newlist = ['there', 'are', 'hello', 'you']`

#2
July 8th, 2012, 05:27 AM
 sepp2k1
Contributing User

Join Date: Apr 2012
Posts: 75
Time spent in forums: 1 Day 23 h 21 m 30 sec
Reputation Power: 38
First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.

Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).

#3
July 8th, 2012, 05:41 AM
 jepot
Registered User

Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Quote:
 Originally Posted by sepp2k1 First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance. Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).

Hi, I would like to use the method you have suggested. How would I be able to implement this on my script?

#4
July 8th, 2012, 06:14 AM
 sepp2k1
Contributing User

Join Date: Apr 2012
Posts: 75
Time spent in forums: 1 Day 23 h 21 m 30 sec
Reputation Power: 38
To implement the Levenshtein distance there's pseudo code on the wiki article I linked to (and I'm sure there's a python implementation flying around somewhere on the web as well). The rest should be straight forward.

#5
July 8th, 2012, 08:33 PM
 jepot
Registered User

Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
is the difflib module using the Levenshtein distance method?

#6
July 9th, 2012, 07:07 AM
 sepp2k1
Contributing User

Join Date: Apr 2012
Posts: 75
Time spent in forums: 1 Day 23 h 21 m 30 sec
Reputation Power: 38
No, but that doesn't mean you can't use it instead. Using the Levenshtein distance was only a suggestion. It depends on what you want the results to look like really.

#7
July 12th, 2012, 06:13 AM
 jepot
Registered User

Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Thanks, problem solved

 Viewing: Dev Shed Forums > Programming Languages > Python Programming > Complex string matching based on list items