The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages
> Python Programming
|
Complex string matching based on list items
Discuss Complex string matching based on list items in the Python Programming forum on Dev Shed. Complex string matching based on list items Python Programming forum discussing coding techniques, tips and tricks, and Zope related information. Python was designed from the ground up to be a completely object-oriented programming language.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

July 8th, 2012, 03:50 AM
|
|
Registered User
|
|
Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
|
|
|
Complex string matching based on list items
I have two lists like:
Code:
list1 = ['hello', 'are', 'you', 'there'] #list of correct strings
and
Code:
list2 = ['There1', '1 are 1', 'helloooo', 'you'] #list of random strings that have similarity to list1's items but not in order
list1 and list2 have the same number of items all the time.
I would like to fix the items in list2 by finding a closest match for each item in list2 from all the items in list1 and substituting that closest matching word to list2 instead, or adding it to a new list and returning that list. Since list1 and list2 have the same number of items all the time, I would also like to make it not possible to have two items from list2 match to a single item in list1 - ie each item in list2 corresponds to a unique item in list1. list1 will never have 2 or more identical words. A possible output (if item matches are added to a new list):
Code:
newlist = ['there', 'are', 'hello', 'you']
|

July 8th, 2012, 04:27 AM
|
|
|
First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.
Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).
|

July 8th, 2012, 04:41 AM
|
|
Registered User
|
|
Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
|
|
Quote: | Originally Posted by sepp2k1 First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.
Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method). |
Hi, I would like to use the method you have suggested. How would I be able to implement this on my script?
|

July 8th, 2012, 05:14 AM
|
|
|
|
To implement the Levenshtein distance there's pseudo code on the wiki article I linked to (and I'm sure there's a python implementation flying around somewhere on the web as well). The rest should be straight forward.
|

July 8th, 2012, 07:33 PM
|
|
Registered User
|
|
Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
|
|
|
is the difflib module using the Levenshtein distance method?
|

July 9th, 2012, 06:07 AM
|
|
|
|
No, but that doesn't mean you can't use it instead. Using the Levenshtein distance was only a suggestion. It depends on what you want the results to look like really.
|

July 12th, 2012, 05:13 AM
|
|
Registered User
|
|
Join Date: Dec 2008
Posts: 17
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
|
|
|
Thanks, problem solved
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|