Python Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming LanguagesPython Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old July 8th, 2012, 03:50 AM
jepot jepot is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2008
Posts: 17 jepot User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Complex string matching based on list items

I have two lists like:

Code:
list1 = ['hello', 'are', 'you', 'there'] #list of correct strings


and

Code:
list2 = ['There1', '1 are 1', 'helloooo', 'you'] #list of random strings that have similarity to list1's items but not in order


list1 and list2 have the same number of items all the time.

I would like to fix the items in list2 by finding a closest match for each item in list2 from all the items in list1 and substituting that closest matching word to list2 instead, or adding it to a new list and returning that list. Since list1 and list2 have the same number of items all the time, I would also like to make it not possible to have two items from list2 match to a single item in list1 - ie each item in list2 corresponds to a unique item in list1. list1 will never have 2 or more identical words. A possible output (if item matches are added to a new list):

Code:
newlist = ['there', 'are', 'hello', 'you']

Reply With Quote
  #2  
Old July 8th, 2012, 04:27 AM
sepp2k1 sepp2k1 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2012
Posts: 75 sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 23 h 20 m 52 sec
Reputation Power: 38
First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.

Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).

Reply With Quote
  #3  
Old July 8th, 2012, 04:41 AM
jepot jepot is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2008
Posts: 17 jepot User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Quote:
Originally Posted by sepp2k1
First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.

Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).


Hi, I would like to use the method you have suggested. How would I be able to implement this on my script?

Reply With Quote
  #4  
Old July 8th, 2012, 05:14 AM
sepp2k1 sepp2k1 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2012
Posts: 75 sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 23 h 20 m 52 sec
Reputation Power: 38
To implement the Levenshtein distance there's pseudo code on the wiki article I linked to (and I'm sure there's a python implementation flying around somewhere on the web as well). The rest should be straight forward.

Reply With Quote
  #5  
Old July 8th, 2012, 07:33 PM
jepot jepot is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2008
Posts: 17 jepot User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
is the difflib module using the Levenshtein distance method?

Reply With Quote
  #6  
Old July 9th, 2012, 06:07 AM
sepp2k1 sepp2k1 is offline
Contributing User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2012
Posts: 75 sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level)sepp2k1 User rank is Sergeant Major (2000 - 5000 Reputation Level) 
Time spent in forums: 1 Day 23 h 20 m 52 sec
Reputation Power: 38
No, but that doesn't mean you can't use it instead. Using the Levenshtein distance was only a suggestion. It depends on what you want the results to look like really.

Reply With Quote
  #7  
Old July 12th, 2012, 05:13 AM
jepot jepot is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2008
Posts: 17 jepot User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 5 h 17 m 23 sec
Reputation Power: 0
Thanks, problem solved

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming LanguagesPython Programming > Complex string matching based on list items

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap