#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    17
    Rep Power
    0

    Complex string matching based on list items


    I have two lists like:

    Code:
    list1 = ['hello', 'are', 'you', 'there'] #list of correct strings
    and

    Code:
    list2 = ['There1', '1 are 1', 'helloooo', 'you'] #list of random strings that have similarity to list1's items but not in order
    list1 and list2 have the same number of items all the time.

    I would like to fix the items in list2 by finding a closest match for each item in list2 from all the items in list1 and substituting that closest matching word to list2 instead, or adding it to a new list and returning that list. Since list1 and list2 have the same number of items all the time, I would also like to make it not possible to have two items from list2 match to a single item in list1 - ie each item in list2 corresponds to a unique item in list1. list1 will never have 2 or more identical words. A possible output (if item matches are added to a new list):

    Code:
    newlist = ['there', 'are', 'hello', 'you']
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    83
    Rep Power
    39
    First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.

    Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    17
    Rep Power
    0
    Originally Posted by sepp2k1
    First you need to decide how exactly you define the closeness of two strings. One approach would be to use the Levenshtein distance.

    Once you've decided on that and implemented your chosen metric in Python, you can simply iterate over the items in list2 and for each item find the closest item in list1 (according to your chosen metric). After you choose an item, set it to None in list1, so it isn't chosen again. (If list1 should not be mutated, make a copy of it at the beginning of the method).
    Hi, I would like to use the method you have suggested. How would I be able to implement this on my script?
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    83
    Rep Power
    39
    To implement the Levenshtein distance there's pseudo code on the wiki article I linked to (and I'm sure there's a python implementation flying around somewhere on the web as well). The rest should be straight forward.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    17
    Rep Power
    0
    is the difflib module using the Levenshtein distance method?
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    83
    Rep Power
    39
    No, but that doesn't mean you can't use it instead. Using the Levenshtein distance was only a suggestion. It depends on what you want the results to look like really.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2008
    Posts
    17
    Rep Power
    0
    Thanks, problem solved

IMN logo majestic logo threadwatch logo seochat tools logo