#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    15
    Rep Power
    0

    Test for repeated items in a delimited list


    Given a string of items separated by a delimiter character, how can I test to see if any of those items appears more than once anywhere in the list? For example:

    Michael Jackson/Phil Collins/Peter Gabriel/Lionel Ritchie/George/Phil Collins/Michael/Don Henley

    ... should be seen as having a duplicate (Phil Collins).

    Delimiters can be added to the start and end of the lists and/or doubled-up (e.g. //) if that helps.

    Thanks.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    I probably would not use regexes for that. Split your line, store the elements in an array (or a map, a hash table, a dictionary, whatever data structure is the best in PHP) and check for duplicate entries.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    15
    Rep Power
    0
    Certainly an option - it just felt like something that ought to be achieveable with a single call to preg_match(). If not, then fair enough.

    Thanks for the reply.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    The point is that is is fairly easy to apply a regaex so that you have:
    Michael Jackson -> 1st match ($1 or /1)
    Phil Collins -> 2nd match
    Peter Gabriel -> 3rd match
    Lionel Ritchie -> 4th match.

    Let's stop here, we know there hasn't been any duplicate so far and one is coming (although the program does not know either, so let see how it can find out).

    Now, you get to the first duplicate, Phil Collins. How do you know it is a duplicate? You just can't unless you are prepared to go through every single match so far, and that's where it becomes difficult. Within a regex, you dont really know how many matches you made so far, so you can't really loop on those matches. And even if you did know how to do it, you'd have no other choice that just trying every single match so far. So, for every new word (here, for the sake of the discussion, "Phil Collins" is a single word in my mind), you have to check all the words recorded so far. This is quickly becoming very unefficient.

    Whereas if you use a hash table (or whatever data structure, I would use a hash in Perl), checking whether you have already met that word is very fast.

    If you know about the "big Oh" notation, we can say that checking for one word in a list is a search in O(n) complexuty (with n the number of words so far). Doing it for each new word has a complexity of O(nē);. Searching a word in a hash is a search in 0(1); searching over for every new word has a complexity of O(n). The difference between O(n) and 0(nē) is very significant, it is actually a huge difference as soon as you start to have more than a few words..

    So in brief, regex are impractical for this type of problem (as far as I can say, I may miss the clever idea that does it), but if you found a practical way to use them, they would be extremely likely to be very unefficient compared to the hash table solution.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    15
    Rep Power
    0
    Point taken - I am already proceeding with a hash-based approach.

    Thanks very much.

IMN logo majestic logo threadwatch logo seochat tools logo