November 18th, 2012, 12:15 PM
Test for repeated items in a delimited list
Given a string of items separated by a delimiter character, how can I test to see if any of those items appears more than once anywhere in the list? For example:
Michael Jackson/Phil Collins/Peter Gabriel/Lionel Ritchie/George/Phil Collins/Michael/Don Henley
... should be seen as having a duplicate (Phil Collins).
Delimiters can be added to the start and end of the lists and/or doubled-up (e.g. //) if that helps.
November 18th, 2012, 12:37 PM
I probably would not use regexes for that. Split your line, store the elements in an array (or a map, a hash table, a dictionary, whatever data structure is the best in PHP) and check for duplicate entries.
November 18th, 2012, 12:45 PM
Certainly an option - it just felt like something that ought to be achieveable with a single call to preg_match(). If not, then fair enough.
Thanks for the reply.
November 19th, 2012, 06:28 AM
The point is that is is fairly easy to apply a regaex so that you have:
Michael Jackson -> 1st match ($1 or /1)
Phil Collins -> 2nd match
Peter Gabriel -> 3rd match
Lionel Ritchie -> 4th match.
Let's stop here, we know there hasn't been any duplicate so far and one is coming (although the program does not know either, so let see how it can find out).
Now, you get to the first duplicate, Phil Collins. How do you know it is a duplicate? You just can't unless you are prepared to go through every single match so far, and that's where it becomes difficult. Within a regex, you dont really know how many matches you made so far, so you can't really loop on those matches. And even if you did know how to do it, you'd have no other choice that just trying every single match so far. So, for every new word (here, for the sake of the discussion, "Phil Collins" is a single word in my mind), you have to check all the words recorded so far. This is quickly becoming very unefficient.
Whereas if you use a hash table (or whatever data structure, I would use a hash in Perl), checking whether you have already met that word is very fast.
If you know about the "big Oh" notation, we can say that checking for one word in a list is a search in O(n) complexuty (with n the number of words so far). Doing it for each new word has a complexity of O(nē);. Searching a word in a hash is a search in 0(1); searching over for every new word has a complexity of O(n). The difference between O(n) and 0(nē) is very significant, it is actually a huge difference as soon as you start to have more than a few words..
So in brief, regex are impractical for this type of problem (as far as I can say, I may miss the clever idea that does it), but if you found a practical way to use them, they would be extremely likely to be very unefficient compared to the hash table solution.
November 19th, 2012, 06:34 AM
Point taken - I am already proceeding with a hash-based approach.
Thanks very much.