
November 19th, 2012, 06:28 AM
|
|
|
|
The point is that is is fairly easy to apply a regaex so that you have:
Michael Jackson -> 1st match ($1 or /1)
Phil Collins -> 2nd match
Peter Gabriel -> 3rd match
Lionel Ritchie -> 4th match.
Let's stop here, we know there hasn't been any duplicate so far and one is coming (although the program does not know either, so let see how it can find out).
Now, you get to the first duplicate, Phil Collins. How do you know it is a duplicate? You just can't unless you are prepared to go through every single match so far, and that's where it becomes difficult. Within a regex, you dont really know how many matches you made so far, so you can't really loop on those matches. And even if you did know how to do it, you'd have no other choice that just trying every single match so far. So, for every new word (here, for the sake of the discussion, "Phil Collins" is a single word in my mind), you have to check all the words recorded so far. This is quickly becoming very unefficient.
Whereas if you use a hash table (or whatever data structure, I would use a hash in Perl), checking whether you have already met that word is very fast.
If you know about the "big Oh" notation, we can say that checking for one word in a list is a search in O(n) complexuty (with n the number of words so far). Doing it for each new word has a complexity of O(nē);. Searching a word in a hash is a search in 0(1); searching over for every new word has a complexity of O(n). The difference between O(n) and 0(nē) is very significant, it is actually a huge difference as soon as you start to have more than a few words..
So in brief, regex are impractical for this type of problem (as far as I can say, I may miss the clever idea that does it), but if you found a practical way to use them, they would be extremely likely to be very unefficient compared to the hash table solution.
|