October 14th, 2008, 11:28 PM
Pulling my hair out..what little I have left! :)
Hi all- Newbie here.
I have a string as follows (for example): "I have visited the U.S., Puerto Rico, Canada, U.S. Virgin Islands and the Carribean."
I want to extract just the country names from that sentence, so: 1) U.S. 2) Puerto Rico 3) Canada 4) U.S. Virgin Islands and 5) Carribean.
I have something like: (United States|U\.S\.|Canada|Puerto Rico|U\.S\. Virgin Islands|Carribean)(,|\s|and|the)+
This matches the following: "U.S., Puerto Rico, Canada, U.S. Virgin Islands and the Carribean."
The problem I have is that it matches U.S. twice when it should only match once: the second time it extracts "U.S." out of "U.S. Virgin Islands".
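For anyone who wants to reproduce the duplicate, here is a rough sketch. It assumes Python's `re` module (the thread doesn't say which engine is in use) and drops the trailing separator group to keep things short; the variable names are made up for illustration.

```python
import re

text = ("I have visited the U.S., Puerto Rico, Canada, "
        "U.S. Virgin Islands and the Carribean.")

# Same alternatives, in the same order, as the pattern above.
# "U\.S\." is listed before "U\.S\. Virgin Islands", so it is tried first.
naive = re.compile(
    r"United States|U\.S\.|Canada|Puerto Rico|U\.S\. Virgin Islands|Carribean"
)

naive_matches = naive.findall(text)
print(naive_matches)
# ['U.S.', 'Puerto Rico', 'Canada', 'U.S.', 'Carribean']
```

The fourth match is the short "U.S." because that alternative succeeds before the longer one is ever attempted.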
There must be a nicer way of doing this. It sucks to be a newb!
Thanks for any help.
October 15th, 2008, 01:16 AM
*cough* The Carribean isn't a country
Can't you just ignore the extra match?
(United States|U\.S\.(?! Virgin Islands)|Canada|Puerto Rico|U\.S\. Virgin Islands|Carribean)
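A quick way to check the pattern above is sketched below, assuming Python's `re` module (any engine that supports negative lookahead should behave the same way). The `(?! Virgin Islands)` part makes the short "U.S." alternative fail whenever " Virgin Islands" follows, so the engine falls through to the longer alternative. The variable names are just for illustration.

```python
import re

text = ("I have visited the U.S., Puerto Rico, Canada, "
        "U.S. Virgin Islands and the Carribean.")

# The short "U.S." alternative carries a negative lookahead, so it refuses
# to match when it is really the start of "U.S. Virgin Islands".
lookahead = re.compile(
    r"(United States|U\.S\.(?! Virgin Islands)|Canada|Puerto Rico"
    r"|U\.S\. Virgin Islands|Carribean)"
)

# With a single capturing group, findall returns the text of that group.
lookahead_matches = lookahead.findall(text)
print(lookahead_matches)
# ['U.S.', 'Puerto Rico', 'Canada', 'U.S. Virgin Islands', 'Carribean']
```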
October 15th, 2008, 11:30 AM
Thanks... I knew there was a simple way of doing this. I was overcomplicating it! I'm probably going to have lots more questions to post on this board, hopefully ones that aren't so easy.
Originally Posted by requinix
P.S. Can anyone recommend a good regex editor? I installed a trial version of RegexBuddy on Vista and it got corrupted after the first use! Is there any freeware out there?
October 15th, 2008, 11:32 AM
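For matches such as these, if some of your alternatives are extensions of others (e.g. "U.S. Virgin Islands" contains "U.S." and then some more text), you need to put the longer one first. Perl-style engines try the alternatives left to right and stop at the first one that succeeds, so the longer one has to come before the shorter one to get a chance to match.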
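One way to avoid ordering the alternatives by hand is to build the pattern from a list sorted longest first. This is only a sketch, assuming Python's `re` module; the list and the variable names are hypothetical, not something from the posts above.

```python
import re

text = ("I have visited the U.S., Puerto Rico, Canada, "
        "U.S. Virgin Islands and the Carribean.")

# Hypothetical list of names to look for, in no particular order.
names = ["United States", "U.S.", "Canada", "Puerto Rico",
         "U.S. Virgin Islands", "Carribean"]

# Sort longest first so an extended name is always tried before its prefix,
# and escape each name so the dots are matched literally.
ordered = sorted(names, key=len, reverse=True)
longest_first = re.compile("|".join(re.escape(name) for name in ordered))

ordered_matches = longest_first.findall(text)
print(ordered_matches)
# ['U.S.', 'Puerto Rico', 'Canada', 'U.S. Virgin Islands', 'Carribean']
```

Sorting by length is a blunt rule, but it guarantees that any name which begins with another name comes earlier in the alternation, and no lookahead is needed.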