#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    3
    Rep Power
    0

    Regular Expression Being Too Lazy


    Regular Expression:
    ^((Mass|Boston|.+?),(Brigham and Women's Hospital|Boston|Brigham and Women's|.+?),.+?,((Boston|ma)+|.+?),.+)

    Text to Match Against:
    Country,City,AccentCity,Region,Population,Latitude,Longitude
    us,booth corner,Booth Corner,MA,,41.5875000,-71.0888889
    us,boston,Boston,MA,571281,42.3583333,-71.0602778
    us,bourne,Bourne,MA,34065,41.7411111,-70.5994444
    Basically, in each group I want to say "If you can't find one of THESE, then just find ANYTHING", but as soon as I add ".+?" as an alternative in any of them, regex becomes lazy and takes that alternative straight away. I can't seem to set any precedence within the OR statement: in the last one, ((Boston|ma)+|.+?), there clearly is a line with "MA" in it, but regex simply says, "Oh to hell with that, let's just get anything!"

    How do I make it grab those it can find first, and if it can't find any of those, to just get anything?

    Invariably, it returns the first line of the text.

    If you remove the ".+?" from the "(Boston|ma)" section, then it matches the second line instead, because it CAN find an "MA" in that string.

    What I want: "If you can't find any of these, then match anything"
    What I get: "Match anything"
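    A minimal sketch reproducing the behaviour described above, assuming Python's re module (the thread does not say which engine is in use) with multi-line and case-insensitive matching switched on. The names TEXT, WITH_FALLBACK and WITHOUT_FALLBACK are made up for illustration.

```python
import re

TEXT = (
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "us,booth corner,Booth Corner,MA,,41.5875000,-71.0888889\n"
    "us,boston,Boston,MA,571281,42.3583333,-71.0602778\n"
    "us,bourne,Bourne,MA,34065,41.7411111,-70.5994444"
)
FLAGS = re.MULTILINE | re.IGNORECASE  # assumption: flags are not stated in the post

# The expression from the post, unchanged.
WITH_FALLBACK = re.compile(
    r"^((Mass|Boston|.+?),(Brigham and Women's Hospital|Boston|Brigham and Women's|.+?),"
    r".+?,((Boston|ma)+|.+?),.+)",
    FLAGS,
)

# The same expression with ".+?" removed from the region group only.
WITHOUT_FALLBACK = re.compile(
    r"^((Mass|Boston|.+?),(Brigham and Women's Hospital|Boston|Brigham and Women's|.+?),"
    r".+?,((Boston|ma)+),.+)",
    FLAGS,
)

first = WITH_FALLBACK.search(TEXT)      # every group can fall back to ".+?"
second = WITHOUT_FALLBACK.search(TEXT)  # the region group must now be Boston or MA

print(first.group(1))   # the header line
print(second.group(1))  # the "booth corner" line
```

    With the fallback in every group, the header line already satisfies the whole expression, so the search never gets as far as the Boston line.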
  2. #2
  3. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    94
    Hi Wildhoney,

    Let's say I show you a box with a drink in it. I don't tell you what the drink is, and ask you if you would like it.
    You tell me "if it's cranberry juice, give it to me; if it's vodka, give it to me; if it's anything else, also give it to me". Why not just say "give me whatever is in the box"?

    That's what you're doing with three of your five capture groups: "capture {specific_string OR anything} before the next comma".
    If you're going to capture anything anyway, why not just say "capture anything"?

    Not really sure what you're trying to achieve, so can't tell you much more at this stage, but if you're trying to get the strings between the commas, a lot of people are fond of preg_split() and even fonder of explode().
    [EDIT: ...if you're using php]

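    Small technical pointer: it's usually faster to match [^,]+, than .+?, because the negated class consumes the whole field in one go, while the lazy dot has to check for the comma after every character. The negated class also can never run across a comma, which the lazy dot will do if the rest of the expression forces it to backtrack.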
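    A small sketch of the difference, assuming Python's re module rather than the PHP functions mentioned above (str.split plays the role of explode() here). The variable names are illustrative only.

```python
import re

line = "us,booth corner,Booth Corner,MA,,41.5875000,-71.0888889"

# Plain split: keeps the empty Population field as an empty string.
fields = line.split(",")

# Negated class: each match is a maximal run of non-comma characters,
# so the empty field simply produces no match.
non_empty = re.findall(r"[^,]+", line)

# On a simple "first field" match both patterns agree...
lazy_first = re.match(r"(.+?),", line).group(1)
negated_first = re.match(r"([^,]+),", line).group(1)

# ...but when the rest of the pattern forces backtracking, the lazy dot
# stretches across commas, while the negated class refuses to.
lazy_spill = re.match(r"(.+?),MA", line).group(1)
negated_spill = re.match(r"([^,]+),MA", line)

print(fields)
print(non_empty)
print(lazy_spill)
print(negated_spill)
```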

    Wishing you a fun week.
    Last edited by ragax; April 2nd, 2012 at 12:55 AM. Reason: Clarified that one piece of information is php-related.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    3
    Rep Power
    0
    Thank you for your reply!

    Unfortunately I'm not really after anything. Rather, I have preferences. My first choice would be cranberry juice, with vodka coming in at a close second. However, if you have none of these, then anything will do. To me it's illogical to simply assume I want anything and disregard my preferences.

    Nevertheless, the reason I would like this is that I'm writing a class that will have 3 details: country, state, and city. However, I won't necessarily have all of these items, so I pass in what I do have and let the regular expression do the rest.

    I must add, though, that I've now converted the TXT document that was being parsed with the regular expression into SQLite format, and with SQL we still have the same problem when I add OR statements in the same fashion as above. I have had to employ a cunning ORDER BY statement, but naturally shuffling a large dataset like this one, which is in excess of one million records, isn't the wisest thing.

    So it seems like SQLite as well, by default, is lazy like regex. If it can get away with finding anything, then it will, which is a failure to recognise my preferences, even though I may not be particular when you don't have cranberry or vodka.
  6. #4
  7. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    94
    Hi Wildhoney,

    Thanks for your detailed reply.
    What it shows me most clearly is that I didn't understand your initial request---and that I still don't understand what you are trying to accomplish.

    Given the "Text to Match Against" you have in your first post, can you elaborate on what you would want a regex-based function to return for you?

    Late here but someone on the other side of the world will probably have the answer while I sleep.

    Wishing you a beautiful day.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Posts
    3
    Rep Power
    0
    Haha! You should be sleeping. This was driving me absolutely crazy even when I was wide awake -- when feeling tired it would be infinitely worse.

    By the way, thank you very much for your help so far.

    In my opinion, which differs from regex's opinion, it should return the third line:
    (us,boston,Boston,MA,571281,42.3583333,-71.0602778).
    The reason for this is that my regex is composed of three parts: the country, the city, and the region/state. In the first part, the country is neither Mass nor Boston, but it is anything. In the second part, the city is Boston. In the third part, the region/state is MA. Therefore, the third line in my text to match against is the best match.

    Essentially, the match anything should be the last resort. It's not enough to simply make the bracket-encapsulated string optional by issuing the ? character afterwards, because this would disregard that part of the data entirely. I would just like regex to consider my preferences, otherwise it should consider that anything will do.

    As far as I can see, regex is disregarding my preferences. It's being completely lazy in just saying, "Okay, anything will do, so let's get anything". This is wrong (or my syntax is wrong), because I have explicitly specified my preferences (cranberry and vodka). Correct me if I am wrong, but I was under the impression that the order of the OR values implied precedence. Therefore, as soon as regex found the first match in the OR statement, it would break from that OR segment and move on to the next one. Only AND statements would consider all of the items in the segment.

    In PHP if we add an OR in a conditional statement, then if the first variable is true, then the second variable will not be considered. This, as far as I understand it from a programmer's perspective, is the correct behaviour:

    if (true || false) echo 'Voila!';
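    A short sketch of how this plays out in a backtracking engine, assuming Python's re module (PCRE-style engines order their alternatives the same way). The order of the alternatives does matter at a given position, but the engine settles on the earliest starting position where the whole pattern can succeed, and that takes priority over which alternative was used. The sample strings are made up for illustration.

```python
import re

# At one position, alternatives are tried left to right and the first
# one that lets the match succeed wins, even if a later one is longer.
ordered = re.match(r"(Bos|Boston)", "Boston").group(1)

# The search stops at the earliest position where ANY alternative works.
# Here the fallback [^,]+ succeeds at position 0, so the preferred term
# further along the string is never reached.
first = re.search(r"(?:(boston)|[^,]+),", "us,boston,Boston", re.IGNORECASE)

print(ordered)         # Bos
print(first.group(0))  # us,
print(first.group(1))  # None: the preferred alternative did not take part
```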
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Posts
    29
    Rep Power
    0
    Hello, Wildhoney,

    If you are trying to find "Boston" in any field, please use:

    Code:
    ^us,.*?(?<=,)boston,
    If not, please describe what you are trying to achieve. Not the details about operator precedence, but the big picture. What is the purpose of the whole regular expression? What should it find?
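    A quick sketch of what that expression does, assuming Python's re module with multi-line mode and the sample text from the first post. The lookbehind (?<=,) only allows "boston" to start directly after a comma, that is, at the start of a field. The extra test strings are made up for illustration.

```python
import re

TEXT = (
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "us,booth corner,Booth Corner,MA,,41.5875000,-71.0888889\n"
    "us,boston,Boston,MA,571281,42.3583333,-71.0602778\n"
    "us,bourne,Bourne,MA,34065,41.7411111,-70.5994444"
)

pattern = re.compile(r"^us,.*?(?<=,)boston,", re.MULTILINE)

# Only the line whose City field is "boston" matches.
match = pattern.search(TEXT)

# "boston" in a later field still matches; "boston" inside a longer
# field name does not, because the lookbehind demands a preceding comma.
later_field = pattern.search("us,x,boston,y")
inside_word = pattern.search("us,newboston,y")

print(match.group(0))
print(later_field.group(0))
print(inside_word)
```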
  12. #7
  13. Turn left at the third duck
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2011
    Location
    Nelson, NZ
    Posts
    112
    Rep Power
    94
    Hi Wildhoney,

    You are right that regex typically does not work in the way you are wishing.

    The regex engines I work with do not return a "best match". They return the first match they can find or, in a "match all" function such as PHP's preg_match_all, all the matches.

    It's up to the programmer to design the expression so that the first match is the "right" or "best" match. Sometimes that's possible directly in regex, sometimes not.

    In your expression, each line matches. Therefore, for a "plain match" operation, regex will return the first line. For a match_all, regex will return all the lines.

    Given that a regex engine works sequentially through the input, and given that you are willing to match any line if your "preferred terms" are not found, I cannot think of a simple, pure regex way of accomplishing your goals.

    But you could rank the matches programmatically without too much fuss:
    1. A "match all" function returns an array of matches
    2. You could give "special emphasis" to preferred search terms by capturing them in their own capture group: for instance,
    Code:
    (?:(coke)|(vodka)|[^,]+),
    Here, the overall group that contains the alternation is non-capturing (the ?: at the beginning of the parentheses), but if coke or vodka are matched, they are captured in Group 1 or Group 2 (assuming these are the first groups in the expression).
    3. When you examine the matches in the code, you can have a point system that adds 1 to your "match value" for each of these "preferential groups" (Group 1, Group 2, etc.) that is not empty (or is set; some engines set unmatched groups to an empty string rather than leaving them unset). You can even assign a coefficient (such as 10 or 100) to Group 2 if it's more important than other parameters.
    4. Then you sort the matches according to that match value.
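    The four steps above can be sketched as follows, assuming Python's re module in place of a PHP match-all call. The names LINE, WEIGHTS and rank_lines, the preferred terms (us, boston, ma) and the weights (1, 10, 1) are all illustrative choices, not something specified in the thread.

```python
import re

TEXT = (
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "us,booth corner,Booth Corner,MA,,41.5875000,-71.0888889\n"
    "us,boston,Boston,MA,571281,42.3583333,-71.0602778\n"
    "us,bourne,Bourne,MA,34065,41.7411111,-70.5994444"
)

# Each field with a preference puts the preferred term in its own capture
# group and falls back to "any run of non-comma characters".
LINE = re.compile(
    r"^(?:(us)|[^,]+),"      # group 1: preferred country
    r"(?:(boston)|[^,]+),"   # group 2: preferred city
    r"[^,]+,"                # accent city: no preference
    r"(?:(ma)|[^,]+),"       # group 3: preferred region
    r".*$",
    re.MULTILINE | re.IGNORECASE,
)

# Hypothetical coefficients: the city counts ten times as much.
WEIGHTS = {1: 1, 2: 10, 3: 1}


def rank_lines(text):
    """Return (score, line) pairs, best match first."""
    scored = []
    for m in LINE.finditer(text):  # step 1: collect all matches
        # steps 2 and 3: add the weight of every preferred group that took part
        score = sum(w for g, w in WEIGHTS.items() if m.group(g) is not None)
        scored.append((score, m.group(0)))
    # step 4: sort by score; the sort is stable, so ties keep input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


ranked = rank_lines(TEXT)
for score, line in ranked:
    print(score, line)
```

    Python reports a group that did not take part as None; in an engine that returns an empty string instead, the test in the scoring step would need to check for that.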

    I am sure there are other ways of dealing with this type of situation; this is the first idea that pops into my mind. Someone else may have a better idea.

    I seem to recall reading about experimental switches in PCRE's JIT compiler that return "best matches", and maybe that is how some other types of engines work.

    Curious to hear where you go with that, let us know what you decide to do.

    Wishing you a beautiful day.
    Last edited by ragax; April 2nd, 2012 at 03:24 PM. Reason: Added disclaimer words ("typically", "usually")
