#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2017
    Posts
    4
    Rep Power
    0

    How to be really lazy?


    Hi!

    I have a string which is inspired by Markdown:
    Code:
    [Text](/command)
    I get the "Text" and "/command" with this regex:
    Code:
    \[(.+?)\]\((.+?)\)
    That works.

    But the problem arises when I have a string like this:
    Code:
    [foo] [Text](/command)
    Then "foo] [Text" is matched instead of "Text". I tried it with the lazy operator "?" but that doesn't seem to work in this case. I can observe some lazy behavior but the start of the match doesn't move to the right to make it more lazy.
    How to solve this?

    Thanks in advance!
    RandomByte
  2. #2
  3. Maddening Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,453
    Rep Power
    9645
    Ironically, when it comes to writing regular expressions, don't be lazy.

    Rethink what the regex does. Instead of matching "any amount of anything" in those two places, don't you want to be more precise? What kinds of characters do you want to allow? More to the point, what kinds of characters do you not want to allow?
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2017
    Posts
    4
    Rep Power
    0
    Thanks for the response!

    Well, to make things easier I could simply forbid square brackets in the text part. Otherwise there can be all printable characters. Also the command part has no restrictions.
    I hoped for a solution that forces the regex matcher to prefer ungreedy matches of the text part nearby the command part. But as I understand now, this isn't ambiguously.
  6. #4
  7. Maddening Moderator
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    16,453
    Rep Power
    9645
    Originally Posted by RandomByte
    I hoped for a solution that forces the regex matcher to prefer ungreedy matches of the text part nearby the command part.
    A regex always tries to match - greedy or ungreedy only affects whether the engine goes too far and backtracks (greedy) or starts short and grows as necessary (ungreedy).

    There is, in fact, a way to force the regex to match the way you want:
    Code:
    \[((?:(?!(?R)).)*?)\]\(((?:(?!(?R)).)*?)\)
    It works by reading one character at a time (.) and requiring that the regex is not able to match (?!) recursively (?R) at each position.

    Problem is it's terribly inefficient (234 steps, 29ms) so restricting the regex to not allow []s and ()s is better (47 steps, 1ms). If you absolutely could not do that then your hand would be forced, but I've never seen a situation where that was the case.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2017
    Posts
    4
    Rep Power
    0
    Originally Posted by requinix
    Problem is it's terribly inefficient (234 steps, 29ms)...
    Apart from a cool looking solution, yes, it is inefficient. I have taken some strings from the application that this Regex will be used in: 5000-8000 steps, depending how the user configures it.
    I will take your advice and just forbid [] and () in the respective part.

    My final Regex is:
    Code:
    \[([^\[\]]+)\]\(([^\(\)]+)\)
    I also removed the ungreediness(?), which also cut down the steps to find all matches tremendously.

IMN logo majestic logo threadwatch logo seochat tools logo