#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    6
    Rep Power
    0

    A simple question - match strings that don't contain a word


    So I am trying to build an algorithm that I can use in my topology course. I'll try to explain the problem thoroughly so it will make sense.


    Pretty much, I have a sequence of edges, where each distinct edge is represented by a number, and each edge has an orientation. A "regular" forward orientation is the default, so I don't put any symbol to indicate it, but if the edge has a "reverse" orientation I put an r after the edge number. So my incoming codes look something like this:

    1 4 5r 5 4r 4 1r

    In this example, this means I first have an edge called 1, then an edge called 4 (both with the default forward orientation), then an edge 5 with the reverse orientation, then an edge 5 with the forward orientation, then another edge 4, this time with the reverse orientation, then another 4 with the forward orientation, and finally an edge 1 with the reverse orientation.

    Now, say I have an edge and an orientation represented by the scalar $e (so $e = "5" or maybe $e = "4r"). I want to be able to locate strings of the following form:

    $e X $e

    where X represents an arbitrary sequence of edges NOT containing $e. (Also, I don't care where this happens in the string, just that the pattern occurs at some point.)


    So for example, if $e = 4r, the following should match:

    1 2 4r 5 8r 4 4r

    but the following should NOT match

    1 2 4r 4r 5r 4

    (you see, I need something separating the matching edges)


    For the life of me I just can't get this to work.


    I'm down to using

    /\b($e)\b\s([^$e]\s)+\b($e)\b/

    and I don't understand why it doesn't work

    So this should match $e. I put word boundaries (\b) around it so that, say, 12 wouldn't match on 125 and 4 wouldn't match on 4r. Then comes a whitespace to move to the next edge. Then ([^$e]\s)+ is meant to say "any word except $e, followed by a whitespace", repeated an arbitrary number of times, until finally I get another matching $e. (Also note: I put parens around $e just so I could save the matches as variables for debugging, to figure out what I'm doing wrong. I know that shouldn't influence whether this matches or not.)


    What is wrong with the regex code I'm using? I'm so new to this and I just don't understand what I'm doing wrong.
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    13,959
    Rep Power
    9397
    You can't use character sets for this: it's not the individual characters you want to match but the "token".

    First write the expression in terms of those edge tokens.
    Code:
    /\b$e\b (\S+ )+\b$e\b/
    That will find $e and a space, then one or more of (non-spaces followed by a space), followed by $e again. You could be more explicit than \S but I didn't think there was a need to bother with it.

    That \S+ matches an individual edge but it shouldn't be $e itself. Use a negative assertion.
    Code:
    /\b$e\b ((?!$e\b)\S+ )+\b$e\b/
    The extra \b in there is for the 12/125 thing you realized.
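    To try this against the two sample strings from the first post, here is a minimal sketch. The helper name has_separated_pair is made up for illustration, and the \Q...\E quoting around $e is an added precaution (an assumption, not part of the expression above) so that $e is always matched literally. The last line runs the original character-class attempt for comparison.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if $path contains
# "$e X $e", where X is one or more edges, none of which is $e itself.
sub has_separated_pair {
    my ($e, $path) = @_;
    # \Q...\E quotes $e so it is matched literally (added precaution).
    return $path =~ /\b\Q$e\E\b ((?!\Q$e\E\b)\S+ )+\b\Q$e\E\b/ ? 1 : 0;
}

# The original attempt: [^$e] is a character class, so with $e = "4r"
# it means "one character that is neither 4 nor r", not "an edge other than 4r".
sub original_attempt {
    my ($e, $path) = @_;
    return $path =~ /\b($e)\b\s([^$e]\s)+\b($e)\b/ ? 1 : 0;
}

print has_separated_pair('4r', '1 2 4r 5 8r 4 4r'), "\n";  # prints 1
print has_separated_pair('4r', '1 2 4r 4r 5r 4'),   "\n";  # prints 0
print original_attempt('4r', '1 2 4r 5 8r 4 4r'),   "\n";  # prints 0
```

    The character-class version gets as far as "5 " and then trips on "8r", because after the single character 8 it demands whitespace and finds r instead.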
  4. #3
    requinix, thank you so very much for this response. This makes a lot of sense. So I want to be clear: the character classes only work on specific characters, not an entire string?

    I just want to clarify one more thing:

    You wrote this:

    ((?!$e\b)\S+ )+

    So what if $e = "1" and I'm at the edge "21"? I want "21" to count. But won't this disregard "21" because it sees the "1" with a word boundary behind it, and so it throws it out? Don't I need ((?!\b$e\b)\S+ )+ or am I understanding this improperly?


    Also, just out of curiosity, it seems from your example that if I actually put the whitespace in the regex pattern, it's doing the same thing as saying \s.

    (what I mean is, you wrote /\b$e\b (\S+ )+\b$e\b/, and I guess this is just like writing /\b$e\b\s(\S+\s)+\b$e\b/ I totally didn't realize this until right now!)

    Is there ever a benefit to just using the \s instead? Or is it there so say you can write something like \s+ (you know in the situation where you don't know how many to expect)

    Thanks again, this post has been so helpful.
  6. #4
    Originally Posted by lovinPerl
    So I want to be clear: the character classes only work on specific characters, not an entire string?
    Right. If you had
    Code:
    ^[^dairy]*$
    it won't match "dairy" or "raid" or "yard" or "ddddd" or anything containing any of those letters.
    Lookahead assertions, (?=...) and (?!...), are the "string" equivalent, and you can put entire expressions in there, not just simple strings.
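    A small sketch of the difference, using made-up test words: the character class rejects anything that shares a letter with "dairy", while a negative lookahead rejects only strings containing the whole word.

```perl
use strict;
use warnings;

my @words = qw(dairy raid yard ddddd fun);

# Character class: passes only strings with none of d, a, i, r, y anywhere.
my @class_hits = grep { /^[^dairy]*$/ } @words;

# Negative lookahead: passes every string that does not contain "dairy".
my @word_hits = grep { /^(?!.*dairy).*$/ } @words;

print "@class_hits\n";  # prints: fun
print "@word_hits\n";   # prints: raid yard ddddd fun
```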

    Originally Posted by lovinPerl
    I just want to clarify one more thing:

    You wrote this:

    ((?!$e\b)\S+ )+

    So what if $e = "1" and I'm at the edge "21"? I want "21" to count. But won't this disregard "21" because it sees the "1" with a word boundary behind it, and so it throws it out? Don't I need ((?!\b$e\b)\S+ )+ or am I understanding this improperly?
    You're looking in the right direction but there's one more thing to consider: what came before that assertion. There will always be a space before, coming from either the space before the parentheses (matching the group the first time) or the space at the end of the parentheses (matching the group a second or later time).

    Imagine expanding the repetition out a bit:
    Code:
    /\b$e\b ((?!$e\b)\S+ )+\b$e\b/
    expands to
    /\b$e\b (?!$e\b)\S+ ((?!$e\b)\S+ )+\b$e\b/
    and again to
    /\b$e\b (?!$e\b)\S+ (?!$e\b)\S+ ((?!$e\b)\S+ )+\b$e\b/
    and so on
    (with each expansion requiring an additional edge between the two ends)

    It also makes use of the fact that word boundaries can only possibly occur just before or just after an edge; if they could appear in the middle (like if you used a symbol instead of "r" to indicate a reversal) then this expression wouldn't be good enough.
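    Both points can be checked with a short sketch. The helper name and the sample strings are made up for illustration, and the "-" reversal marker is a hypothetical notation, not the one used in this thread.

```perl
use strict;
use warnings;

# Same pattern as above, wrapped in a hypothetical helper.
sub has_separated_pair {
    my ($e, $path) = @_;
    return $path =~ /\b$e\b ((?!$e\b)\S+ )+\b$e\b/ ? 1 : 0;
}

# "21" is accepted between the two 1s: the lookahead always starts right
# after a space, so it is tested against "21", not against the trailing "1".
print has_separated_pair('1', '3 1 21 1 7'), "\n";  # prints 1

# Hypothetical notation with "-" for reversal instead of "r".
# "4-" is a different edge from "4", but there is a word boundary between
# "4" and "-", so (?!4\b) wrongly rejects "4-" as a separating edge.
print has_separated_pair('4', '4 4- 4'), "\n";      # prints 0
```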

    Originally Posted by lovinPerl
    Also, just out of curiosity, it seems from your example that if I actually put the whitespace in the regex pattern, it's doing the same thing as saying \s.
    \s represents any whitespace at all: spaces, tabs, vertical tabs, carriage returns, newlines... a couple others I think. Since I don't think you have to deal with anything besides spaces I just put in literal spaces: fewer backslashes to sift through and easier to read.

    (and \S is the opposite of \s: any character that's not whitespace)

    So continuing that train of thought, maybe you're matching against some simple HTML. There, any amount of practically any type of whitespace is treated the same, so \s+ is the better fit. If you use just a literal space then that's all you get: no tabs, no newline characters.
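    As a rough illustration with the edge strings from this thread, suppose the edges arrived separated by tabs (an assumption made only for this example). The literal-space pattern stops matching, while a \s+ variant still works:

```perl
use strict;
use warnings;

my $e      = '4r';
my $tabbed = "4r\t5\t4r";   # edges separated by tabs, not spaces

# Literal spaces in the pattern match only the space character.
my $literal = $tabbed =~ /\b$e\b ((?!$e\b)\S+ )+\b$e\b/ ? 1 : 0;

# \s+ accepts any run of whitespace between edges.
my $flexible = $tabbed =~ /\b$e\b\s+((?!$e\b)\S+\s+)+\b$e\b/ ? 1 : 0;

print "$literal $flexible\n";  # prints: 0 1
```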
    Last edited by requinix; February 21st, 2013 at 12:00 AM.
  8. #5
    Thanks very much for taking the time to give such detailed responses. It all makes perfect sense now and I really feel I learned something. I feel kind of dumb with regex, but slowly I'm learning. It seems like such an insanely powerful tool.
