#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2004
    Location
    Northern Ireland
    Posts
    59
    Rep Power
    11

    Caret: negate a string rather than a character


    Hello,
    My question is, "is it possible to negate a string?"

    My regular expression is:
    Code:
    [1-9][0-9]*\. Season (?<seasonNumber>\d+), Ep (?<episodeNumber>\d+): <.*?>(?<episodeName>.*?)<\/a>.*?Aired: <.*?>(?<month>\d+)\/(?<day>\d+)\/(?<year>\d+)
    I have a list of episodes, all one after the other.
    The structure of each episode:
    Code:
    <li>
       <div>
          <h1> 
             1. Season 1, Ep 1: <a href="link">Episode Name</a>
          </h1>
          <div class="1">
             <div class="2">
                <p>
                   Description here
                </p>
                Aired: <span>1/1/2009</span>
             </div>
          </div>
       </div>
    </li>
    However, the line Aired: <span>1/1/2009</span> is optional and may not occur.

    The regular expression matches occurrences when the aired date is present, but when it is absent, it will match the aired date of the episode below, effectively missing out the episode below and giving an episode without an air date the incorrect date.

    Is there any way I can negate the string "Season" from the ".*?" just before Aired, so that it cannot run onto the next episode?

    Something like: .*?[^'Season']
    (although obviously this does not work...)

    Thanks for any help you can give,
    Ralf
    Last edited by jedi_ralf; February 15th, 2009 at 01:48 PM.
    "True Power Lies Within The Blood Of Your Peoples Revenge... The Devils Fruit Can Lead Me There..." - Uchiha Sasuke
  2. #2
  3. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by jedi_ralf
    ...

    Is there any way I can negate the string "Season" from the ".*?" just before Aired, so that it cannot run onto the next episode?
    ...
    Yes, something like this:

    Code:
    (?:(?!Season).)*
    But perhaps a better way is to stop when it reaches "<li>" or "</li>" instead of "Season"?
  4. #3
    Originally Posted by prometheuzz
    Yes, something like this:
    Code:
    (?:(?!Season).)*
    But perhaps a better way is to stop when it reaches "<li>" or "</li>" instead of "Season"?
    That's a very good point, stopping at "</li>" makes far more sense.

    However, my new regular expression still doesn't seem to have fixed the problem:
    Code:
    [1-9][0-9]*\. Season (?<seasonNumber>\d+), Ep (?<episodeNumber>\d+): <.*?>(?<episodeName>.*?)<\/a>(?:(?!<\/li>).)*Aired: <.*?>(?<month>\d+)\/(?<day>\d+)\/(?<year>\d+)
    I don't quite understand what your addition is doing;
    Code:
    (?:(?!<\/li>).)*
    "?!" is a negative lookahead, meaning that the match cannot be followed by "</li>", right?
    I am unsure what "?:" is, but there is a "." for a character match and a "*" for multiple occurrences.

    So shouldn't it be something like:
    Code:
    (.(?!<\/li>))*
    Where a character cannot be followed by "</li>". However, this also does not work.

    Any more help you can give me is greatly appreciated!
    Ralf
  6. #4
    Originally Posted by jedi_ralf
    ... I am unsure what "?:" is,
    It's called a "non capturing group".
    See the paragraph called "Names and Numbers for Capturing Groups" from this page: http://www.regular-expressions.info/named.html

    Originally Posted by jedi_ralf
    but there is a "." for a character match and a "*" for multiple occurrences. So shouldn't it be something like:
    Code:
    (.(?!<\/li>))*
    Where a character cannot be followed by "</li>". However, this also does not work.
    It's a bit tricky, I'll give you that.
    You should place the negative look ahead in front of the DOT so that the regex engine will first check if there's an occurrence of the string you're negating, and only if it's NOT there, then match an arbitrary character. Placing the DOT in front of the look ahead will cause you to "miss" a character. This all sounds a bit confusing, so let me illustrate this with the following example:

    PHP Code:
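    <?php
    $text = 'abcdeAIREDfghij';

    // Look-ahead in front of the DOT: check first, then consume one character.
    if (preg_match('/(?:(?!AIRED).)*/', $text, $match)) {
      print_r($match);
    }

    // DOT in front of the look-ahead: consume one character, then check what follows it.
    if (preg_match('/(?:.(?!AIRED))*/', $text, $match)) {
      print_r($match);
    }
    ?>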
    Run this PHP script and see what the output is and you'll understand the difference between the two regexes.

    Originally Posted by jedi_ralf
    Any more help you can give me is greatly appreciated!
    Ralf
    Have a look at the following demo:

    PHP Code:
    <?php
    $text = <<<BLOCK
    some text to ignore
    <li>
       <div>
          <h1> 
             12. Season 8, Ep 15: <a href="link">Episode Name 2</a>
          </h1>
          <div class="1">
             <div class="2">
                <p>
                   Description here
                </p>
                Aired: <span>1/1/2009</span>
             </div>
          </div>
       </div>
    </li>
    some text to ignore
    <li>
       <div>
          <h1> 
             1. Season 1, Ep 1: <a href="link">Episode Name 1</a>
          </h1>
          <div class="1">
             <div class="2">
                <p>
                   Description here
                </p>
             </div>
          </div>
       </div>
    </li>
    some text to ignore
    BLOCK;

    $regex = '@
      \d+\.\s+SEASON\s+  (?<seasonNumber>  \d+   ) ,\s*
      EP\s+              (?<episodeNumber> \d+   ) :\s*
      <a\s[^>]*>         (?<episodeName>   [^<]* ) </a>
      (?<aired>
        (?:(?!</?li>).)*
        AIRED:\s+<span>  (?<month>         \d+   ) /
                         (?<day>           \d+   ) /
                         (?<year>          \d+   ) </span>
      )?
    @isx';

    if (preg_match_all($regex, $text, $matches, PREG_SET_ORDER)) {
      foreach ($matches as $m) {
        echo "seasonNumber  = {$m['seasonNumber']}\n";
        echo "episodeNumber = {$m['episodeNumber']}\n";
        echo "episodeName   = {$m['episodeName']}\n";
        // !empty() also covers an episode where the optional group is absent.
        if (!empty($m['aired'])) {
          echo "aired         = {$m['day']}/{$m['month']}/{$m['year']}\n";
        }
        echo "\n";
      }
    }
    ?>
    (replace the '\n' with '<br />' if you're executing the script on a web server.)

    The 'i', 's' and 'x' flags I used after the regex will do this:
    i - ignore case;
    s - let the DOT meta character also match new line characters (this is not the case by default!);
    x - ignore all white spaces, tabs and new lines in the regex pattern. This way, you can spread your pattern over multiple lines and align it nicely.
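
    To see the 's' and 'x' flags in isolation, here is a minimal sketch. The two short test strings and the variable names are made up for illustration and are not part of the demo above.

```php
<?php
// Made-up inputs, only meant to isolate each flag's effect.
$twoLines = "a\nb";
$spaced   = 'a b';

// Without "s" the DOT stops at a new line; with "s" it matches it.
$dotDefault = preg_match('/a.b/', $twoLines);
$dotAll     = preg_match('/a.b/s', $twoLines);

// With "x" a literal space in the pattern is ignored, so use \s instead.
$literalSpace = preg_match('/a b/x', $spaced);
$escapedSpace = preg_match('/a\sb/x', $spaced);

var_dump($dotDefault, $dotAll, $literalSpace, $escapedSpace);
```

    This is why the pattern above writes \s+ between the words instead of plain spaces.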

    Feel free to post back if something is unclear to you.

    Good luck.
    Last edited by prometheuzz; February 16th, 2009 at 03:56 AM.
  8. #5
    Originally Posted by prometheuzz
    It's called a "non capturing group".
    See the paragraph called "Names and Numbers for Capturing Groups" from this page: http://www.regular-expressions.info/named.html
    So, using your example for negative look ahead, "?:" is a way of using a group without saving the result. If we remove "?:" it saves the result to the next numbered group (which is inefficient as we don't need it).
    Code:
    Array
    (
        [0] => abcde
        [1] => e
    )
    Originally Posted by prometheuzz
    It's a bit tricky, I'll give you that.
    You should place the negative look ahead in front of the DOT so that the regex engine will first check if there's an occurrence of the string you're negating, and only if it's NOT there, then match an arbitrary character. Placing the DOT in front of the look ahead will cause you to "miss" a character.
    ...
    - You look for the offending string first. If it is present, stop, else match the arbitrary character.
    The other way round:
    - You look at the arbitrary character and look ahead for the offending string. If it is present, stop (and you lose the arbitrary character as it was part of the expression), else match the arbitrary character.
    Is this right? And that would be why the results are:
    Code:
    Array
    (
        [0] => abcde
    )
    Array
    (
        [0] => abcd
    )
    Originally Posted by prometheuzz
    PHP Code:
    $regex = '@
      \d+\.\s+SEASON\s+  (?<seasonNumber>  \d+   ) ,\s*
      EP\s+              (?<episodeNumber> \d+   ) :\s*
      <a\s[^>]*>         (?<episodeName>   [^<]* ) </a>
      (?<aired>
        (?:(?!</?li>).)*
        AIRED:\s+<span>  (?<month>         \d+   ) /
                         (?<day>           \d+   ) /
                         (?<year>          \d+   ) </span>
      )?
    @isx';

    if (preg_match_all($regex, $text, $matches, PREG_SET_ORDER)) {
      foreach ($matches as $m) {
        echo "seasonNumber  = {$m['seasonNumber']}\n";
        echo "episodeNumber = {$m['episodeNumber']}\n";
        echo "episodeName   = {$m['episodeName']}\n";
        // !empty() also covers an episode where the optional group is absent.
        if (!empty($m['aired'])) {
          echo "aired         = {$m['day']}/{$m['month']}/{$m['year']}\n";
        }
        echo "\n";
      }
    }
    ?>
    Using @ as a delimiter is smart; saves having to escape "/".
    Also, I hadn't seen the "{$m[0]}" or "$text =<<< BLOCK" notations before.
    I'd always appended to the string: echo "Array - [".$m[0]." ]";
    Good stuff to know.

    I only had 2 questions on the regular expression and PHP code:
    1. What is the advantage of "[^>]*" or "[^<]*" over ".*?"
    2. Would it not be more efficient to not hold the value of "aired" (ie. have a non-capturing group) and simply check if either "day" or "month" or "year"(or all) hold a value?

    My breakdown of the regular expression (just me stepping through it to make sure I understand :P ):
    (The expression is case insensitive and ignores whitespace, so \s has had to be used)
    Code:
    \d+\.                   - 1 or more digits and then a full stop
    \s+SEASON\s+            - 1 or more occurrences of whitespace, the word "Season", and then 1 or more occurrences of whitespace
    (?<seasonNumber>\d+)    - 1 or more occurrences of a digit and save to "seasonNumber"
    ,\s*                    - a comma and then 0 or more occurrences of whitespace
    
    EP\s+                   - the word "EP", and then 1 or more occurrences of whitespace
    (?<episodeNumber>\d+)   - 1 or more occurrences of a digit and save to "episodeNumber"
    :\s*                    - a colon and then 0 or more occurrences of whitespace
    
    <a\s[^>]*>              - "<a", a whitespace character, 0 or more occurrences of any character but ">", and then a ">"
    (?<episodeName> [^<]*)  - 0 or more occurrences of any character but "<" and save to "episodeName"
    </a>                    - the tag "</a>"
    
    (?<aired>               - Start the group "aired" and save any matches to this group
    (?:(?!</?li>).)*        - Start a non-capturing group which has 0 or more occurrences
                              Inside this group is a negative lookahead for the tag "<li>" or "</li>", followed by a DOT that matches one character
    AIRED:\s+<span>         - the word "Aired:", 1 or more occurrences of whitespace, and then the tag "<span>"
    (?<month> \d+) /        - 1 or more occurrences of a digit and save to "month", and then a forward slash "/"
    (?<day>   \d+) /        - 1 or more occurrences of a digit and save to "day", and then a forward slash "/"
    (?<year>  \d+) </span>  - 1 or more occurrences of a digit and save to "year", and then the tag "</span>"
    )?                      - end the group "aired" and give it 0 or 1 occurrences (optional)
    So then the PHP code:
    you run the preg_match_all() within the if statement to check if it returns true.
    If you didn't do this, the foreach would still fail to run as there is no $matches array, but it would give an error that the variable $matches has not been initialised, right?
    The loop iterates over the array $matches getting value $m, which is another array.
    Because PREG_SET_ORDER is used in the preg_match_all(), it orders each set of matches into an array ($m), which is then an element of the larger array ($matches).
    So then using $m['aired'] as a boolean to check whether to print the air date. I actually want to discard the entry if it does not have an air date, so I will move the if statement up to include all printouts.
    Originally Posted by prometheuzz
    The 'i', 's' and 'x' flags I used after the regex will do this:
    ...
    x - ignore all white spaces, tabs and new lines in the regex pattern. This way, you can spread your pattern over multiple lines and align it nicely.
    Yup, I've been using the "i" and "s" flags. Using "x" is a good idea so it's clear to read.

    Thank you so much, that's brilliant! I've learned lots more about Regular Expressions and PHP.
    Ralf
  10. #6
    Originally Posted by jedi_ralf
    So, using your example for negative look ahead, "?:" is a way of using a group without saving the result. If we remove "?:" it saves the result to the next numbered group (which is inefficient as we don't need it).
    Correct. But note that the time you win may not even be noticeable, especially when working on fairly small strings. I more or less use those non-capturing groups automatically. But leaving them out may be better for the "readability" of your regex, which is also an advantage!

    Originally Posted by jedi_ralf
    ...
    I only had 2 questions on the regular expression and PHP code:
    1. What is the advantage of "[^>]*" or "[^<]*" over ".*?"
    IMO, you should always be as precise as possible when constructing a regex. So, if you want to match until the next ">", use "[^>]*" over ".*?".
    For more information on why this is a Good Thing, have a look at this great article:
    http://www.regular-expressions.info/repeat.html
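
    A small sketch of the difference, assuming a made-up input with two links where only the second one is followed by "!" (the string and the variable names are hypothetical). When the rest of the pattern fails, the lazy DOT is free to grow across tags until the whole pattern fits, while the negated class can never cross its delimiter:

```php
<?php
// Hypothetical input: two links, only the second is followed by "!".
$html = '<a href="1">First</a><a href="2">Second</a>!';

// Lazy dot: ".*?" may expand across tags to make the overall match succeed.
preg_match('~<a.*?>(.*?)</a>!~', $html, $lazy);

// Negated classes: "[^<]*" can never cross a "<", so the match cannot leak
// out of the first link; the engine moves on to the second link instead.
preg_match('~<a[^>]*>([^<]*)</a>!~', $html, $strict);

echo $lazy[1], "\n";   // First</a><a href="2">Second
echo $strict[1], "\n"; // Second
```

    This is the same kind of run-on as an episode without an air date picking up the date of the next episode.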

    Originally Posted by jedi_ralf
    2. Would it not be more efficient to not hold the value of "aired" (ie. have a non-capturing group) and simply check if either "day" or "month" or "year"(or all) hold a value?
    Good catch!
    Yes, that is correct, but note my first remark about the readability of a regex, which is also worth something. So, when working with fairly small strings, leave it in as it enhances the readability, but when working on large amounts of text and you want your regex to be as efficient as possible*, strip it!

    * <advocate of the devil> Is regex the right tool for the job in that case? </advocate of the devil>
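
    A rough sketch of the stripped variant, assuming a shortened, made-up input (one episode with an air date, one without). The "aired" group becomes a plain non-capturing group and the code checks "month" instead. The $withDate array is a hypothetical name, and only episodes that have a date are kept:

```php
<?php
// Hypothetical, shortened input: one episode with an air date, one without.
$text = '<li>1. Season 1, Ep 1: <a href="x">Pilot</a> Aired: <span>1/2/2009</span></li>'
      . '<li>2. Season 1, Ep 2: <a href="y">Second</a></li>';

// The optional part is now (?: ... )? instead of (?<aired> ... )?,
// so only month, day and year are captured.
$regex = '@
  Season\s+  (?<seasonNumber>  \d+   ) ,\s*
  Ep\s+      (?<episodeNumber> \d+   ) :\s*
  <a\s[^>]*> (?<episodeName>   [^<]* ) </a>
  (?:
    (?:(?!</?li>).)*
    Aired:\s+<span> (?<month> \d+ ) / (?<day> \d+ ) / (?<year> \d+ ) </span>
  )?
@isx';

preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

$withDate = [];
foreach ($matches as $m) {
    // !empty() covers both a missing key and an empty string.
    if (!empty($m['month'])) {
        $withDate[] = "{$m['episodeName']}: {$m['month']}/{$m['day']}/{$m['year']}";
    }
}
print_r($withDate);
```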

    Originally Posted by jedi_ralf
    My breakdown of the regular expression (just me stepping through it to make sure I understand :P ):
    (The expression is case insensitive and ignores whitespace, so \s has had to be used)
    Code:
    ...
    Correct.

    Originally Posted by jedi_ralf
    So then the PHP code:
    you run the preg_match_all() within the if statement to check if it returns true.
    If you didn't do this, the foreach would still fail to run as there is no $matches array, but it would give an error that the variable $matches has not been initialised, right?
    No, it will still work if you leave the if-statement out of it, but I find it clearer when using an if statement. Just personal taste.
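
    A minimal sketch of why this is safe, using a made-up subject string with no match: preg_match_all() fills $matches itself, so the variable exists after the call even when nothing was found.

```php
<?php
// $matches does not exist before the call; preg_match_all() creates it.
$count = preg_match_all('/\d+/', 'no digits here', $matches, PREG_SET_ORDER);

var_dump($count);   // int(0)
var_dump($matches); // an empty array

// Looping over an empty array is fine: the body simply never runs.
foreach ($matches as $m) {
    echo $m[0], "\n";
}
```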

    Originally Posted by jedi_ralf
    The loop iterates over the array $matches getting value $m, which is another array.
    Because PREG_SET_ORDER is used in the preg_match_all(), it orders each set of matches into an array ($m), which is then an element of the larger array ($matches).
    Correct.
    Just to test it, you could leave the PREG_SET_ORDER out of it to see how that changes things.
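
    As a quick illustration of that experiment, here is a sketch on a made-up two-episode string (the pattern and variable names are hypothetical). Without the flag, preg_match_all() uses PREG_PATTERN_ORDER and groups the results per capture group instead of per match:

```php
<?php
// Hypothetical input, chosen only to show the two array layouts.
$text  = 'Ep 1, Ep 2';
$regex = '/Ep (?<episodeNumber>\d+)/';

// Default (PREG_PATTERN_ORDER): one sub-array per group, across all matches.
preg_match_all($regex, $text, $byGroup);

// PREG_SET_ORDER: one sub-array per match, holding all of its groups.
preg_match_all($regex, $text, $byMatch, PREG_SET_ORDER);

print_r($byGroup['episodeNumber']);    // the numbers of every match: 1 and 2
echo $byMatch[1]['episodeNumber'], "\n"; // the number of the second match: 2
```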

    Originally Posted by jedi_ralf
    ...
    Thank you so much, that's brilliant! I've learned lots more about Regular Expressions and PHP.
    Ralf
    No problem!
    Within no time you'll be teaching me some things about regex! ; )
    Last edited by prometheuzz; February 16th, 2009 at 12:47 PM.
