#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    4
    Rep Power
    0

    Nested capture groups in a repeated group.


    I'm having some regex confusion. I want to capture a couple blocks of text within another, repeated block of text. For example, say I have a string like:
    Code:
    'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>'
    and I want to extract the content of each <a> and <b> tag, i.e. I want to extract 'test1', 'bee1', 'test2', 'bee2', etc. I have tried an expression like:
    Code:
    'some (text<a>([\w]*)</a>stuff<b>([\w]*)</b>)+'
    However, my results from this are:
    Code:
    ('text<a>test3</a>stuff<b>bee3</b>', 'test3', 'bee3')
    I totally understand why this is the case, but what I really would like to know is how can I do what I want to do here? I'm at a loss.
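    For reference, a minimal sketch that reproduces this result, assuming Python's standard re module (the variable names are only illustrative). A repeated group keeps just the text of its final iteration, which is why only the third block shows up:

```python
import re

text = ("some text<a>test1</a>stuff<b>bee1</b>"
        "text<a>test2</a>stuff<b>bee2</b>"
        "text<a>test3</a>stuff<b>bee3</b>")

# The outer group repeats three times, but every group (outer and inner)
# only remembers what it matched on its last iteration.
pattern = r"some (text<a>([\w]*)</a>stuff<b>([\w]*)</b>)+"
match = re.search(pattern, text)
last_groups = match.groups()
print(last_groups)
# ('text<a>test3</a>stuff<b>bee3</b>', 'test3', 'bee3')
```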
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    76
    Rep Power
    114
    I think you can use the preg_replace_callback() function like this:
    PHP Code:
    function myReplaceCallback($matches)
    {
    	var_dump($matches);
    }
     
    $subject = '\'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>\'';
    preg_replace_callback('%<a>(.*?)</a>.*?<b>(.*?)</b>%s', 'myReplaceCallback', $subject);

    Result:
    Code:
    array
      0 => string '<a>test1</a>stuff<b>bee1</b>' (length=28)
      1 => string 'test1' (length=5)
      2 => string 'bee1' (length=4)
    
    array
      0 => string '<a>test2</a>stuff<b>bee2</b>' (length=28)
      1 => string 'test2' (length=5)
      2 => string 'bee2' (length=4)
    
    array
      0 => string '<a>test3</a>stuff<b>bee3</b>' (length=28)
      1 => string 'test3' (length=5)
      2 => string 'bee3' (length=4)
    Is this what you wanted?
  4. #3
  5. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Here's a way:

    PHP Code:
    $text = 'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>';
    preg_match_all('#<([ab])>(.*?)</\1>#is', $text, $matches);
    print_r($matches);

    Comments on this post

    • SteffenL agrees : I learned something. Thank you. :)
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    4
    Rep Power
    0
    SteffenL, maybe I should have mentioned that I'm using python, not PHP.

    prometheuzz, I see what you are doing there, but I don't think that would work for me, even if I was using php, because my example above is just an example--the real regular expression is the same thing, but with different text, and also it is part of a larger regular expression.

    Is there a Python equivalent of preg_replace_callback(), maybe?
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    76
    Rep Power
    114
    Originally Posted by jacobe
    SteffenL, maybe I should have mentioned that I'm using python, not PHP.
    Ooh, I am sorry. For some reason I assumed it was PHP. Silly me. I do not know Python but I will look around a little later anyways unless you already have your solution by then.
  10. #6
  11. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by jacobe
    prometheuzz, I see what you are doing there, but I don't think that would work for me, even if I was using php,
    Why not? Using Python's re.compile(regex).findall(text) you can get the output as you described in your original post.
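    A minimal sketch of that, assuming the sample string from the first post and a pattern that matches one <a>/<b> pair at a time (the names pair_pattern, pairs and flat are only illustrative):

```python
import re

text = ("some text<a>test1</a>stuff<b>bee1</b>"
        "text<a>test2</a>stuff<b>bee2</b>"
        "text<a>test3</a>stuff<b>bee3</b>")

# No outer repetition here: findall() restarts the search after each match,
# so every <a>/<b> pair is reported instead of only the last one.
pair_pattern = re.compile(r"<a>(\w*)</a>stuff<b>(\w*)</b>")
pairs = pair_pattern.findall(text)
print(pairs)
# [('test1', 'bee1'), ('test2', 'bee2'), ('test3', 'bee3')]

# Flatten the tuples if a single list of values is wanted.
flat = [value for pair in pairs for value in pair]
print(flat)
# ['test1', 'bee1', 'test2', 'bee2', 'test3', 'bee3']
```

    With more than one capture group in the pattern, findall() returns one tuple per match, in the order the matches occur.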

    Originally Posted by jacobe
    because my example above is just an example--the real regular expression is the same thing, but with different text, and also it is part of a larger regular expression.
    I often see that happening: someone asking for help oversimplifies the problem, resulting in answers that the person asking cannot use.

    I think it would be better to describe what you're really trying to do and provide real input data instead of the simple examples you have posted.

    Good luck.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Feb 2004
    Location
    London, England
    Posts
    1,585
    Rep Power
    1373
    Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. I would not wish that on anyone.

    Use a proper XML parser such as BeautifulSoup or ElementTree instead, and save yourself hours of futility.
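    As a rough sketch of the parser route with the standard library's ElementTree: the sample string has no single root element, so this wraps it in a made-up <root> tag, which is an assumption for illustration only. Real input that is not well-formed XML would need a more lenient parser.

```python
import xml.etree.ElementTree as ET

fragment = ("some text<a>test1</a>stuff<b>bee1</b>"
            "text<a>test2</a>stuff<b>bee2</b>"
            "text<a>test3</a>stuff<b>bee3</b>")

# ElementTree needs exactly one root element; "<root>" is a hypothetical wrapper.
root = ET.fromstring(f"<root>{fragment}</root>")

# Iterating over an element yields its direct children in document order.
contents = [(child.tag, child.text) for child in root if child.tag in ("a", "b")]
print(contents)
# [('a', 'test1'), ('b', 'bee1'), ('a', 'test2'), ('b', 'bee2'), ('a', 'test3'), ('b', 'bee3')]
```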

    BTW, the Python equivalent to PHP's preg_replace_callback function is to use re.sub(...) with a callable instead of the replacement string. It will get called for each match and the return value will replace the matched substring. This will not help you parse nested tags though.
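    A small sketch of re.sub() with a callable, reusing the pattern from the PHP example earlier in the thread. The callback name record_and_upper and the seen list are invented for this illustration:

```python
import re

text = ("some text<a>test1</a>stuff<b>bee1</b>"
        "text<a>test2</a>stuff<b>bee2</b>"
        "text<a>test3</a>stuff<b>bee3</b>")

seen = []

def record_and_upper(match):
    """Called once per match; the return value replaces the matched text."""
    seen.append((match.group(1), match.group(2)))
    return match.group(0).upper()

result = re.sub(r"<a>(.*?)</a>.*?<b>(.*?)</b>", record_and_upper, text, flags=re.DOTALL)

print(seen)
# [('test1', 'bee1'), ('test2', 'bee2'), ('test3', 'bee3')]
print(result)
```

    The callback receives a match object rather than an array of strings, so the groups are read with match.group(n).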

    Dave

    Comments on this post

    • prometheuzz agrees
  14. #8
  15. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by DevCoach
    Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. ...
    Note that the OP doesn't want to match nested tags, but s/he is grouping certain text in his/her regex pattern, and that group consists of two other groups (nested groups).

    But your suggestion is still a valid one (matching nested tags or not): use a proper parser on (x)html and don't go hacking your way through it with regex.
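    On the nested-groups point, one possible way to keep the repeated block inside a larger expression is to work in two passes: capture the whole repeated run with the outer pattern, then scan that run with the inner pattern. This is only a sketch; the outer and inner patterns below are assumptions based on the example in the first post, not the OP's real expression.

```python
import re

text = ("some text<a>test1</a>stuff<b>bee1</b>"
        "text<a>test2</a>stuff<b>bee2</b>"
        "text<a>test3</a>stuff<b>bee3</b>")

# Pass 1: the larger expression captures the entire repeated run as one group.
# The repeated part is non-capturing, so group 1 spans all iterations.
outer = re.compile(r"some ((?:text<a>\w*</a>stuff<b>\w*</b>)+)")
# Pass 2: the inner pattern describes a single iteration with its two groups.
inner = re.compile(r"text<a>(\w*)</a>stuff<b>(\w*)</b>")

outer_match = outer.search(text)
repeated_run = outer_match.group(1)
pairs = inner.findall(repeated_run)
print(pairs)
# [('test1', 'bee1'), ('test2', 'bee2'), ('test3', 'bee3')]
```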
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    4
    Rep Power
    0
    Originally Posted by DevCoach
    Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. I would not wish that on anyone.

    Use a proper XML parser such as BeautifulSoup or ElementTree instead, and save yourself hours of futility.

    BTW, the Python equivalent to PHP's preg_replace_callback function is to use re.sub(...) with a callable instead of the replacement string. It will get called for each match and the return value will replace the matched substring. This will not help you parse nested tags though.

    Dave
    While I thank you for the words of warning, I have to disagree! I was about to give up, as per your suggestion, when I tried something that had somehow previously slipped by me: re.findall(). This method does exactly what I needed.

    Thanks for your help everyone.
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2009
    Posts
    4
    Rep Power
    0
    Originally Posted by prometheuzz
    Note that the OP doesn't want to match nested tags, but s/he is grouping certain text in his/her regex pattern, and that group consists of two other groups (nested groups).

    But your suggestion is still a valid one (matching nested tags or not): use a proper parser on (x)html and don't go hacking your way through it with regex.
    fyi, the use of angle brackets and html/xml/whatever-ish syntax was just an example...I was more interested in being able to do this in general, than just doing it with an html document.
  20. #11
  21. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by jacobe
    fyi, the use of angle brackets and html/xml/whatever-ish syntax was just an example...I was more interested in being able to do this in general, than just doing it with an html document.
    Well, there you go. My previous remark stands: explain your actual problem, and don't oversimplify or post input that you're not really working with.
  22. #12
  23. No Profile Picture
    User 165270
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2005
    Posts
    497
    Rep Power
    937
    Originally Posted by jacobe
    While I thank you for the words of warning, I have to disagree!
    Perhaps you disagree because you found a regex solution to your problem. But note that it is generally a better idea to parse (x)html using a true html parser instead of using regex for this.

    See: http://kore-nordmann.de/blog/do_NOT_...ng_regexp.html

    Originally Posted by jacobe
    I was about to give up, as per your suggestion, when I tried something that had somehow previously slipped by me: re.findall(). This method does exactly what I needed.
    As I previously suggested...

    Originally Posted by jacobe
    Thanks for your help everyone.
    You're welcome.
    Last edited by prometheuzz; June 16th, 2009 at 04:55 AM.
