June 15th, 2009, 05:11 AM
-
Nested capture groups in a repeated group.
I'm having some regex confusion. I want to capture a couple blocks of text within another, repeated block of text. For example, say I have a string like:
Code:
'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>'
and I want to extract the content of each <a> and <b> tag--ie I want to extract 'test1', 'bee1', 'test2', 'bee2', etc. I have tried to use an expression like:
Code:
'some (text<a>([\w]*)</a>stuff<b>([\w]*)</b>)+'
However, my results from this are:
Code:
('text<a>test3</a>stuff<b>bee3</b>', 'test3', 'bee3')
I totally understand why this is the case, but what I really would like to know is how can I do what I want to do here? I'm at a loss.
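For reference, the result above can be reproduced with a short sketch, assuming Python 3 and its standard re module (the variable names are only illustrative). A capture group inside a repeated group is overwritten on every repetition, so only the last iteration survives:

```python
import re

# Sample string from the post above.
text = ('some text<a>test1</a>stuff<b>bee1</b>'
        'text<a>test2</a>stuff<b>bee2</b>'
        'text<a>test3</a>stuff<b>bee3</b>')

# The pattern from the post: two capture groups nested in a repeated group.
pattern = r'some (text<a>([\w]*)</a>stuff<b>([\w]*)</b>)+'

match = re.search(pattern, text)

# Each repetition of the outer group overwrites groups 1, 2 and 3,
# so only the values from the final repetition are left.
groups = match.groups()
print(groups)
```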
June 15th, 2009, 05:54 AM
-
I think you can use the preg_replace_callback() function like this:
PHP Code:
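function myReplaceCallback($matches)
{
    // Dump the full match ($matches[0]) and both captured groups.
    var_dump($matches);
    // Return the match unchanged so the subject string is not altered.
    return $matches[0];
}
$subject = '\'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>\'';
preg_replace_callback('%<a>(.*?)</a>.*?<b>(.*?)</b>%s', 'myReplaceCallback', $subject);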
Result:
Code:
array
0 => string '<a>test1</a>stuff<b>bee1</b>' (length=28)
1 => string 'test1' (length=5)
2 => string 'bee1' (length=4)
array
0 => string '<a>test2</a>stuff<b>bee2</b>' (length=28)
1 => string 'test2' (length=5)
2 => string 'bee2' (length=4)
array
0 => string '<a>test3</a>stuff<b>bee3</b>' (length=28)
1 => string 'test3' (length=5)
2 => string 'bee3' (length=4)
Is this what you wanted?
June 15th, 2009, 06:17 AM
-
Here's a way:
PHP Code:
$text = 'some text<a>test1</a>stuff<b>bee1</b>text<a>test2</a>stuff<b>bee2</b>text<a>test3</a>stuff<b>bee3</b>';
preg_match_all('#<([ab])>(.*?)</\1>#is', $text, $matches);
print_r($matches);
June 15th, 2009, 11:33 AM
-
SteffenL, maybe I should have mentioned that I'm using python, not PHP.
prometheuzz, I see what you are doing there, but I don't think that would work for me, even if I was using php, because my example above is just an example--the real regular expression is the same thing, but with different text, and also it is part of a larger regular expression.
Is there a Python equivalent of preg_replace_callback(), maybe?
June 15th, 2009, 11:42 AM
-
Originally Posted by jacobe
SteffenL, maybe I should have mentioned that I'm using python, not PHP.
Ooh, I am sorry. For some reason I assumed it was PHP. Silly me. I do not know Python but I will look around a little later anyways unless you already have your solution by then.
June 15th, 2009, 12:06 PM
-
Originally Posted by jacobe
prometheuzz, I see what you are doing there, but I don't think that would work for me, even if I was using php,
Why not? Using Python's re.compile(regex).findall(text) you can get the output as you described in your original post.
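A minimal sketch of that, assuming Python 3 and the sample string from the first post (the variable names and the two-step split are illustrative, not the only way to do it). With more than one capture group, findall returns one tuple per match. If the repeated block sits inside a larger expression, one option is to match the whole repeated region first and then run findall on just that part:

```python
import re

text = ('some text<a>test1</a>stuff<b>bee1</b>'
        'text<a>test2</a>stuff<b>bee2</b>'
        'text<a>test3</a>stuff<b>bee3</b>')

# The inner pattern on its own: one tuple of groups per match.
inner = re.compile(r'<a>(\w*)</a>stuff<b>(\w*)</b>')
pairs = inner.findall(text)
print(pairs)

# Two-step variant: the larger expression captures the whole repeated
# region (the repeated group itself is non-capturing), then findall
# pulls the individual pairs out of that region only.
outer = re.search(r'some ((?:text<a>\w*</a>stuff<b>\w*</b>)+)', text)
pairs_in_region = inner.findall(outer.group(1))
print(pairs_in_region)
```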
Originally Posted by jacobe
because my example above is just an example--the real regular expression is the same thing, but with different text, and also it is part of a larger regular expression.
I often see that happen: someone asking for help oversimplifies the problem, and the resulting answers are ones the asker cannot use.
I think it would be better to describe what you're really trying to do and provide real input data instead of the simplified examples you posted.
Good luck.
June 15th, 2009, 05:58 PM
-
Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. I would not wish that on anyone.
Use a proper XML parser such as BeautifulSoup or ElementTree instead, and save yourself hours of futility.
BTW, the Python equivalent to PHP's preg_replace_callback function is to use re.sub(...) with a callable instead of the replacement string. It will get called for each match and the return value will replace the matched substring. This will not help you parse nested tags though.
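Roughly like this, as a sketch assuming Python 3 and the sample string and pattern from earlier in the thread (the collect function and pairs list are made-up names for illustration). The callable receives a match object for each match and must return the replacement string; returning the matched text leaves the input unchanged:

```python
import re

text = ('some text<a>test1</a>stuff<b>bee1</b>'
        'text<a>test2</a>stuff<b>bee2</b>'
        'text<a>test3</a>stuff<b>bee3</b>')

pairs = []

def collect(match):
    # Called once per match: record both captured groups.
    pairs.append(match.groups())
    # Return the matched text so the string is not modified.
    return match.group(0)

# re.DOTALL plays the role of the "s" modifier in the PHP pattern.
result = re.sub(r'<a>(.*?)</a>.*?<b>(.*?)</b>', collect, text, flags=re.DOTALL)
print(pairs)
```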
Dave
June 16th, 2009, 12:55 AM
-
Originally Posted by DevCoach
Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. ...
Note that the OP doesn't want to match nested tags, but s/he is grouping certain text in his/her regex pattern, and that group consists of two other groups (nested groups).
But your suggestion is still a valid one (matching nested tags or not): use a proper parser on (x)html and don't go hacking your way through it with regex.
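For the case where the input really is (x)html, here is a rough sketch of the parser route, assuming Python 3 and the standard library's html.parser module rather than BeautifulSoup or ElementTree (the TagTextCollector class is a hypothetical helper written for this example):

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collects the text found directly inside the wanted tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # wanted tag we are currently inside, if any
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Text outside the wanted tags is ignored.
        if self.current is not None:
            self.found.append((self.current, data))

text = ('some text<a>test1</a>stuff<b>bee1</b>'
        'text<a>test2</a>stuff<b>bee2</b>'
        'text<a>test3</a>stuff<b>bee3</b>')

collector = TagTextCollector(['a', 'b'])
collector.feed(text)
collector.close()
print(collector.found)
```

This does not handle wanted tags nested inside each other; it is only meant to show the general shape of the approach.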
June 16th, 2009, 01:00 AM
-
Originally Posted by DevCoach
Turn back now! Trying to match nested tags with regex will only lead to misery, insanity and eventually baldness. I would not wish that on anyone.
Use a proper XML parser such as BeautifulSoup or ElementTree instead, and save yourself hours of futility.
BTW, the Python equivalent to PHP's preg_replace_callback function is to use re.sub(...) with a callable instead of the replacement string. It will get called for each match and the return value will replace the matched substring. This will not help you parse nested tags though.
Dave
While I thank you for the words of warning, I have to disagree! I was about to give up, as per your suggestion, when I tried something that had somehow previously slipped by me: re.findall(). This method does exactly what I needed.
Thanks for your help everyone.
June 16th, 2009, 01:03 AM
-
Originally Posted by prometheuzz
Note that the OP doesn't want to match nested tags, but s/he is grouping certain text in his/her regex pattern, and that group consists of two other groups (nested groups).
But your suggestion is still a valid one (matching nested tags or not): use a proper parser on (x)html and don't go hacking your way through it with regex.
fyi, the use of angle brackets and html/xml/whatever-ish syntax was just an example...I was more interested in being able to do this in general, than just doing it with an html document.
June 16th, 2009, 01:16 AM
-
Originally Posted by jacobe
fyi, the use of angle brackets and html/xml/whatever-ish syntax was just an example...I was more interested in being able to do this in general, than just doing it with an html document.
Well, there you go. My previous remark stands: explain your actual problem, and don't oversimplify or give input that you're not really working with.
June 16th, 2009, 04:50 AM
-
Originally Posted by jacobe
While I thank you for the words of warning, I have to disagree!
Perhaps you disagree because you found a regex solution to your problem. But note that it is generally a better idea to parse (x)html using a true html parser instead of using regex for this.
See: http://kore-nordmann.de/blog/do_NOT_...ng_regexp.html
Originally Posted by jacobe
I was about to give up, as per your suggestion, when I tried something that had somehow previously slipped by me: re.findall(). This method does exactly what I needed.
As I previously suggested...
Originally Posted by jacobe
Thanks for your help everyone.
You're welcome.
Last edited by prometheuzz; June 16th, 2009 at 04:55 AM.