#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    3
    Rep Power
    0

    Matching nested expressions


    Dear list,

    I'm a real novice when it comes to regular expressions, so be gentle with me.

    I would like to match nested expressions. Here's an example of what I want to match:

    Suppose I <have> a <tag> like this. But then I might have <tags with <<nested>> tags> in them.

    I would like to match:

    1. <have>
    2. <tag>
    3. <tags with <<nested>> tags>
    4. <<nested>>

    Most are ok, but I'm having trouble getting something to work for 3. I've tried things like:

    <[^<]+[^>]>

    but this doesn't work. I'm actually wondering if it's even possible to have an expression which can handle all these cases.

    Does anyone have an expression that would handle this, or can you give any advice how to proceed?

    Many thanks,

    Martin
  2. #2
  3. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,005
    Rep Power
    9398
    It could be but it depends on what flavor of regular expressions you're using. In other words, what programming language are you working with?
  4. #3
  5. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    Hi,

    if the nesting can be arbitrarily deep, this is impossible with regular expressions (I mean real regexes, not the pumped up pseudo-regexes by Perl or something).

    I know many people think regexes are a kind of all-powerful tool for text manipulation, but they are not. They are in fact the most primitive grammar and cannot be used for complex expressions like nested tags, let alone fully featured languages like XML.

    If you want to process nested tags, you'll either need additional code or a more powerful grammar. For example, you could search for both "<" and ">" and use a stack to keep track of the nesting: Whenever you find an opening "<", you add its index to the stack. Whenever you find a closing ">", you take the last index from the stack. This gives you each tag.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    3
    Rep Power
    0
    Originally Posted by Jacques1
    Hi,

    if the nesting can be arbitrarily deep, this is impossible with regular expressions (I mean real regexes, not the pumped up pseudo-regexes by Perl or something).

    I know many people think regexes are a kind of all-powerful tool for text manipulation, but they are not. They are in fact the most primitive grammar and cannot be used for complex expressions like nested tags, let alone fully featured languages like XML.

    If you want to process nested tags, you'll either need additional code or a more powerful grammar. For example, you could search for both "<" and ">" and use a stack to keep track of the nesting: Whenever you find an opening "<", you add its index to the stack. Whenever you find a closing ">", you take the last index from the stack. This gives you each tag.
    I'm actually attempting to do this in Objective-C which has ICU based regular expressions. I figured this might be beyond regular expressions. I've been handling this by parsing the string by hand and using simple logic, but I was wondering if regular expressions would provide a mid-way between my brute force approach and a full grammar based parser/tokenizer. It seems that's not so, and I'll have to investigate other options.

    Thanks all for the responses!

    Martin
  8. #5
  9. Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,005
    Rep Power
    9398
    I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem. So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.

    ICU's expression syntax looks a lot like Perl's but the docs I found don't explicitly mention supporting the recursion that PCRE expressions do, nor implementing the alternative method that .NET uses (forget what it's called but it's basically a depth counter).
    I say it's worth a quick shot to confirm it won't work. Try matching
    Code:
    <([^<>]+|(?R))+>
    against your text: I expect it to find only <have>, <tag>, and <nested>.

    Assuming that's what happens, you have at least three options:
    1. If you want to replace in text that will not introduce <s or >s then you can do the find/replace iteratively, without recursion, until no more replacements are made.
    2. If it might introduce them then you can write something a little complicated that runs the regex intelligently, so as to not match against text that was replaced in.
    3. Use a recursive-descent parser. The "grammar" is quite simple ("<" followed by non-<-> characters which are captured followed by ">") so the parser would be simple too.
  10. #6
  11. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,957
    Rep Power
    1046
    Originally Posted by requinix
    I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem.
    If his language supports more powerful grammars, matching nested tags is possible. That's exactly what I said in the second part.

    But those grammars are not regular -- no matter how they call it and what syntax they use.



    Originally Posted by requinix
    So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.
    Well, it's a simple fact of informatics. This isn't changed by taking a context-sensitive grammar and (mis-)labeling it as a "regular expression".
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2012
    Posts
    3
    Rep Power
    0
    Originally Posted by requinix
    I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem. So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.

    ICU's expression syntax looks a lot like Perl's but the docs I found don't explicitly mention supporting the recursion that PCRE expressions do, nor implementing the alternative method that .NET uses (forget what it's called but it's basically a depth counter).
    I say it's worth a quick shot to confirm it won't work. Try matching
    Code:
    <([^<>]+|(?R))+>
    against your text: I expect it to find only <have>, <tag>, and <nested>.

    Assuming that's what happens, you have at least three options:
    1. If you want to replace in text that will not introduce <s or >s then you can do the find/replace iteratively, without recursion, until no more replacements are made.
    2. If it might introduce them then you can write something a little complicated that runs the regex intelligently, so as to not match against text that was replaced in.
    3. Use a recursive-descent parser. The "grammar" is quite simple ("<" followed by non-<-> characters which are captured followed by ">") so the parser would be simple too.
    I tried the expression you suggested but this didn't find the tokens you expected. I wonder if Apple's implementation of ICU regular expressions behaves slightly differently. I'll play around with that some more to try to make progress.

    To put more context on my question, I'm actually trying to do syntax highlighting, in particular of LaTeX documents with some additional markup (these <...> <<...>> expressions). I think I'm going to need to do a recursive-descent parser as you suggest. I've read a couple of books about parsers and tokenizing, but I find it quite confusing and am not sure where to start. I think I need to read some more on that to understand the principles. I don't need to fully parse LaTeX source code - only enough to do some syntax highlighting.

    Thanks again for the responses.

    Martin
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Frankly, although it is possible to do it with a powerful regex package such as Perl's regexes or probably PCRE, I agree with Jacques that (1) we are no longer talking of standard regular expressions as defined in classical information theory but of regex extensions (interestingly, in Perl 6, it has been decided to name them not "regular expressions", but "regex", to reflect that fact that they go much beyond the scope of standard regular expressions and can no longer be called that way), and, (2), that, in general, for complicated nested tags, a real parser with an adequate grammar is probably much better than regexes.

    Just as, generally speaking, you usually don't want to use regexes to explore HTML or XML files (except possibly for very very very simple transformations not related to the HTML or XML structure).

IMN logo majestic logo threadwatch logo seochat tools logo