Regex Programming
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming Languages - MoreRegex Programming

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old December 28th, 2012, 03:17 AM
mbobh mbobh is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2012
Posts: 3 mbobh User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 m 12 sec
Reputation Power: 0
Matching nested expressions

Dear list,

I'm a real novice when it comes to regular expressions, so be gentle with me.

I would like to match nested expressions. Here's an example of what I want to match:

Suppose I <have> a <tag> like this. But then I might have <tags with <<nested>> tags> in them.

I would like to match:

1. <have>
2. <tag>
3. <tags with <<nested>> tags>
4. <<nested>>

Most are ok, but I'm having trouble getting something to work for 3. I've tried things like:

<[^<]+[^>]>

but this doesn't work. I'm actually wondering if it's even possible to have an expression which can handle all these cases.

Does anyone have an expression that would handle this, or can you give any advice how to proceed?

Many thanks,

Martin

Reply With Quote
  #2  
Old December 28th, 2012, 03:43 AM
requinix's Avatar
requinix requinix is offline
Still alive
Click here for more information.
 
Join Date: Mar 2007
Location: Washington, USA
Posts: 12,711 requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)  Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1
Time spent in forums: 5 Months 1 Week 4 Days 6 h 48 m 5 sec
Reputation Power: 8969
Send a message via AIM to requinix Send a message via MSN to requinix Send a message via Yahoo to requinix Send a message via Google Talk to requinix
It could be but it depends on what flavor of regular expressions you're using. In other words, what programming language are you working with?

Reply With Quote
  #3  
Old December 28th, 2012, 04:53 AM
Jacques1's Avatar
Jacques1 Jacques1 is offline
pollyanna
Click here for more information.
 
Join Date: Jul 2012
Location: Germany
Posts: 1,875 Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level) 
Time spent in forums: 1 Month 2 Weeks 2 Days 3 h 35 m 16 sec
Reputation Power: 813
Hi,

if the nesting can be arbitrarily deep, this is impossible with regular expressions (I mean real regexes, not the pumped up pseudo-regexes by Perl or something).

I know many people think regexes are a kind of all-powerful tool for text manipulation, but they are not. They are in fact the most primitive grammar and cannot be used for complex expressions like nested tags, let alone fully featured languages like XML.

If you want to process nested tags, you'll either need additional code or a more powerful grammar. For example, you could search for both "<" and ">" and use a stack to keep track of the nesting: Whenever you find an opening "<", you add its index to the stack. Whenever you find a closing ">", you take the last index from the stack. This gives you each tag.

Reply With Quote
  #4  
Old December 28th, 2012, 05:33 AM
mbobh mbobh is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2012
Posts: 3 mbobh User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 m 12 sec
Reputation Power: 0
Quote:
Originally Posted by Jacques1
Hi,

if the nesting can be arbitrarily deep, this is impossible with regular expressions (I mean real regexes, not the pumped up pseudo-regexes by Perl or something).

I know many people think regexes are a kind of all-powerful tool for text manipulation, but they are not. They are in fact the most primitive grammar and cannot be used for complex expressions like nested tags, let alone fully featured languages like XML.

If you want to process nested tags, you'll either need additional code or a more powerful grammar. For example, you could search for both "<" and ">" and use a stack to keep track of the nesting: Whenever you find an opening "<", you add its index to the stack. Whenever you find a closing ">", you take the last index from the stack. This gives you each tag.


I'm actually attempting to do this in Objective-C which has ICU based regular expressions. I figured this might be beyond regular expressions. I've been handling this by parsing the string by hand and using simple logic, but I was wondering if regular expressions would provide a mid-way between my brute force approach and a full grammar based parser/tokenizer. It seems that's not so, and I'll have to investigate other options.

Thanks all for the responses!

Martin

Reply With Quote
  #5  
Old December 28th, 2012, 07:38 AM
requinix's Avatar
requinix requinix is offline
Still alive
Click here for more information.
 
Join Date: Mar 2007
Location: Washington, USA
Posts: 12,711 requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)requinix User rank is General 120th Grade (Above 100000 Reputation Level)  Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1Folding Points: 417516 Folding Title: Super Ultimate Folder - Level 1
Time spent in forums: 5 Months 1 Week 4 Days 6 h 48 m 5 sec
Reputation Power: 8969
Send a message via AIM to requinix Send a message via MSN to requinix Send a message via Yahoo to requinix Send a message via Google Talk to requinix
I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem. So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.

ICU's expression syntax looks a lot like Perl's but the docs I found don't explicitly mention supporting the recursion that PCRE expressions do, nor implementing the alternative method that .NET uses (forget what it's called but it's basically a depth counter).
I say it's worth a quick shot to confirm it won't work. Try matching
Code:
<([^<>]+|(?R))+>

against your text: I expect it to find only <have>, <tag>, and <nested>.

Assuming that's what happens, you have at least three options:
1. If you want to replace in text that will not introduce <s or >s then you can do the find/replace iteratively, without recursion, until no more replacements are made.
2. If it might introduce them then you can write something a little complicated that runs the regex intelligently, so as to not match against text that was replaced in.
3. Use a recursive-descent parser. The "grammar" is quite simple ("<" followed by non-<-> characters which are captured followed by ">") so the parser would be simple too.

Reply With Quote
  #6  
Old December 28th, 2012, 10:36 AM
Jacques1's Avatar
Jacques1 Jacques1 is offline
pollyanna
Click here for more information.
 
Join Date: Jul 2012
Location: Germany
Posts: 1,875 Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level)Jacques1 User rank is Lieutenant General (80000 - 90000 Reputation Level) 
Time spent in forums: 1 Month 2 Weeks 2 Days 3 h 35 m 16 sec
Reputation Power: 813
Quote:
Originally Posted by requinix
I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem.


If his language supports more powerful grammars, matching nested tags is possible. That's exactly what I said in the second part.

But those grammars are not regular -- no matter how they call it and what syntax they use.



Quote:
Originally Posted by requinix
So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.


Well, it's a simple fact of informatics. This isn't changed by taking a context-sensitive grammar and (mis-)labeling it as a "regular expression".

Reply With Quote
  #7  
Old December 28th, 2012, 11:25 AM
mbobh mbobh is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2012
Posts: 3 mbobh User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 20 m 12 sec
Reputation Power: 0
Quote:
Originally Posted by requinix
I don't know what kind of regular expressions Jacques is talking about but PCRE/Perl and .NET both support expressions that work for the nested tags problem. So sweeping generalizations like "impossible with regular expressions" and "cannot be used for complex expressions" are not quite true.

ICU's expression syntax looks a lot like Perl's but the docs I found don't explicitly mention supporting the recursion that PCRE expressions do, nor implementing the alternative method that .NET uses (forget what it's called but it's basically a depth counter).
I say it's worth a quick shot to confirm it won't work. Try matching
Code:
<([^<>]+|(?R))+>

against your text: I expect it to find only <have>, <tag>, and <nested>.

Assuming that's what happens, you have at least three options:
1. If you want to replace in text that will not introduce <s or >s then you can do the find/replace iteratively, without recursion, until no more replacements are made.
2. If it might introduce them then you can write something a little complicated that runs the regex intelligently, so as to not match against text that was replaced in.
3. Use a recursive-descent parser. The "grammar" is quite simple ("<" followed by non-<-> characters which are captured followed by ">") so the parser would be simple too.


I tried the expression you suggested but this didn't find the tokens you expected. I wonder if Apple's implementation of ICU regular expressions behaves slightly differently. I'll play around with that some more to try to make progress.

To put more context on my question, I'm actually trying to do syntax highlighting, in particular of LaTeX documents with some additional markup (these <...> <<...>> expressions). I think I'm going to need to do a recursive-descent parser as you suggest. I've read a couple of books about parsers and tokenizing, but I find it quite confusing and am not sure where to start. I think I need to read some more on that to understand the principles. I don't need to fully parse LaTeX source code - only enough to do some syntax highlighting.

Thanks again for the responses.

Martin

Reply With Quote
  #8  
Old December 28th, 2012, 05:25 PM
Laurent_R Laurent_R is offline
Contributing User
Dev Shed Novice (500 - 999 posts)
 
Join Date: Jun 2012
Posts: 511 Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level)Laurent_R User rank is Lieutenant Colonel (40000 - 50000 Reputation Level) 
Time spent in forums: 4 Days 19 h 58 m 22 sec
Reputation Power: 405
Frankly, although it is possible to do it with a powerful regex package such as Perl's regexes or probably PCRE, I agree with Jacques that (1) we are no longer talking of standard regular expressions as defined in classical information theory but of regex extensions (interestingly, in Perl 6, it has been decided to name them not "regular expressions", but "regex", to reflect that fact that they go much beyond the scope of standard regular expressions and can no longer be called that way), and, (2), that, in general, for complicated nested tags, a real parser with an adequate grammar is probably much better than regexes.

Just as, generally speaking, you usually don't want to use regexes to explore HTML or XML files (except possibly for very very very simple transformations not related to the HTML or XML structure).

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreRegex Programming > Matching nested expressions

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap