Regex Programming
 
#1  Luciano, August 24th, 2012, 06:32 AM
Need Help optimizing regex

Hello,
I am writing a sort of glossary script, the kind that links words from a list and shows a popup with an explanation.

The important part is the regex that parses the keywords in the text.
Example (just using bold tags to show the parsed output):

PHP Code:
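$word = 'test';
$textstring = 'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.';

// Wrap each match in bold tags to make the parsed output visible.
$newword = '<b>' . $word . '</b>';

// Match the keyword as a whole word, but not inside a tag (between < and >).
$regex = '/\b(?!<.*?)' . trim($word) . '(?![^<>]*?>)\b/siU';

echo preg_replace($regex, $newword, $textstring);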


This works fine, and the output is:

this is a <b>test</b> <a class="test" href="www.test.com">some <b>test</b> examples</a> and another <b>test</b>.

Up to here everything is OK.
--------------

Now what I am trying to do is exclude the links completely: not only the inside of the anchor tag itself, but also the text in between (the innerHTML of the anchor).
To explain, I want the output to be:

this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
(the text "some test examples" not being parsed.)

I have tried nearly everything, for example:
$regex = '/\b(?!<a.*?)|(?!<.*?)'.trim($word).'(?![^<>]*?>)|(?![^<>]*?</a>)\b/siU';
but obviously it doesn't work.
Has anybody got any ideas to point me in the right direction?

Help would be much appreciated.

Luc

Last edited by Luciano : August 24th, 2012 at 06:35 AM.

#2  benno32 (Sydney, Australia), August 24th, 2012, 07:04 AM
I generally try to avoid using regex to parse HTML, as there are so many different variables to consider. Could you use the DOMDocument class to help isolate text nodes?
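
For illustration, here is a minimal sketch of what the DOMDocument route could look like. The function name `highlightOutsideLinks` and the wrapper id `glossary-root` are hypothetical, and the sketch assumes plain ASCII input and a small HTML fragment rather than a full page. It is not tuned for SMF.

```php
<?php
// Hypothetical helper: wraps whole-word matches of $word in <b>,
// but only in text nodes that are not inside an <a> element.
function highlightOutsideLinks(string $html, string $word): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep libxml warnings out of the output
    // Wrap the fragment in a known container so it can be found again later.
    $doc->loadHTML('<div id="glossary-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="glossary-root"]')->item(0);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';

    // Select every text node under the container that has no <a> ancestor.
    $nodes = $xpath->query('.//text()[not(ancestor::a)]', $root);
    foreach (iterator_to_array($nodes) as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matched keyword.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // keyword not present in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $bold = $doc->createElement('b');
                $bold->appendChild($doc->createTextNode($part));
                $fragment->appendChild($bold);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the container, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = highlightOutsideLinks(
    'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.',
    'test'
);
echo $out, "\n";
```

Because the keyword search only ever sees text nodes, attribute values such as class="test" are never touched, and the XPath predicate is what keeps the link text out.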

#3  Luciano, August 24th, 2012, 07:36 AM
Hi benno,
Thank you for your reply.

In this case it has to be done this way because it is integrated into forum software (SMF 2.02). Any other solution would require modifications that cannot be installed and uninstalled automatically.
The text string is the message (the post).
As it is a live, working forum, I want to change as little as possible.
I wrote the existing regex and it works fine. The only problem is when the text of a link contains a keyword. This only happens when a user posts a BBCode text link such as [ url=http:mylink] keyword [/url], which is why I want to exclude the keyword from parsing there.

That is why I do it this way.
It is actually pretty fast, and 10 to 15 glossary terms per post work fine.

Luc

PS: The parsing could of course be done by replacing all links in the message with a coded string and putting them back after the parsing, but that would use far more resources.

Last edited by Luciano : August 24th, 2012 at 07:49 AM.
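
For reference, the placeholder idea from the PS could be sketched roughly as below. The function name `highlightWithPlaceholders` and the NUL-delimited token format are assumptions made for this illustration. The sketch also assumes links are not nested and that no glossary term is a bare number (which could collide with the token).

```php
<?php
// Hypothetical helper: hide links behind tokens, highlight, then restore.
function highlightWithPlaceholders(string $text, string $word): string
{
    $stash = [];

    // 1. Swap every <a ...>...</a> for a token that cannot contain the keyword.
    $masked = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use (&$stash): string {
            $key = "\x00" . count($stash) . "\x00";
            $stash[$key] = $m[0];
            return $key;
        },
        $text
    );

    // 2. Highlight whole-word matches that are not inside a remaining tag.
    $regex = '/\b' . preg_quote($word, '/') . '\b(?![^<>]*>)/i';
    $masked = preg_replace($regex, '<b>$0</b>', $masked);

    // 3. Put the original links back.
    return $stash ? strtr($masked, $stash) : $masked;
}

$masked = highlightWithPlaceholders(
    'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.',
    'test'
);
echo $masked, "\n";
// this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
```

This costs two extra passes over the message per call. Masking the links once and then running all glossary terms against the masked text would spread that cost across the terms.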

#4  spacebar208, August 25th, 2012, 04:03 PM
Try this regex:
Code:
\b(test)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
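
As a quick check, the pattern above can be dropped into the example from post #1. This is only a sketch: the helper name `glossaryRegex` is hypothetical, the `i` flag and the `preg_quote` call are additions that the suggested regex does not include, and the `s` flag may also be needed if link text can span several lines.

```php
<?php
// Hypothetical helper: builds the suggested pattern for an arbitrary glossary word.
function glossaryRegex(string $word): string
{
    // Escape regex metacharacters (and the "/" delimiter) in the keyword.
    $w = preg_quote(trim($word), '/');
    return '/\b(' . $w . ')\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)/i';
}

$textstring = 'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.';

// $1 is the captured keyword, so the original capitalization is preserved.
$result = preg_replace(glossaryRegex('test'), '<b>$1</b>', $textstring);
echo $result, "\n";
// this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
```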

#5  Luciano, August 26th, 2012, 12:13 AM
Wow!
Thank you a bunch.
That works exactly as expected!
(But I must admit I don't understand every part of it:
why the h in <\/?[ha], for example?)
I tried it without the h, and it also works.

But on my live board I wouldn't dare remove the h until I know exactly what I am doing.

Thank you again for helping so fast and so well!

Luc
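
On the h question: the class [ha] matches any tag whose name starts with h or a, so besides links it also covers h1 to h6 (and names such as hr or abbr). A small sketch comparing the two variants on a made-up heading (the sample string is an assumption, not data from the thread) shows the difference:

```php
<?php
// Made-up sample: one keyword inside a heading, one in plain text.
$html = '<h2>test heading</h2> plain test';

// Pattern as suggested, with [ha].
$withH = '/\b(test)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)/';
// Same pattern with the h removed, so only tags starting with "a" are excluded.
$withoutH = '/\b(test)\b(?!(?:(?!<\/?a.*?>).)*<\/a.*?>)(?![^<>]*>)/';

$keptHeading = preg_replace($withH, '<b>$1</b>', $html);
$boldHeading = preg_replace($withoutH, '<b>$1</b>', $html);

echo $keptHeading, "\n"; // <h2>test heading</h2> plain <b>test</b>
echo $boldHeading, "\n"; // <h2><b>test</b> heading</h2> plain <b>test</b>
```

So on text without headings both variants behave the same, which matches the observation that it "also works" without the h.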

#6  spacebar208, August 26th, 2012, 04:16 PM
I can't remember why I had 'h' in the list. It must have been something in my data that needed it at the time I created the regex. A description of the regex is below:
Code:
## Ignore if search item found in HTML tag

\b(test)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)

Assert position at a word boundary
Match the regular expression below and capture its match into backreference number 1
   Match the characters 'test' literally
Assert position at a word boundary
Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)
   Match the regular expression below: (?:(?!<\/?[ha].*?>).)*
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?!<\/?[ha].*?>)
         Match the character '<' literally
         Match the character '/' literally
            Between zero and one times, as many times as possible, giving back as needed (greedy)
         Match a single character present in the list 'ha'
         Match any single character that is not a line break character
            Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
         Match the character '>' literally
      Match any single character that is not a line break character
   Match the character '<' literally
   Match the character '/' literally
   Match a single character present in the list 'ha'
   Match any single character that is not a line break character
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
   Match the character '>' literally
Assert that it is impossible to match the regex below starting at this position (negative lookahead): (?![^<>]*>)
   Match a single character NOT present in the list '<>'
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   Match the character '>' literally
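
The same breakdown can also be kept next to the pattern itself by using PCRE's extended mode (the x flag), which ignores whitespace and allows # comments. This is a sketch only: the variable names are made up, and it uses a different delimiter than the forum code so that the slashes need no escaping.

```php
<?php
// Annotated copy of the pattern. Nowdoc syntax avoids any quote escaping.
$annotated = <<<'RE'
~
\b(test)\b              # the keyword as a whole word, captured as group 1
(?!                     # reject the match if, looking ahead,
  (?:(?!</?[ha].*?>).)* #   a run of characters that starts no h... or a... tag
  </[ha].*?>            #   ends at a closing h... or a... tag (we are inside that element)
)
(?![^<>]*>)             # reject if ">" comes before any "<" (we are inside a tag)
~x
RE;

$textstring = 'this is a test <a class="test" href="www.test.com">some test examples</a> and another test.';
$annotatedResult = preg_replace($annotated, '<b>$1</b>', $textstring);
echo $annotatedResult, "\n";
// this is a <b>test</b> <a class="test" href="www.test.com">some test examples</a> and another <b>test</b>.
```

The comments must not contain the delimiter character, because PHP looks for the closing delimiter before PCRE gets to interpret the comments.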
