#1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    3
    Rep Power
    0

    Find a text, but not inside another one


    Hi all,

    I'm looking for a regex that finds a piece of text, but only when that text is not inside another piece of text (a comment).

    E.g. when searching for the string "<data>", occurrences inside the following comment (which starts with "<!---" and ends with "--->") should not be found:
    Code:
    <!---
    this is a comment about my <data>.
    but the <data> should not be found 
    --->
    On the other hand, in this case there are two possible hits:
    Code:
    <data> hits first here.
    <!---
    this is a comment about my <data>.
    but the <data> should not be found 
    --->
    And here comes hit no. 2 for <data>
    Thanks in advance for your help.
#2
    Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,066
    Rep Power
    9398
    If that's XML then you should be using XML methods, not regular expressions. Exactly how depends on what language(s) you're using.
#3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    3
    Rep Power
    0
    Hi requinix,

    Thanks for your reply.

    But it's not XML, it's a ColdFusion source file, so you should treat it as a normal string. Also, I will not only be searching for certain tags, but for normal text too.

    With "(<!---.*?--->)" I can find any comment. My question now is how to invert this regex so that it matches any text outside of comments and then, as the next step, finds the text I'm searching for ("<data>" in the example above) only there.
#4
    Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,066
    Rep Power
    9398
    Then it's HTML (or looks like it)? DOM is the answer.

    I'm trying to steer you away from regular expressions because it's a nightmare to write something that respects HTML or XML grammar. Simply ignoring comments is not simple.
#5
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Hi,

    You might look for both comments and the actual search pattern and then skip the comments afterwards.

    I.e.:
    Code:
    /<!---.*?--->|YOURPATTERN/
    However, this isn't really a good solution. I fully agree with requinix that you should use a parser rather than fumble with regular expressions.
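
    The "match comments too, then skip them" idea above can be sketched roughly as follows. This is only an illustration in Python, not something from the thread: the helper name find_outside_comments and the sample string are made up, and it assumes comments are not nested and that the search pattern itself should not be affected by the dot-matches-newline flag.

```python
import re

# Hypothetical helper (not from the thread): match either a whole comment or
# the search pattern, then keep only the matches that are not comments.
COMMENT = r"<!---.*?--->"


def find_outside_comments(text, pattern):
    """Return start offsets of `pattern` matches that lie outside comments."""
    # The comment alternative is listed first, so wherever a comment starts it
    # is consumed as a whole. Group 1 is set only for real hits.
    # re.DOTALL lets ".*?" span line breaks inside a comment (it also applies
    # to `pattern`, which is an assumption of this sketch).
    combined = re.compile(f"{COMMENT}|({pattern})", re.DOTALL)
    return [m.start(1) for m in combined.finditer(text) if m.group(1) is not None]


sample = (
    "<data> hits first here.\n"
    "<!---\n"
    "this is a comment about my <data>.\n"
    "but the <data> should not be found\n"
    "--->\n"
    "And here comes hit no. 2 for <data>\n"
)

hits = find_outside_comments(sample, re.escape("<data>"))
print(len(hits))  # 2
```

    Because the regex engine scans left to right, a "<data>" inside a comment is never reached on its own: the comment alternative has already swallowed it. An unterminated comment would not be matched by the comment alternative, so text after a stray "<!---" would still be searched.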
#6
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Posts
    3
    Rep Power
    0
    Well, thanks for your efforts. So it seems that there is no way to skip text within a search.

    I have now solved it in two steps, as suggested by Jaques1. First I delete everything matching "<!---.*?--->" from the text. In the second step I search for "<data>".

    DOM and the other possibilities are too complex (and therefore too time-consuming) for just checking whether "<data>" is inside a text or not.
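
    A minimal sketch of that two-step approach, written in Python purely for illustration (the thread itself is about ColdFusion source files, and the function name contains_outside_comments is hypothetical). It assumes comments are not nested and that the dot must be allowed to match line breaks so that multi-line comments are removed.

```python
import re

# Step 1 pattern: a whole comment, non-greedy, spanning line breaks.
COMMENT_RE = re.compile(r"<!---.*?--->", re.DOTALL)


def contains_outside_comments(text, needle):
    """Return True if `needle` occurs in `text` once comments are removed."""
    # Step 1: delete every comment from the text.
    stripped = COMMENT_RE.sub("", text)
    # Step 2: plain substring search in what is left.
    return needle in stripped


with_hits = (
    "<data> hits first here.\n"
    "<!---\n"
    "this is a comment about my <data>.\n"
    "--->\n"
    "And here comes hit no. 2 for <data>\n"
)
only_in_comment = "<!---\nthis is a comment about my <data>.\n--->\n"

print(contains_outside_comments(with_hits, "<data>"))        # True
print(contains_outside_comments(only_in_comment, "<data>"))  # False
```

    One caveat of deleting comments outright: the text on both sides of a removed comment is joined, so a search string could in principle appear only because a comment was cut out of its middle.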
#7
    Did you steal it?
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2007
    Location
    Washington, USA
    Posts
    14,066
    Rep Power
    9398
    Originally Posted by Torsten79
    DOM and other possibilities are too complex
    Huh. Then you must be using some language that doesn't feature any kind of DOM support whatsoever. Congratulations on being an exception to the rule.
#8
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2012
    Location
    spaceBAR Central
    Posts
    229
    Rep Power
    42
    If I understand you correctly, you can do it with sed:
    Code:
    $ cat t
    <data> hits first here.
    <!---
    this is a comment about my <data>.
    but the <data> should not be found
    --->
    And here comes hit no. 2 for <data>


    Print lines with search data ignoring comment blocks:
    Code:
    $ sed -n -e '/^<\!---/,/^--->/d' -e '/<data>/p' t
    <data> hits first here.
    And here comes hit no. 2 for <data>


    Print lines containing the search text, together with their line numbers, ignoring comment blocks:
    Code:
    $ sed -n -e '/^<\!---/,/^--->/d' -e '/<data>/{=;p;}' t
    1
    <data> hits first here.
    6
    And here comes hit no. 2 for <data>


    And a couple of examples that print the line number on the same line as the matching text:
    Code:
    $ sed = t | sed 'N;s/\n/\t/' | sed -n -e '/^[0-9]\{1,\}\t<\!---/,/^[0-9]\{1,\}\t--->/d' -e '/<data>/p'
    1       <data> hits first here.
    6       And here comes hit no. 2 for <data>
    
    $ sed = t | sed 'N;s/\n/ - /' | sed -n -e '/^[0-9]\{1,\} - <\!---/,/^[0-9]\{1,\} - --->/d' -e '/<data>/p'
    1 - <data> hits first here.
    6 - And here comes hit no. 2 for <data>
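
    The sed commands above work line by line and rely on the comment markers sitting at the start of a line. For comparison, here is a rough Python sketch that reports the same line numbers but also copes with comments that begin or end in the middle of a line. It is an illustration only: the function name hits_with_line_numbers is made up, and nested comments are not handled.

```python
import re

COMMENT_RE = re.compile(r"<!---.*?--->", re.DOTALL)


def hits_with_line_numbers(text, needle):
    """Return (line_number, line) pairs for lines containing `needle` outside comments."""
    # Replace each comment by just its line breaks, so the masked text has the
    # same number of lines as the original and line numbers stay valid.
    masked = COMMENT_RE.sub(lambda m: "\n" * m.group(0).count("\n"), text)
    pairs = zip(text.split("\n"), masked.split("\n"))
    return [
        (number, original)
        for number, (original, visible) in enumerate(pairs, start=1)
        if needle in visible
    ]


t = (
    "<data> hits first here.\n"
    "<!---\n"
    "this is a comment about my <data>.\n"
    "but the <data> should not be found\n"
    "--->\n"
    "And here comes hit no. 2 for <data>\n"
)

for number, line in hits_with_line_numbers(t, "<data>"):
    print(f"{number}\t{line}")
```

    The search runs on the masked copy, while the original line is what gets reported, so a hit on a line that also contains an inline comment is still shown in full.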
