#1

    Printing the part of a string between 2 characters


    Hello,
    I have a line as follows: <text top="276" left="57" width="33" height="16" font="5">Low</text>. I want to get only the data between > and <, i.e. in this case, the word Low.

    Quite a few methods come to mind, but they all amount to hand-written parsing. I would rather use a Perl feature that makes this kind of task easier than parsing character by character as in C.

    (I did follow http://stackoverflow.com/questions/1212799/how-do-i-extract-lines-between-two-line-delimiters-in-perl to print lines between 2 delimiters, but what I want is part of string between 2 characters and in doing so, a method to do similar tasks in future)

    Please advise.
    Thanks.
#2
    The best way to parse HTML is certainly to use a specialized parsing tool such as HTML::Parser.

    For a very simple case like this, a regex such as:

    Perl Code:
    my $value = $1 if $line =~ />([^>]+)</;


    would work, but I would not recommend this approach for anything more complicated.

    EDIT: there is a typo above, it should be:

    Perl Code:
    my $value = $1 if $line =~ />([^<]+)</;


    Thanks to Noobie1000 for pointing out this error.
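
    A side note on the idiom itself: perlsyn documents the behaviour of a "my" declaration combined with a statement modifier (my $x = ... if ...) as undefined, so a form without the modifier may be safer. The sketch below relies on the fact that a match in list context returns its captures. The sample line is the one from the first post; the variable names are only illustrative.

    ```perl
    use strict;
    use warnings;

    my $line = '<text top="276" left="57" width="33" height="16" font="5">Low</text>';

    # In list context, a successful match returns the captured groups,
    # so $value receives the text of the first (and only) capture.
    my ($value) = $line =~ />([^<]+)</;

    # On a failed match the returned list is empty and $value stays undefined.
    print defined($value) ? "value == [$value]\n" : "no match\n";
    ```

    With this form there is no need to read $1 at all, and a failed match is easy to detect by testing whether $value is defined.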
    Last edited by Laurent_R; June 9th, 2013 at 10:15 AM.
#3
    Originally Posted by Laurent_R
    The best way to parse HTML is certainly to use a specialized parsing tool such as HTML::Parser .

    For a very simple case like this, a regex such as:

    Perl Code:
    my $value = $1 if $line =~ />([^>]+)</;


    would work, but I would not recommend this approach for anything more complicated.
    Laurent_R, that worked perfectly. However, I have been trying to understand what it means. These are some of the things I found out.

    1. my $value = $1 : \1 is a regex pattern that means "match what was captured by the first set of capturing parens."
    2. =~ : used with patterns, as in "if the pattern matches".
    3. />([^>]+)</ : I understand the /> and </, but don't understand the part in between.

    I would like to get better at pattern matching. Should I be reading up on regular expressions? If so, would you suggest a particular resource, or would general googling do?

    Thanks.
#4
    First, the regex should be as below, with "<" inside the character class []. I'm sure that was a typo.

    Code:
    />([^<]+)</
    You're correct about the pattern inside the parentheses. The string matched by the pattern inside the first set of () is available in $1.

    [^<] => Square brackets [] denote a character class. The class as a whole matches one single character, which must be one of the characters listed inside the brackets. E.g., to match "perl" or "Perl", you could use this RE: "[pP]erl".

    A caret ^ at the beginning of a character class negates it. So, [^<] will match any single character that is not a "<". The + symbol is a quantifier which tells the regex engine to match one or more occurrences of the preceding item, here the character class.

    In summary, [^<]+ means match one or more occurrences of any character that is not a "<".

    IMHO, this is a somewhat old-fashioned way of getting the effect of a non-greedy match. But it works on almost all regex engines, including those that don't have specific non-greedy quantifiers like Perl has.

    A question mark following a quantifier such as "*" or "+" makes it non-greedy. So, the perl-ish way of writing this regex would be:
    Code:
    />(.+?)</
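
    To make the pieces explained above easier to see, here is a small sketch that writes both patterns with the /x modifier, which allows whitespace and comments inside the regex. The sample line is the one from the first post; the variable names are only illustrative.

    ```perl
    use strict;
    use warnings;

    my $line = '<text top="276" left="57" width="33" height="16" font="5">Low</text>';

    # Negated character class version.
    my ($by_class) = $line =~ m{
        >          # a literal '>'
        ( [^<]+ )  # capture one or more characters that are not '<'
        <          # a literal '<'
    }x;

    # Non-greedy quantifier version.
    my ($by_lazy) = $line =~ m{
        >          # a literal '>'
        ( .+? )    # capture one or more characters, as few as possible
        <          # a literal '<'
    }x;

    print "class: [$by_class]\n";
    print "lazy:  [$by_lazy]\n";
    ```

    On this input both variables end up holding the word Low.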
#5
    Yes, [^<]+ is a negated character class and means: match as many characters other than < as possible. And the enclosing parens say to capture those characters and store them in $1.

    And true, I could also have used a non-greedy quantifier such as '+?'. Which one you prefer is really a matter of personal taste. In such a case, I tend to prefer a negated character class because I feel it makes the intent slightly more explicit.

    Thanks again for pointing out my typo. I am sort of dyslexic with angle brackets (< and >) and get them wrong about half the time.
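
    The two forms agree on the inputs in this thread, but they are not interchangeable on every input. The sketch below uses a made-up string with two adjacent tags (an assumption, not data from the thread) to show where they diverge: the dot is allowed to match a '<', while the negated class is not.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: two opening tags with no text between them.
    my $nested = '<a><b>x</b></a>';

    # '.' may match '<', so the match starts right after the first '>'.
    my ($lazy) = $nested =~ />(.+?)</;

    # [^<] cannot match '<', so the engine moves on to the second '>'.
    my ($class) = $nested =~ />([^<]+)</;

    print "lazy:  [$lazy]\n";   # prints: lazy:  [<b>x]
    print "class: [$class]\n";  # prints: class: [x]
    ```

    This is one way to read the remark that the negated class makes the intent more explicit: it states directly that the captured text must not contain a '<'.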
#6
    Originally Posted by noobie1000
    Code:
    />([^<]+)</
    In summary, [^<]+ means match one or more occurrences of any character that is not a "<".
    Thanks a lot noobie1000 for the explanation, but one final question: isn't />.......</ alone enough to say "take everything between these 2 characters"? Do we still have to specify [^<]+, which means everything except <? Sorry to drag this thread out, but I shall close it after the reply to this.
    Also, have a look at my comment below to Laurent_R.

    Thanks Laurent_R for the explanation.
    Originally Posted by Laurent_R
    Thanks again for pointing to my typo, I am sort of dyslexic with angle brackets (< and >) and get them wrong about half of the times.
    OK, now this is really weird: I'm getting the correct output with both regexes (typo and no typo). Could you please check with this example? <text top="275" left="763" width="30" height="17" font="0">249</text>

    Please try out this script where I have included your code.
    Code:
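    #!/usr/local/bin/perl
    use strict;
    use warnings;

    # Three-argument open with a lexical filehandle; report the OS error on failure
    open(my $fh, '<', 'data.txt') or die("could not open file data.txt: $!");
    my $line = <$fh>;
    chomp($line);

    print "line read == [$line]\n";
    # Regex with the typo ([^>] instead of [^<]), kept on purpose for the test
    my $value = $1 if $line =~ />([^>]+)</;
    print "value == [$value]";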
#7
    The regex will match the first '>' right angle bracket, then everything up to but excluding another '>' right angle bracket.

    Then it looks for a '<' left angle bracket but can't find one at that position, so it backtracks one character at a time until it finds the '<'. If it backtracked all the way to the first '>' that it matched without finding one, that match attempt would fail.

    Since it was able to match the '<' bracket and those brackets are not within the capturing parens, you end up with the desired text in $1.
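
    The backtracking described above can be observed with a small sketch. It runs the typo pattern twice on the sample line from the previous post: once without the trailing '<', to show how far [^>]+ runs on its own, and once with it, to show what is left after backtracking. The variable names are only illustrative.

    ```perl
    use strict;
    use warnings;

    my $line = '<text top="275" left="763" width="30" height="17" font="0">249</text>';

    # Without a trailing '<', the greedy class keeps everything up to the next '>'.
    my ($before_backtrack) = $line =~ />([^>]+)/;

    # With the trailing '<', the engine gives characters back until a '<' follows.
    my ($after_backtrack) = $line =~ />([^>]+)</;

    print "before: [$before_backtrack]\n";  # prints: before: [249</text]
    print "after:  [$after_backtrack]\n";   # prints: after:  [249]
    ```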
#8
    Originally Posted by IAMTubby
    one final question: isn't />.......</ alone enough to say "take everything between these 2 characters"? Do we still have to specify [^<]+, which means everything except <? Sorry to drag this thread out, but I shall close it after the reply to this.
    If your input string is:

    Code:
    <text top="276" left="57" width="33" height="16" font="5">Low</text><text top="275" left="56" width="32" height="17" font="6">High</text>
    then />.+</ would match much more than you want. It would in fact match this:

    Code:
    >Low</text><text top="275" left="56" width="32" height="17" font="6">High<
    Using the negated character class [^<]+ prevents the regex from matching too far. The alternative, as pointed out by Noobie, is to use a non-greedy quantifier, which matches as little as possible, whereas the usual (greedy) quantifier matches as much as it can.
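
    As a rough sketch of the comparison above, the following runs the greedy, non-greedy and negated-class patterns on the same two-tag string. It also shows one possible way to collect every value at once with the /g modifier in list context. The variable names are only illustrative.

    ```perl
    use strict;
    use warnings;

    my $line = '<text top="276" left="57" width="33" height="16" font="5">Low</text>'
             . '<text top="275" left="56" width="32" height="17" font="6">High</text>';

    my ($greedy) = $line =~ />(.+)</;      # runs to the last '<' in the string
    my ($lazy)   = $line =~ />(.+?)</;     # stops at the first '<' it can
    my ($class)  = $line =~ />([^<]+)</;   # cannot cross a '<' at all

    # With /g in list context, every capture from every match is returned.
    my @all = $line =~ />([^<]+)</g;

    print "greedy: [$greedy]\n";
    print "lazy:   [$lazy]\n";
    print "class:  [$class]\n";
    print "all:    [@all]\n";
    ```

    Here the greedy capture spans both tags, while the other two single-match forms capture only Low, and the /g form returns Low and High.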
#9
    Originally Posted by IAMTubby

    OK, now this is really weird: I'm getting the correct output with both regexes (typo and no typo). Could you please check with this example? <text top="275" left="763" width="30" height="17" font="0">249</text>

    Please try out this script where I have included your code.
    Code:
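    #!/usr/local/bin/perl
    use strict;
    use warnings;

    # Three-argument open with a lexical filehandle; report the OS error on failure
    open(my $fh, '<', 'data.txt') or die("could not open file data.txt: $!");
    my $line = <$fh>;
    chomp($line);

    print "line read == [$line]\n";
    # Regex with the typo ([^>] instead of [^<]), kept on purpose for the test
    my $value = $1 if $line =~ />([^>]+)</;
    print "value == [$value]";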
    It works in this case, but might not work on a longer string where other tags follow the one in your example.

    See my answer in my previous post.
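
    One input where the two patterns give different results is text that itself contains a '>'. The string below is a made-up example (an assumption, not data from the thread), shown only as a sketch of how the typo version can go wrong.

    ```perl
    use strict;
    use warnings;

    # Hypothetical input: the text between the tags contains a '>'.
    my $line = '<td>a > b</td>';

    # Typo version: the capture may not contain '>', so the match can only
    # succeed starting from the '>' inside the text.
    my ($typo) = $line =~ />([^>]+)</;

    # Corrected version: the capture may not contain '<', so it takes the
    # whole text between the tags.
    my ($fixed) = $line =~ />([^<]+)</;

    print "typo:  [$typo]\n";   # prints: typo:  [ b]
    print "fixed: [$fixed]\n";  # prints: fixed: [a > b]
    ```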
