#1
    Hi,
    I'm trying to strip specific tags from an HTML file. The tag to be "killed" is passed as the second argument, $ARGV[1], to the script.

    The following line is used to detect all tags containing $ARGV[1]:

    $_=~s/<\/?(\$ARGV[1])[^>]*>//gs;

    < # matching starts with <
    \/? # zero or one /
    (\$ARGV[1]) # the tag
    [^>]* # zero or more characters, but no >
    > # and the closing >

    and the whole match should be replaced by nothing (the empty // part), i.e. simply wiped out.

    But apparently there's a bug (at least one) in it, because the existing file is simply duplicated, i.e. the regular expression doesn't replace anything.
    I ran several tests that evaluated the regexp first, but this doesn't seem to help in any way...
    I think the problem is the syntax for incorporating variables into regular expressions. I thought I had to write \$ARGV[1] rather than $ARGV[1] to prevent the $ from being read as "end of string", but...
    Does anybody know a solution to this? Thanks in advance.

#2
    freebsd
    Guest
    >>the regular expression doesn't replace anything.

    If you want to overwrite that file, you need to read it into an array, reopen the file for writing (which wipes everything out), and then print the array back to the file.
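
A minimal sketch of that read, wipe, print-back sequence. The sub name `strip_tag_in_file` and its arguments are made up for illustration; the `\Q...\E`, `\b` and `/i` in the pattern are extra precautions that nobody in this thread asked for.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from this thread): remove every
# <tag ...> and </tag> from a file and write the result back in place.
sub strip_tag_in_file {
    my ($path, $tag) = @_;

    # 1. Put the file into an array, one element per line.
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my @lines = <$in>;
    close($in);

    # 2. Apply the substitution to every line.
    #    Caveat: working line by line misses tags that span several lines.
    s{</?\Q$tag\E\b[^>]*>}{}gi for @lines;

    # 3. Reopen for writing ('>' wipes the old contents), then print the array back.
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} @lines;
    close($out) or die "Cannot close $path: $!";

    return scalar @lines;   # number of lines written
}
```

Assuming the file name arrives as the first argument (a guess, the original post only mentions $ARGV[1]), the call would look like strip_tag_in_file($ARGV[0], $ARGV[1]).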
#3
    Your regexp is wrong. You escaped the $ on the variable, but you want the variable to be interpolated, so the $ must not be escaped. Also, when writing patterns with /'s in them, it's best to choose a different delimiter for the regexp, since it makes things a little less cluttered. Try this regexp:
    [code]
    $_ =~ s!</?$ARGV[1][^>]*>!!gm;
    [/code]
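
To make the difference concrete, here is a small sketch that runs the escaped and the interpolated pattern against the same string. `$tag` and `$html` are stand-ins invented for the example; `\Q...\E` and `\b` are optional hardening that goes beyond the one-liner above.

```perl
use strict;
use warnings;

my $tag  = 'font';   # stands in for $ARGV[1]
my $html = '<p><font size="1">small</font> text</p>';

# Escaped: \$ is a literal dollar sign, so the pattern looks for the
# text "$tag" itself, which never occurs in the HTML. Nothing is replaced.
(my $escaped = $html) =~ s!</?\$tag[^>]*>!!g;

# Interpolated: the value of $tag is put into the pattern before matching.
# \Q...\E quotes any regex metacharacters in the value, and \b keeps a
# tag name like "b" from also matching <body> or <br>.
(my $interpolated = $html) =~ s!</?\Q$tag\E\b[^>]*>!!g;

print "escaped:      $escaped\n";
print "interpolated: $interpolated\n";
```

The first line printed is the unchanged input, which is exactly the "file is simply duplicated" symptom from the question.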

    You'll also want to make sure you're slurping the entire file into $_, as a tag could break over multiple lines, and the regexp wouldn't match it if you read the file line by line. To slurp the file, do something like:

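    [code]
    {
    local $/;           # save the current $/; it is restored when the block ends
    undef $/;           # no record separator (local already left it undefined, so this is just explicit)
    $_ = <FILEHANDLE>;  # with $/ undefined, one read returns the whole file
    }
    [/code]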

    The { and } are important(ish): they make the change to $/ local to that code block. That way your old $/ (probably the default of \n, unless you've been playing) will be restored.
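
Here's a self-contained sketch of the slurp-then-substitute idea on a tag that really is split across two lines. The sample string and the variable names are invented for the example, and an in-memory filehandle stands in for the real file so the snippet runs without touching disk.

```perl
use strict;
use warnings;

# Hypothetical input: the opening tag is broken across two lines.
my $source = "<p><font\n  size=\"1\">small</font> text</p>\n";
my $tag    = 'font';   # stands in for $ARGV[1]

# An in-memory filehandle stands in for the real file here.
open(my $fh, '<', \$source) or die "Cannot open in-memory file: $!";

my $content = do {
    local $/;   # $/ is undef until the end of this do block
    <$fh>;      # so this single read returns the whole file
};
close($fh);

# After the block, $/ is back to its previous value.
my $restored = defined $/ && $/ eq "\n";

# [^>]* also matches newlines, so the split tag is removed in one go.
$content =~ s{</?\Q$tag\E\b[^>]*>}{}g;
print $content;
```

Using a lexical `$content` instead of $_ is only a matter of taste; the block scoping of `local $/` works the same way either way.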

    Oh, also note that I changed the s (single-line) switch on your regexp to an m (multiline). Neither switch changes what this particular pattern matches, since it contains no ., ^ or $ and [^>]* already matches newlines. But with m, any $'s and ^'s you stick in at some later date will match at the end and start of each line, which is probably more how you expect them to behave.
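
For reference, a short sketch of what the two switches actually change, using a made-up two-line string. It only exercises standard behaviour described in perlre.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ only matches at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after each newline inside the string.
my @multi = $text =~ /^(\w+)/mg;

# /s only affects the dot: with it, . also matches a newline.
my $dot_plain  = ($text =~ /first.*second/)  ? 1 : 0;
my $dot_single = ($text =~ /first.*second/s) ? 1 : 0;

# A negated character class matches newlines with or without either switch.
my $class = ("<a\nhref='x'>" =~ /<a[^>]*>/) ? 1 : 0;

print "plain: @plain\n";
print "multi: @multi\n";
print "dot without /s: $dot_plain, with /s: $dot_single, class: $class\n";
```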
