#16
  1. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Well then, you probably need to replace this line of code

    Perl Code:
    $bodyIs =~ s/$_/$hashWasIs{$_}/g foreach keys %hashWasIs;


    by something like this:

    Perl Code:
    foreach (keys %hashWasIs) {
         my $nr_matches = () = $bodyIs =~ /$_/g;
         next unless $nr_matches;
         $subs_count += $nr_matches; 
         $bodyIs =~ s/$_/$hashWasIs{$_}/g;
    }


    The first line uses the somewhat cryptic so-called "goatse" operator to count the number of matches of /$_/ in the line. The second one goes to the next key of the hash if no match we found. The third line increments the $subs_count global counter by the number of matches and the fourth line actually does the substitution.

    Actually, well, thinking about it, disregard the above, as it could be done with much less code lines by using a simpler syntax:

    Perl Code:
    $subs_count += $bodyIs =~ s/$_/$hashWasIs{$_}/g foreach keys %hashWasIs;


    which is based on the fact that the s///g operator returns the number of substitutions made (and false if no replacement was made). So the code change is finally very small.
  2. #17
  3. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Yes! That works perfectly. And I get both metrics with
    Code:
    $intSubstitutionsMade += $bodyIs =~ s/$_/$hashWasIs{$_}/g and $intLinesChanged++ foreach keys %hashWasIs

    These do what I want, but I don't understand why. Why does regex (/^(.*?)$/) grab "body 1.txt" from "C:/temp/topfolder/body 1.txt"? How does it know to ignore the pathname?

    my ($Filename) = (/^(.*?)$/);
    my ($FilenameWithoutExtension) = (/^(.*)\./);
    my ($FilenameExtension) = (/([^\.]*)$/);
    print $FilenamePath."\n"; # C:/temp/topfolder/body 1.txt
    print $Filename."\n"; # body 1.txt
    print $FilenameWithoutExtension."\n"; # body 1
    print $FilenameExtension."\n"; #txt
  4. #18
  5. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Why does regex (/^(.*?)$/) grab "body 1.txt" from "C:/temp/topfolder/body 1.txt"?
    If you execute that code, you'll see that it doesn't grab just the filename, it grabs the entire string.

    You could adjust the regex to something like this:
    Code:
    /.*\/(.*?)$/
    However, it would be much better to use the File::Basename module.
    Code:
    #!/usr/bin/perl
    
    use 5.10.1;
    use strict;
    use warnings;
    use File::Basename;
    use Data::Dumper;
    
    my $file_path = "C:/temp/topfolder/body 1.txt";
    my ($filename, $path, $ext) = fileparse($file_path, qr/\.[^.]*/);
    
    say Dumper ($filename, $path, $ext);
    say $file_path; # C:/temp/topfolder/body 1.txt
    say $path; # C:/temp/topfolder/
    say $filename; # body 1
    say $ext; # .txt
    Outputs:
    $VAR1 = 'body 1';
    $VAR2 = 'C:/temp/topfolder/';
    $VAR3 = '.txt';

    C:/temp/topfolder/body 1.txt
    C:/temp/topfolder/
    body 1
    .txt
  6. #19
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Originally Posted by gary84
    Why does regex (/^(.*?)$/) grab "body 1.txt" from "C:/temp/topfolder/body 1.txt"? How does it know to ignore the pathname?
    It does not.

    The following regex: "/^(.*?)$/" matches the whole file name with the path (although the question mark is useless here). It actually matches almost any one-line string.

    If you want to match the file name without the path, you could try this:

    Perl Code:
     /([^\/]+)$/


    which will match all characters other than / until the end of the string.

    But it is probably better to use a specific Perl module to do this, such as:

    Perl Code:
    use File::Basename;
  8. #20
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    The reason I thought it did was because it seemed to be doing that when this runs. I must be wrong because I don't understand what my ($strFilename) = (/^(.*?)$/) is really doing. And why ($strFilename) is in parens.

    PHP Code:
    #!/usr/bin/perl -w
    use strict;
    use 
    warnings;
    use 
    File::Find;
    my $strTopFolder "C:/temp/topfolder";
    find(\&DoSubstitutions2,$strTopFolder);
    sub DoSubstitutions2 {
        
    my $strFilenamePath $File::Find::name;
        if(-
    f $strFilenamePath) {
            
    my ($strFilename) = (/^(.*?)$/);
            
    my ($strFilenameWithoutExtension) = (/^(.*)\./);
            
    my ($strFilenameExtension) = (/([^\.]*)$/);
            print 
    $strFilenamePath."\n";
            print 
    $strFilename."\n";
            print 
    $strFilenameWithoutExtension."\n";
            print 
    $strFilenameExtension."\n";
        }


    Code:
    C:/temp/topfolder/body 1.txt
    body 1.txt
    body 1
    txt
    Yes, use File::Basename is more straightforward, I'm switching to using it instead.
  10. #21
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Putting ($strFilename) in parens forces a list context and, in a list context, the regular expression returns the list of matches.

    With:

    Perl Code:
    ($strFilename) =~ (/^(.*?)$/);


    the list on the left side captures the list of matches on the right side. In this specific case, the list on the left side has only one element so it captures only the first of the matches on the right side, but it still works because there only one match on the right side.

    In a scalar context (i.e. no parens):

    Perl Code:
    $strFilename =~ (/^(.*?)$/);


    the scalar variable on the left side will capture the number of matches (i.e. 1 in this case).

    But it really seems that you don't really know what is happening in your program and that it sort of works per chance.

    First, the normal operator for regexp matching is '=~', not '='.

    Second, when you do this:

    Code:
    my ($strFilename) = (/^(.*?)$/);
    you are doing a match against the $_ special variable, which happens to contain at this point the file name (body 1.txt), because this happens to be the way the Find::File module works, so that your code line could just be:

    Perl Code:
    my $strFilename = $_;


    with the same result. And your regex basically does not do anything useful, it just captures everything in $_. If $_ were to contain "C:/temp/topfolder/body 1.txt", then $strFilename would end up containing "C:/temp/topfolder/body 1.txt".

    You should probably make a series of tests without using the Find::File module and with a hadcoding of the file-and-path to better understand what the regexp are doing.

    For example, try something like this:

    Perl Code:
    #!/usr/bin/perl -w
    use strict;
    use warnings;
     
    my $file_and_path = "C:/temp/topfolder/body 1.txt";
    if(-e $file_and_path) {
            my ($strFilename) = (/^(.*?)$/);
            # this will not work, as $_ is not set, but you can try various things here, such as:
            my ($strFilename) = $file_and_path =~ /^(.*?)$/;
            my ($strFilename2) = $file_and_path =~ /^(.*)$/;
            my ($strFilename3) = $file_and_path =~ /([^//]+)$/;
            my $strFilename4_without_parens = $file_and_path =~ /([^/\]+)$/;
            # etc, make all kinds of tries until you really understand what's going on and what the regexps do
            print ...
    }


    I am not saying that you should use the above for your actual program, I am just suggesting this as an exercise for you to get a better understanding.

    Comments on this post

    • Axweildr agrees
  12. #22
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Thank you for explaining. I will do those tests. The $_ special variable is a very interesting feature of perl.
  14. #23
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    The next feature I want to incorporate is to write data (substitution change logs and substitution metrics) on-the-fly to an Access Database table.
    Demonstrations 1 and 2 work.
    Demonstration 3 throws an error,

    DBD::ODBC::st execute failed: [Microsoft][ODBC Microsoft Access Driver] Too few parameters. Expected 1. (SQL-07002) at db.pl line 26.
    SQL problem at db.pl line 26.

    Can you suggest a fix.

    Update: This works:
    $strSQL="INSERT INTO Table2 (Field1) SELECT 'Gary123'";
    I would rather use quotes than apostrophes, but I guess as long as it works I'll adjust.


    PHP Code:
    #!/usr/bin/perl -w
    use strict;
    use 
    warnings;
    use 
    DBI;

    # Input data from Microsoft Access Database Table
    my $dbFileSource="c:/temp/db1.mdb";
        
    # this mdb file has a table named table1,
        # a field named Field1, whose type is dbText
    my $dbh DBI->connect('dbi:ODBC:driver=microsoft access driver (*.mdb);dbq='.$dbFileSource);

    my $strSQL="";
    my $sth "";

    # demonstration 1: read data from Table
    $strSQL="SELECT * FROM Table1";
    $sth $dbh->prepare($strSQL);
    $sth->execute || die "SQL problem";
    my @row;
    while (@
    row=$sth->fetchrow_array()){ 
       print 
    "@row\n";
        }

    # demonstration 2: write a number to Table's Field
    $strSQL="INSERT INTO Table1 (Field1) SELECT 9";
    $sth $dbh->prepare($strSQL);
    $sth->execute || die "SQL problem";

    # demonstration 3: write a string to Table's Field
    $strSQL="INSERT INTO Table1 (Field1) SELECT \"Gary1\"";
    $sth $dbh->prepare($strSQL);
    $sth->execute || die "SQL problem"
  16. #24
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    I wrote a script to extract a term from between two keywords. This works. Could you recommend a more elegant way. I don't want to use any built-in XML libraries or tools, I want to have total control over every aspect of identifying the delimiters and extracting the string from between them.

    Input File:
    Code:
    <SI>
    <SSID>ACA_004872_000144</SSID>
    <SSID>ACA_004872_000145</SSID>
    </SI>
    Output:
    Code:
    ACA_004872_000144
    ACA_004872_000145



    PHP Code:
    #!/usr/bin/perl -w
    use strict;
    use 
    warnings;
    use 
    DBI;

    my $strSourceFile;
    $strSourceFile="c:/temp/myfile.xml";

    local $/=undef;
    open FILE$strSourceFile or die "file read problem $strSourceFile $!\n";
    my $strWhole= <FILE>;
    close FILE;

    my $strTarget "";
    my $intPointer 1;
    my $strPreceding"";
    my $boolDone 0;

    until ($boolDone) {
        
    $strTarget="<SSID>";
        
    $boolDone = &MovePointerPastTarget($strWhole$strTarget$intPointer);

        
    $strTarget="</SSID>";
        
    $boolDone = &MovePointerPastTarget($strWhole$strTarget$intPointer1$strPreceding);

        print 
    $strPreceding."\n";
    }

    #---------------------------------------------------
    sub MovePointerPastTarget {
        
    my($strWhole1$strTarget1$intPointer1$boolReturnResult1) = @_;
        
    my $strFoundTerm "";
        
    my $strResult "";
        
    my $intPointerMax length($strWhole1);

        
    $intPointer1 ++;
        
    until (($strFoundTerm eq $strTarget1) or ($intPointer1 $intPointerMax)) {
            
    $strFoundTerm .= substr $strWhole1$intPointer11;
            
    $intPointer1 ++;
            if (
    length($strFoundTerm) > length($strTarget1)) {
                if (
    $boolReturnResult1) {
                    
    $strResult.= substr $strFoundTerm, -(length($strFoundTerm)),1;
                }
                
    $strFoundTerm =  substr $strFoundTerm1;
            }
        }
        
    $intPointer1 --;
        
    $_[2] = $intPointer1;
        
    $_[4] = $strResult;
        if (
    $strFoundTerm eq $strTarget1) {
            return 
    0;
        }
        else {
            return 
    1;
        }
    }
    #--------------------------------------------------- 
  18. #25
  19. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Code:
    #!/usr/bin/perl
    
    use 5.10.1;
    use strict;
    use warnings;
    
    my $source_file = 'c:/temp/myfile.xml';
    open my $fh, '<', $source_file or die "failed to open '$source_file' $!";
    
    my $xml_data = do {local $/=undef; <$fh>;};
    say $1 while $xml_data =~ /<SSID>(.+?)<\/SSID>/g;
  20. #26
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Thanks FishMonger. Yeah, I figured there was a 2-line way to do it in perl.
    (1) I have 30,000 items to input and write to an access MDB file. This script ran in 10 seconds on my 2.4GHz Core2Quad. Is it possible to speed it up, perhaps by doing my SQL Action calls more efficiently.
    (2) Your approach works MUCH faster than mine, but it doesn't retain intPointer's value as it scans through the. Maybe I don't really need to know it. Hm. I have to think about that.


    PHP Code:
    #!/usr/bin/perl -w
    use strict;
    use 
    warnings;
    use 
    DBI;
    use 
    5.10.1;
    use 
    Time::HiRes qw(gettimeofday);


    #---------------------------------------------------
    my $source_file='c:\temp\signal.xml';
    open my $fh'<'$source_file or die "failed to open '$source_file' $!";

    my $xml_data = do {local $/=undef; <$fh>;};

    my $TimeStart Time::HiRes::time();

    while (
    $xml_data =~ /<SSID>(.+?)<\/SSID>/g) {
        
    #print $1."\n";
        
    $strSQL="INSERT INTO Table2 (Field1) SELECT '$1'";
        
    $sth $dbh->prepare($strSQL);
        
    $sth->execute || die "SQL problem";
    };

    my $TimeStop Time::HiRes::time();
    my $duration $TimeStop-$TimeStart;
    my $TimeDuration substr("00".int($duration/60/60),-2).":".substr("00".int($duration/60),-2).":".substr("00".int($duration%60),-2);
    print 
    " Duration: $TimeDuration\n"
  22. #27
  23. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Once thing you can/should do is put the prepare statement prior to the loop.

    Code:
    my $strSQL = 'INSERT INTO Table2 (Field1) SELECT ?';
    my $sth = $dbh->prepare($strSQL);
    
    while ($xml_data =~ /<SSID>(.+?)<\/SSID>/g) {
        $sth->execute($1) || die "SQL problem";
    };
  24. #28
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    That made it go twice as fast. 5 seconds, formerly 10. Outstanding.
    OK, now, I want to put my two keywords into an array or a hash to streamline the script. I would like the keywords (SSID, NOMENCLATURE) to appear as few times as possible in the script, ideally just once. Each one appears just once inside <SIGNAL>, so I need the WHILE for <SIGNAL> but I don't need the WHILE for the other keywords. There are 10 more keywords inside <SIGNAL> I'll add once the array feature exists. How do I rewrite this.

    FishMonger, thank you very much for your help, I appreciate it. This approach is working out very well, was exactly what I had in mind, easily adaptable, I love it.

    PHP Code:
    #!/usr/bin/perl -w
    use strict;
    use 
    warnings;
    use 
    DBI;
    use 
    5.10.1;
    use 
    Time::HiRes qw(gettimeofday);

    #---------------------------------------------------
    my $source_file 'c:\temp\signal.xml';
    my $dbFile="c:/temp/db1.mdb";
    my $dbh DBI->connect('dbi:ODBC:driver=microsoft access driver (*.mdb);dbq='.$dbFile);
    open my $fh'<'$source_file or die "failed to open '$source_file' $!";
    my $strWhole1 = do {local $/=undef; <$fh>;};

    my $TimeStart Time::HiRes::time();

    my $strSQL 'INSERT INTO Table2 (SSID,NOMENCLATURE) SELECT ?,?';
    my $sth $dbh->prepare($strSQL);

    my $Target1 "<SIGNAL>"my $Target2 "</SIGNAL>";
    while (
    $strWhole1 =~ /$Target1(.+?)$Target2/gs) {
        
    my $strWhole2= $1;

        
    my $Keyword 'SSID';
        
    my $str_SSID = &ExtractBetweenKeywords($strWhole2$Keyword);

        
    $Keyword 'NOMENCLATURE';
        
    my $str_NOMENCLATURE = &ExtractBetweenKeywords($strWhole2$Keyword);

        
    $sth->execute($str_SSID,$str_NOMENCLATURE) || die "SQL problem";
    };

    my $TimeStop Time::HiRes::time();
    my $duration $TimeStop-$TimeStart;
    my $TimeDuration substr("00".int($duration/60/60),-2).":".substr("00".int($duration/60),-2).":".substr("00".int($duration%60),-2);
    print 
    " Duration: $TimeDuration\n";
    #---------------------------------------------------
    sub ExtractBetweenKeywords {
        
    my ($strWhole$Keyword) = @_;
        
    my ($strResult) = $strWhole =~ /<$Keyword>(.+?)<\/$Keyword>/g;
        return 
    $strResult;
    }
    #--------------------------------------------------- 
  26. #29
  27. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Before I respond to your latest question, let me try to break you of some poor coding practices before they become ingrained in your scripts.

    First, the least objectionable item.

    The use of the -w switch is unneeded and redundant since you're using the better option of loading the warnings pragma. Remove the -w switch.

    Using Time::HiRes does not make much sense if you're just going to ignore its level of resolution by truncating the fractions and use only the int portion.

    Var names should give an indication of what they hold. Names like "$strWhole1" tell me nothing about the data. A better choice in this case would have been "$xml_data" or something similar.

    DON"T use multi word titlecase var or subroutine names!! They are more difficult to read. Separate the words with an underscore and keep them short and to the point. i.e. no sentences or phrases like is often done in other messy languages.

    DON"T use all UPPERCASE var names unless it is really necessary, which is almost never.

    Parsing xml files like the way you're doing is (to be polite) a really poor approach. You really should be using one of the xml parsers; especially considering your current level of Perl knowledge. If you insist on doing it yourself, then do yourself a favor and look over the source code of the modules to get a better idea on how it should be done.

    Don't use '&' when calling your subs, unless it is needed (such as when passing it as a reference) or you know about about and what its side effects.

    Try to keep your line lengths below 80 chars. In almost all cases, you can breakup a long statement into chunks/lines and concatenate the parts.

    example using one of your statements
    Code:
    my $time_duration = substr("00".int($duration/60/60),-2)
                      . ':'
                      . substr("00".int($duration/60),-2)
                      . ':'
                      . substr("00".int($duration%60),-2);
  28. #30
  29. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,930
    Rep Power
    1225
    Getting back to your last question. You could do something like the following, which I'm not advocating because using an XML parser module found on cpan would be the proper solution.

    This builds a HoA (Hash of Arrays) of the xml data for the desired tags.
    Code:
    # declare the HoA that holds the data
    my %xml_data;
    
    # declare/define an array of desired xml tags
    my @xml_tags = qw(SSID NOMENCLATURE);
    
    # loop over the tags extracting the data and load it into the HoA
    foreach my $xml_tag (@xml_tags) {
        while ($xml_data =~ m~<$xml_tag>(.+?)</$xml_tag>~g) {
            push @{ $xml_data{$xml_tag} }, $1;
        }
    }
    Instead of the HoA, you could use a plain hash, but a more complex data structure would typically be more appropriate.

    You could insert the data into the database inside that while loop instead of building the hash, or build the hash and load the data afterwards.

IMN logo majestic logo threadwatch logo seochat tools logo