    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0

    Global multi-file string substitution


    I need a Perl script that does this: globally perform every string substitution listed in FILE_WAS_IS on every file inside, and recursively downstream of, FOLDER1. I'm open to suggestions on a good structure for FILE_WAS_IS. The code should allow me to customize which file extensions the substitutions are applied to. Who should I go to for this? If someone gives me some basic ideas I can probably figure out how to write it.
  2. #2
  3. 'fie' on me, allege-dly
    Devshed Supreme Being (6500+ posts)

    Join Date
    Mar 2003
    Location
    in da kitchen ...
    Posts
    12,890
    Rep Power
    6444
    File::Find would be a good start; see search.cpan.org
    --Ax
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    You basically have two separate problems:
    - Finding all the files in a folder and recursively in the subfolders with a given extension or list of extensions;
    - For each file, read the file and for each file chunk read, apply the list of substitutions, which implies almost certainly two nested loops (although one could be implicit and not really seen).

    Neither of these appears to be difficult per se; which one is a problem for you?

    As mentioned by Axweildr, the File::Find module is probably a good starting point for the first problem (well it is almost a ready-made solution sitting there on the CPAN shelf).
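
For the first problem, a minimal sketch of what File::Find can do, assuming the extensions are given as a plain list without the dot. The helper name files_with_extensions is hypothetical and not part of the module; find(), $_ and $File::Find::name are the module's own interface.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every regular file under $top whose
# extension (compared case-insensitively) is in the allowed list.
sub files_with_extensions {
    my ($top, @extensions) = @_;
    my %allowed = map { lc($_) => 1 } @extensions;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the bare file name and the
            # current directory is the one that contains it.
            return unless -f $_;                  # skip folders and the like
            my ($ext) = $_ =~ /\.([^.]+)\z/;      # text after the last dot
            return unless defined $ext && $allowed{ lc $ext };
            push @found, $File::Find::name;       # path including $top
        },
        $top,
    );
    my @sorted = sort @found;
    return @sorted;
}
```

Calling files_with_extensions("c:/perl/tdis/filesforperl", qw(txt csv)) would then give the list of files to hand to the substitution step.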
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Yes, File::Find was very helpful, thanks. I did it. It works (I did a lot of Googling). I don't know how pretty this is, but it's the first Perl script I've ever written, and it took 70 seconds to perform the same set of substitutions that took 30 minutes in VBA (modules in Microsoft Access). So I'm pretty happy.
    The WinDiff on MyFilesDoneWithVBA vs. MyFileDoneWithPerl shows tons of differences because of different "blanks". I think that has something to do with carriage returns, linefeeds and whitespace, so those differences shouldn't matter. Hopefully.


    Code:
    #! /usr/bin/perl -w
    use warnings;
    use strict;
    use File::Find;
    use File::Slurp;
    
    #------------------------------------------------------
    # INPUT
    my $WasIsFile = "c:/perl/tdis/was-is.txt";
    my $topFolder = "c:/perl/tdis/filesforperl";
    
    #------------------------------------------------------
    # OUTPUT
    # Same files as above
    
    my $int1=1;
    my $int2=1;
    my $intFilesExamined=0;
    my $intFilesChanged=0;
    my @arrayWasIs;
    
    #------------------------------------------------------
    # Input WAS-IS text file as an array.
    # Required file format (front 5 characters will be truncated during processing):
    #   WAS: {1st_WAS_string}
    #   IS:  {1st_IS_string}
    #   WAS: {2nd_WAS_string}
    #   IS:  {2nd_IS_string}
    #   ... and so on...
    
    open (my $filehandle, "<$WasIsFile") or die "Can't open $WasIsFile: $!";
    while (my $line=<$filehandle>){
    	#chomp $line;
    	$line=substr($line,5);
    	push (@arrayWasIs, $line);}
    close $filehandle;
    
    #------------------------------------------------------
    
    my $start_localtime = localtime;
    find(\&FileSubstitutionLoop,$topFolder);
    
    my $end_localtime = localtime;
    print "Files Examined: $intFilesExamined\n";
    print "Files Changed: $intFilesChanged\n";
    print "Start: $start_localtime\n";
    print "End  : $end_localtime\n";
    
    sub FileSubstitutionLoop {
    	#print $File::Find::name."\n";
    	my $filename = $File::Find::name;
    	if( -f $filename) {
    		#print "filename :$filename \n";
    		$intFilesExamined++;
    		my $body=read_file($filename);
    		my $bodyWas=$body;
    		my $bodyIs=$body;
    		$int1=0;
    		do{
    			if ($int1 % 2)
    				{}
    			else{
    				$bodyIs=~ s/$arrayWasIs[$int1]/$arrayWasIs[$int1+1]/g}
    			$int1++}
    		until $int1==$#arrayWasIs;
    		if (!($bodyWas eq $bodyIs)){
    			$intFilesChanged++;
    			open (FILE, "> $filename") || die "problem writing\n";
    			print FILE "$bodyIs";
    			close(FILE)}}
    	else {
    		# not a file
    		}
    }
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Well done if it is your first Perl script.

    Your code looks pretty clean and you seem to have picked up the right habits straight from the beginning.

    A few remarks, though, which you may or may not want to take into account.

    Rather than using an array for the list of substitutions, and testing on odd and even subscripts, use a hash, with the source being the key and the target the value.

    Declare your variables in the smallest possible lexical scope, otherwise you nullify a significant part of the advantage of using strict and warnings. For example, $int1 should be declared within the subroutine rather than as a global variable (but, of course, you no longer need it if you use a hash).

    The preferred way for opening files is now the three-argument syntax:

    Perl Code:
    open my $filehandle, "<", $WasIsFile or die "blabla $!";


    Perl Code:
        else {
         # not a file
         }


    Don't do that: if you don't need an else clause, just omit it and simply replace the 3 lines above by a simple closing "}".

    Perl Code:
             if ($int1 % 2)
              {}
             else{


    Even more so, don't do that; change your test to this:

    Perl Code:
    unless ($int1 % 2)


    or this:

    Perl Code:
    if (not $int1 % 2)


    So, if you apply the above advice and use the %hashWasIs rather than an array, this code:

    Perl Code:
            do{
             if ($int1 % 2)
              {}
             else{
              $bodyIs=~ s/$arrayWasIs[$int1]/$arrayWasIs[$int1+1]/g}
             $int1++}
            until $int1==$#arrayWasIs;


    could boil down to something like this single line:

    Perl Code:
             $bodyIs =~ s/$_/$hashWasIs{$_}/g foreach keys %hashWasIs;
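
One caveat with that one-liner: each key is interpolated as a regular expression, and keys returns the entries in no particular order. A minimal sketch of a more defensive variant, assuming the WAS strings are meant as literal text rather than patterns. The helper name apply_substitutions and the longest-first ordering are assumptions for illustration, not something the thread requires.

```perl
use strict;
use warnings;

# Hypothetical helper: apply every WAS => IS pair of %$map to $text.
sub apply_substitutions {
    my ($text, $map) = @_;
    # Longest keys first so that a key such as "hairball" is handled
    # before "hair"; ties are broken alphabetically for a stable order.
    for my $was (sort { length($b) <=> length($a) or $a cmp $b } keys %$map) {
        my $is = $map->{$was};
        $text =~ s/\Q$was\E/$is/g;    # \Q...\E makes $was match literally
    }
    return $text;
}
```

With the map { "a.b" => "X" }, the text "a.b axb" only has its first word replaced, because the dot is no longer treated as "any character". The substitutions still run one after another, so a replacement string that contains a later key would be rewritten again.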
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Thanks for the ideas. I got them all incorporated except the hash. I can't figure out how to do it. My file is line-delimited, and I can't find an example of how to import it into a hash.

    Code:
    WAS: man
    IS:  trunk
    WAS: hair
    IS:  leash
    WAS: monger
    IS:  taco
    
    This doesn't work:
    my %hashWasIs;
    
    open my $filehandle, "<", $WasIsFile or die "Can't open $WasIsFile: $!";
    while (<$filehandle>)
    	{
    	   my ($key, $val) = split(/\n/, $_);
    	   $hashWasIs{$key} =$val;
    	}
    close $filehandle;
  12. #7
  13. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,264
    Rep Power
    1810
    Can you change your file format?

    Code:
    my ($key, $val) = split(/\n/, $_);
    The newline comes at the end of the line, after any data. It's not what you want to split on. You would split on colon followed by space to have two separate words on each line.

    The first word would be the key, and the second word the value. But you will have to pair adjacent lines.
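
A minimal sketch of that idea, assuming the "WAS:" / "IS:" labels shown in the sample file. The helper name parse_was_is is hypothetical, and stripping a trailing "\r" is an extra precaution that is not part of the advice above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "WAS: x" / "IS:  y" line pairs into a hash.
sub parse_was_is {
    my (@lines) = @_;
    my (%map, $pending);
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;                          # drop LF or CRLF
        # Split once, on the first colon and any spaces that follow it.
        my ($label, $text) = split /:\s*/, $line, 2;
        next unless defined $text;                     # blank or malformed line
        if ($label eq 'WAS') {
            $pending = $text;                          # remember the key
        }
        elsif ($label eq 'IS' && defined $pending) {
            $map{$pending} = $text;                    # pair it with the next line
            undef $pending;
        }
    }
    return %map;
}
```

The lines would typically come from reading the file in list context, for example parse_was_is(<$filehandle>).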
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Hi,

    if I look at your original code, it seems that your substitution file is constructed as follows:
    - line 2n + 1 = the pattern to be replaced
    - line 2n + 2 = the replacement string
    (with n starting at 0).

    Is this correct?

    If such is the case, then you need to do something like this to read the substitution file and populate the hash:

    Perl Code:
    my %hashWasIs;
    open my $filehandle, "<", $WasIsFile or die "Can't open $WasIsFile: $!";
    my $key;
    while (<$filehandle>) {
         chomp;
         if ($. % 2) { # input line is odd, this is a key
              $key = $_;
         } else { # input line is even, this is a replacement value
              $hashWasIs{$key}  = $_;
         }
    }
    close $filehandle;


    This is the normal logical way to do it. But, to tell the truth, Perl has some powerful shortcuts and, among other things, is able to transform an array with an even number of elements into a hash (at least insofar as there are no duplicates among the odd-numbered items, which become the keys). So you could do this:

    Perl Code:
    open my $filehandle, "<", $WasIsFile or die "Can't open $WasIsFile: $!";
    my @temp_array = <$filehandle>; # "slurps" the file into a temporary array
    chomp @temp_array; # strips the trailing newlines, otherwise they end up in the keys and values
    my %hashWasIs = @temp_array; # transforms the file content into a hash


    Actually, I did not try it, but I am almost sure it could even work directly in one shot, without the temporary array (or let me put it another way, I know the syntax basically works, but since I don't have your file, I can't test it):

    Perl Code:
    open my $filehandle, "<", $WasIsFile or die "Can't open $WasIsFile: $!";
    my %hashWasIs = <$filehandle>; # "slurps" the file into the hash


    Well, with this last syntax, we can't chomp to remove the newlines, so there might be a problem later when we try to use the hash. After all, the previous syntax with the temporary array is probably better.
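
If the one-shot form is still wanted, each line can be cleaned on its way into the hash. A minimal sketch, assuming the file uses the five-character "WAS: " / "IS:  " labels from the original script. The helper name load_pairs is hypothetical, and removing "\r" as well as "\n" is an added precaution.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp "WAS: x" / "IS:  y" lines straight into a hash.
sub load_pairs {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my @items = map {
        my $line = $_;            # work on a copy of the line
        $line =~ s/\r?\n\z//;     # remove LF or CRLF
        substr $line, 5;          # drop the five-character label
    } grep { /\S/ } <$fh>;        # list context: all lines; skip blank ones
    close $fh;
    die "Odd number of lines in $path\n" if @items % 2;
    my %pairs = @items;           # odd items become keys, even items values
    return %pairs;
}
```

The explicit check for an odd number of items is there because a list with a missing last value would otherwise be turned into a hash with only a warning.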
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    The first hash approach above, the 'normal logical way to do it', works for me. It makes sense to me. I've heard hash is a powerful tool and now I'm actually using one. $bodyIs =~ s/$_/$hashWasIs{$_}/g foreach keys %hashWasIs is one powerful way to execute a substitution script.

    The powerful shortcuts approach did not work for me. If perl is able to transform an array with an even number of elements into a hash, that would be great, and I'll keep playing with it.

    For the large set of files and substitutions I use to benchmark, the script runs in 62 seconds using an array, 60 seconds using a hash.

    This my $body=read_file($filename) works on my Windows box running Strawberry Perl, but it doesn't work on the Unix box my script will be used on. So I changed to the conventional approach my $body = <FILE>. I think this means slurp isn't on that Unix machine. Unless I'm mistaken, it's running slightly faster using the conventional approach. I'll ask the lab manager IT guy if he knows.

    Thank you very much Axweildr and Laurent.
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Originally Posted by gary84
    The first hash approach above, the 'normal logical way to do it', works for me. It makes sense to me. I've heard hash is a powerful tool and now I'm actually using one. $bodyIs =~ s/$_/$hashWasIs{$_}/g foreach keys %hashWasIs is one powerful way to execute a substitution script.

    The powerful shortcuts approach did not work for me. If perl is able to transform an array with an even number of elements into a hash, that would be great, and I'll keep playing with it.
    It should work, something else must have been wrong (see below for a possible explanation). This is an example under the debugger:

    Code:
      DB<1> @c = qw / a b c d e f g h i j /;
    
      DB<3> x @c;
    0  'a'
    1  'b'
    2  'c'
    3  'd'
    4  'e'
    5  'f'
    6  'g'
    7  'h'
    8  'i'
    9  'j'
    As you can see, @c is now an array with 10 elements, the first 10 letters of the alphabet.

    If I now assign the array to a hash:

    Code:
      DB<4> %d = @c;
    
      DB<5> print $d{c};
    d
    
      DB<6> print "$_: $d{$_} \n" foreach sort keys %d;
    a: b
    c: d
    e: f
    g: h
    i: j
    As you see, %d is a perfectly valid hash.

    (Please note that I am not using "my" here to declare my variables, because lexical variables declared with "my" do not persist from one debugger command to the next. I would of course do it in actual code.)

    Originally Posted by gary84
    For the large set of files and substitutions I use to benchmark, the script runs in 62 seconds using an array, 60 seconds using a hash.
    I did not expect the hash to be faster than the array in this specific case, because you are visiting every value stored in the data structure each time (I actually might have expected the hash itself to be slightly slower than the array, but the slight code simplification introduced by the use of a hash probably speeds things up a tiny bit). A hash would be much faster than an array if you were looking up just one value in a collection, versus iterating through the array to find it. I recommended a hash not for performance reasons, but because that data structure is, I think, far more appropriate and more natural from a conceptual and design standpoint.

    Originally Posted by gary84
    This my $body=read_file($filename) works on my windows box running strawberry, but it doesn't work on the unix box my script will be used on. So I changed to the conventional approach my $body = <FILE>. I think this means slurp isn't on that unix machine. Unless I'm mistaken, it's running slightly faster using the conventional approach. I'll ask the lab manager IT guy if he knows.
    Slurping as such is certainly available on your Unix machine, since it is built into Perl. The read_file function, however, comes from the File::Slurp module, which is a CPAN module rather than part of the core Perl distribution, so it may simply not be installed there. With the built-in approach, you might also have run into a context issue. If you use the diamond <> operator in scalar context, you retrieve one line of input at a time; if you use it in list (or array) context, you slurp the whole file into the array (each line being one element). For example:

    Perl Code:
    my $body = <FILE>; # scalar context, reads one line of the file into the $body scalar variable
    my @array_of_bodies = <OTHER_FILE>; # list context, reads the whole other file into the array


    The difference between the two examples is just that $body is a scalar variable and @array_of_bodies is an array.

    Actually, this somewhat subtle difference might be the reason why you were not able to convert an array to a hash; possibly your variable did not contain what you expected. If you're interested in trying it, please post the code that you have tried, and we can probably find the reason why it did not work for you.
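
To make the context difference concrete, here is a minimal sketch with three hypothetical helpers (the names are illustrative only). The third one shows how undefining the input record separator $/ lets a scalar read return the whole file.

```perl
use strict;
use warnings;

# Hypothetical helpers showing how context and $/ change what <$fh> returns.

sub first_line {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my $line = <$fh>;          # scalar context: one line only
    close $fh;
    return $line;
}

sub all_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    my @lines = <$fh>;         # list context: every line, one per element
    close $fh;
    return @lines;
}

sub whole_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                  # undefined separator: the next read takes everything
    my $body = <$fh>;          # still scalar context, but now the whole file
    close $fh;
    return $body;
}
```

Because $/ is changed with local, the default line-by-line behaviour comes back as soon as whole_file returns.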
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Why does the same script produce different results? This is strange.
    The file whose contents are undergoing substitution contains,
    Code:
    101 102
    103 104
    105 106
    107 108
    The file was-is.txt says,
    Code:
    WAS: 101
    IS:  201
    WAS: 102
    IS:  202
    WAS: 103
    IS:  203
    WAS: 104
    IS:  204
    WAS: 105
    IS:  205
    WAS: 106
    IS:  206
    WAS: 107
    IS:  207
    WAS: 108
    IS:  208
    On my windows box, the result is,
    Code:
    201 202
    203 204
    205 206
    207 208
    On my unix box, the result is,
    Code:
    101 202
    103 204
    105 206
    107 108
    The script is,
    Perl Code:
    #!/usr/bin/perl -w
    use warnings;
    use strict;

    # activate everywhere
    use File::Find;
    use Time::HiRes qw(gettimeofday);
    use File::Path;
    #use File::Copy;
    #use File::Basename;
    #use File::Copy;
    #use IO::Dir;
    #use IO::File;
    #use Time::Local;
    #------------------------------------------------------
    # INPUT
    # activate at home
    my $strWasIsFile = "C:/test1/was-is.txt";
    my $strTopFolder = "C:/test1/files";
    # activate at work
    #my $strWasIsFile = "/home/wachg/test1/was-is.txt";
    #my $strTopFolder = "/home/wachg/test1/files";

    #------------------------------------------------------
    # OUTPUT
    # Same files in $strTopFolder

    #------------------------------------------------------
    my $intFilesExamined=0;
    my $intFilesChanged=0;
    my $intDiscreteChanges=0;
    my @AllowedExtensions = ("txt", "csv", "dto", "mcr", "tcd", "xml");

    #------------------------------------------------------
    # Input WAS-IS text file as a hash
    # Required file format (front 5 characters will be truncated during processing):
    #   WAS: {1st_WAS_string}
    #   IS:  {1st_IS_string}
    #   WAS: {2nd_WAS_string}
    #   IS:  {2nd_IS_string}

    my %hashWasIs;
    my $key;
    open my $filehandle, "<", $strWasIsFile or die "file read problem $strWasIsFile $!\n";
    while (<$filehandle>) {
        chomp;
        if ($. % 2) {
            $key = substr($_,5); }
        else {
            $hashWasIs{$key} = substr($_,5); }}
    close $filehandle;
    #print join(",", %hashWasIs);

    #------------------------------------------------------

    if (1) {
        my $startTime = Time::HiRes::time();
        find(\&MultiFileMultiStringSubstitution,$strTopFolder);

        my $stopTime = Time::HiRes::time();
        my $duration = $stopTime-$startTime;
        my $durationTime = substr("00".int($duration/60/60),-2).":".substr("00".int($duration/60),-2).":".substr("00".int($duration%60),-2);

        print "Files Examined: $intFilesExamined\n";
        print "Files Changed: $intFilesChanged\n";
        print "Duration: $durationTime\n";
        #print "Discrete Changes: $intDiscreteChanges\n";
    }

    #------------------------------------------------------
    sub MultiFileMultiStringSubstitution {
        my $filename = $File::Find::name;
        my $int1=0;
        #print $filename."\n";
        if(-f $filename) {

            my ($filenamewithoutextension, $fileextension) = (/^(.*?)\.?([^\.]*)$/);

            while ($int1 <= $#AllowedExtensions && ($AllowedExtensions[$int1] ne lc($fileextension))) {
                ++$int1;}

            if ($int1 <= $#AllowedExtensions) {
                #print "filename :$filename \n";
                $intFilesExamined++;

                local $/=undef;
                open FILE, $filename or die "file read problem $filename $!\n";
                my $body = <FILE>;
                close FILE;

                my $bodyWas=$body;
                my $bodyIs=$body;
                $bodyIs =~ s/$_/$hashWasIs{$_}/ foreach keys %hashWasIs;
                if (!($bodyWas eq $bodyIs)) {
                    $intFilesChanged++;
                    open FILE, ">", $filename or die "file write problem $filename $!\n";
                    print FILE "$bodyIs";
                    close(FILE)
                    }
                }
            else {} #this is a folder, not a file
        }
    }
    #------------------------------------------------------
  22. #12
  23. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Hi Gary,

    most probably, the difference comes from the end of line characters under Windows and Unix.

    Under Unix, the end of line (or new line) is a single character, ASCII 10 or "\n".

    Under Windows, the end of line is a combination of 2 characters, ASCII 13 and ASCII 10 (or "\r\n").

    When Perl runs under Windows, its default text-mode I/O translates "\r\n" into "\n" as a file is read, so after chomp nothing is left over. When Perl runs under Unix or Linux there is no such translation, and chomp removes only the ASCII 10. The problem occurs if you are running under Unix with files prepared under Windows: Perl removes the final ASCII 10 but not the preceding ASCII 13. In this case, the keys and values of your %hashWasIs keep a trailing "\r" character. A key such as "101\r" therefore does not match the number when it is in the middle of a line of the file where the replacements have to be made, but it does match when the number is at the end of a line, because there the number really is followed by "\r".

    One solution is to use the dos2unix utility which will change your Windows/DOS format files to the Unix/Linux file format. If you don't have this utility on your system, you can use the following Perl one-liner under the Unix prompt to make the necessary changes to your input files:

    Code:
    perl -pi -e 's/\r//g' file_name.txt
    (This will remove all the "\r" characters from your file_name.txt file.)

    The other solution is to make the necessary changes in your script at the place where you read the WasIs file:

    Perl Code:
    while (<$filehandle>) {
        chomp;
        s/\r//g; # removes Windows \r characters from the line
        if ($. % 2) {
            $key = substr($_,5); }
        else {
            $hashWasIs{$key} = substr($_,5); }}
    close $filehandle;


    You can do that safely in your script even if it runs under Windows: the additional instruction corrects the problem under Unix and has no effect under Windows, where the "\r" has already been dropped while the file was read.

    Do that for the WasIs file only; you don't need, and probably don't want, to do it for your other files.
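
The effect can be reproduced without any files. A minimal sketch, assuming the five-character label from the script above; the helper name key_from_line and its $strip_cr flag are hypothetical and only there to compare the two behaviours.

```perl
use strict;
use warnings;

# Hypothetical helper: build a key from one line of the WasIs file,
# with or without the extra "\r" removal.
sub key_from_line {
    my ($line, $strip_cr) = @_;
    chomp $line;                      # removes the trailing "\n" only
    $line =~ s/\r//g if $strip_cr;    # the extra step suggested above
    return substr $line, 5;           # drop the "WAS: " / "IS:  " label
}

# One line of a Windows-format data file as a Unix perl would read it.
my $data = "101 102\r\n";

for my $strip_cr (0, 1) {
    my $was   = key_from_line("WAS: 101\r\n", $strip_cr);
    my $found = $data =~ /\Q$was\E/ ? "matches" : "does not match";
    print "strip_cr=$strip_cr: key of length ", length($was), " $found\n";
}
```

Without the extra step the key is "101\r", which cannot match in the middle of the line; with it the key is the plain "101".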
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Originally Posted by Laurent_R
    most probably, the difference comes from the end of line characters under Windows and Unix.
    Code:
    s/\r//g; # removes Windows \r characters from the line
    Thank you, that worked. I thought that was the problem, and I tried to clean was-is.txt by doing OpenNotepad>SelectAll>Copy>OpenNotepad>Paste>Save but apparently that brought \r along.

    I need a way to count the grand total number of individual substitutions made while the script ran. For example, if 10 files each had 10 string substitutions, the number 100 should print. It's ok if calculating metrics slows down the script.

    This probably also explains why the powerful shortcut approach didn't work. I need help getting rid of \r after inputting using the all-at-once approach.
  26. #14
  27. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    You could add a counter $i and try to do this:

    Perl Code:
    $bodyIs =~ s/$_/$hashWasIs{$_}/g and $i++ foreach keys %hashWasIs;


    But that counts correctly only if each pattern occurs at most once in $bodyIs. That is, if $_ occurs twice and the /g modifier replaces both occurrences in one shot, you'll still increment the counter by only 1.

    If the same pattern can occur more than once, the code will have to change a bit more.
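
One way that change could look, as a minimal sketch: s///g itself returns the number of replacements it made, so that value can be added up instead of incrementing by one. The helper name substitute_and_count is hypothetical, and the \Q...\E quoting assumes the WAS strings are literal text.

```perl
use strict;
use warnings;

# Hypothetical helper: apply every WAS => IS pair and count each replacement.
sub substitute_and_count {
    my ($text, $map) = @_;
    my $total = 0;
    for my $was (sort keys %$map) {
        my $is = $map->{$was};
        # s///g returns the number of replacements made, or a false
        # value when nothing matched.
        my $n = ($text =~ s/\Q$was\E/$is/g);
        $total += $n if $n;
    }
    return ($text, $total);
}
```

Called once per file, the returned count can be added to a running grand total such as $intDiscreteChanges.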
  28. #15
  29. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2013
    Posts
    18
    Rep Power
    0
    Yes the patterns repeat.
    Before,
    Code:
    101 102 101 102
    103 104 103 104
    105 106 105 106
    After,
    Code:
    201 202 201 202
    203 204 203 204
    205 206 205 206
    As you said, your approach says 6, I want it to say 12. I'm going to keep playing with it.