Page 1 of 2
    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0

    Compare two files using AWK or SED


    Hello folks, I am trying to solve the age-old problem of comparing a new file against a similar old file. Any help would be appreciated:

    New file: (File1)

    FEATURE qu2 vendr 2008.02 permanent 20 7A7CDAF65 SIGN="15D6 \
    FEATURE qu2_llock vendr 2008.02 permanent 20 FB2A5CF76 \
    FEATURE qu2_manu vendr 2008.02 permanent 30 A0AF150F6 \
    FEATURE mpplus2 vendr 2008.02 permanent 40 501044FE SIGN="0227 \
    FEATURE 5TD5_512A vendr 2008.02 permanent uncounted FBF35FBA4 \
    FEATURE vendr_lnx vendr 2008.02 permanent 20 97A11F6175 \
    FEATURE qu2_stpiigx vendr 2008.02 permanent 40 27CA99935 \
    FEATURE 5TF7_0012 vendr 2008.02 permanent 40 5F0EFC594 \
    FEATURE 5TF7_0014 vendr 2008.02 permanent 40 71542D906 \
    FEATURE 5TF7_00BF vendr 2008.02 permanent 40 670B6E64A \
    FEATURE 5TF7_00BE vendr 2008.02 permanent 40 8C47B4023 \
    FEATURE 5TF7_00AD vendr 2008.02 permanent 40 571F667F7 \
    INCREMENT 5TF7_0016 vendr 2008.02 permanent 5 6978662DC \
    INCREMENT 5TF7_0017 vendr 2008.02 permanent 5 6EE456415 \


    OLD FILE: (File2)
    FEATURE qu2 vendr 2007.02 permanent 40 7A7CDAF65 SIGN="15D6 \
    FEATURE qu2_llock vendr 2007.02 permanent 50 FB2A5CF76 \
    FEATURE qu2_manu vendr 2007.02 permanent 50 A0AF150F6 \
    FEATURE qu2_nonvolatile_encryption vendr 2007.02 permanent 40 \
    FEATURE qu2_stpgx_2 vendr 2007.02 permanent 40 A873F9924 \
    FEATURE qu2_stpii vendr 2007.02 permanent 50 CE634C9C1 \
    FEATURE qu2_stpiigx vendr 2007.02 permanent 40 27CA99935 \
    FEATURE 5TF7_0012 vendr 2007.02 permanent 40 5F0EFC594 \
    FEATURE 5TF7_0014 vendr 2007.02 permanent 40 71542D906 \
    FEATURE 5TF7_0016 vendr 2007.02 permanent 9 6983F62DC \
    FEATURE 5TF7_0017 vendr 2007.02 permanent 9 6EE7AA415 \


    The key field is column 2 in both files. It can be a number or a unique name.
    I want to compare the two files on column 2: check whether each key is present in the new file and, if it is, report the count given in column 6.
    Example: FEATURE 5TF7_0017 vendr 2007.02 permanent 9 6EE7AA415 (from File2)

    Comparing the files, we can see that 5TF7_0017 is also present in the new file, where it appears as an INCREMENT with a count of 5 (see the last lines of both files).

    If possible, I would like to write the results to a new file showing which entries are missing from which file.

    Thank you.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    Code:
    awk ' 
    BEGIN { 
    while ( getline < "file1" ) { arr[$2] = $0 } 
    } 
    { if (length( arr[$2] ) == 0 )  print FILENAME":" $0 } 
      else delete arr[$2]; 
    }
    END { 
    for( key in arr ) 
    if ( length(arr[key]) ) print "file1:" arr[key] 
    } ' file2
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Thank you Anbu23. I tried it, but I am getting an error at "else delete arr[$2];"

    I just saved the whole script as a file called ./mycmp and I had the new file named "file1" and the old file named "file2" in the same folder. I ran the script by simply typing ./mycmp at command prompt.

    Secondly, could you please add some comments so that I can understand what the script does and customize it to my needs?
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    Remove the closing brace at the end of this line, so that the else on the following line stays attached to the if:
    Code:
    { if (length( arr[$2] ) == 0 )  print FILENAME":" $0 }
    Code:
    while ( getline < "file1" ) { arr[$2] = $0 }
    This reads every line of file1 and stores it in the array arr, using field two as the index.

    Code:
    if (length( arr[$2] ) == 0 )  print FILENAME":" $0
    else delete arr[$2];
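    Here we check whether field two of the current file2 line is present in the array. If the length is zero, that key is not present in file1, so we print the line. Otherwise we delete that element from the array, so every key that also occurs in file2 is removed from arr. Note that merely referencing arr[$2] creates an empty element, which is why the length is tested again in the END section.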

    Code:
    for( key in arr ) 
    if ( length(arr[key]) ) print "file1:" arr[key]
    Finally, print the elements remaining in the array. These are the file1 lines whose key never appeared in file2.
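The three steps above can be put together in one commented sketch. It is an illustration under assumptions, not the exact script from this thread: the function name compare_keys and the sample data are made up, the first file is passed in with -v, and membership is tested with awk's standard `key in array` operator instead of length(), which avoids creating empty elements.

```shell
#!/bin/sh
# compare_keys FILE1 FILE2 (hypothetical helper name)
# Prints the lines whose field-2 key occurs in only one of the two files.
compare_keys() {
    awk -v first="$1" '
    BEGIN {
        # Step 1: load FILE1 into arr, indexed by field 2.
        while ((getline line < first) > 0) {
            split(line, f, " ")
            arr[f[2]] = line
        }
        close(first)
    }
    {
        # Step 2: a key also stored from FILE1 is marked as matched;
        # any other key exists only in FILE2.
        if ($2 in arr) matched[$2] = 1
        else print FILENAME ":" $0
    }
    END {
        # Step 3: FILE1 entries that were never matched.
        for (key in arr)
            if (!(key in matched)) print first ":" arr[key]
    }' "$2"
}

# Sample data, invented for this example.
dir=$(mktemp -d)
printf 'FEATURE a v 2008 permanent 20\nFEATURE b v 2008 permanent 30\n' > "$dir/file1"
printf 'FEATURE b v 2007 permanent 50\nFEATURE c v 2007 permanent 40\n' > "$dir/file2"
compare_keys "$dir/file1" "$dir/file2"
```

With this sample data the sketch reports key c as present only in file2 and key a as present only in file1. Marking matches instead of deleting them means a key that occurs twice in the second file is not reported as missing on its second occurrence.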
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Thank you. That seemed to work. That was beautiful. I noticed that it stops after the first instance of a key field. How can I make it look through the file for multiple entries of the same key field as well as for missing entries? You could have something like this:

    newfile:
    FEATURE 5TF7_0001 vendrd 2008.02 permanent 40 34D86DAE \
    FEATURE 5TF7_0002 vendrd 2008.02 permanent 40 43A1F53D \
    FEATURE 5TF7_0003 vendrd 2008.02 permanent 40 108850D7 \
    FEATURE 5TF7_0004 vendrd 2008.02 permanent 40 B5C7E3A5 \
    FEATUREnu 5TF8_00A4 vendrd 2005.09 permanent 1 530E43DE \
    FEATUREnu 5TF8_00A8 vendrd 2007.02 permanent 1 CD37E410 \
    INCREMENT 5TF7_0005 vendrd 2004.12 permanent 1 F67C6BE7 \
    INCREMENT 5TF7_0006 vendrd 2008.02 permanent 9 88888002 \
    INCREMENT 5TF7_0007 vendrd 2008.02 permanent 9 329EC1EC \
    INCREMENTnu 106D_A008 vendrd 2004.08 permanent 5 9E997BAC \
    INCREMENTnu 5TF8_00A4 vendrd 2005.09 permanent 1 530EF3DE \
    INCREMENTnu 5TF8_00A7 vendrd 2007.02 permanent 1 46552FD7 \
    FEATUREnu 106D_A008 vendrd 2004.08 permanent 5 00015DDD \


    oldfile:
    FEATURE 5TF7_0001 vendrd 2008.02 permanent 10 34D85DAE \
    FEATURE 5TF7_0002 vendrd 2008.02 permanent 10 43A1F53D \
    FEATURE 5TF7_0003 vendrd 2008.02 permanent 10 1088F77D7 \
    FEATURE 5TF7_0004 vendrd 2008.02 permanent 10 B5C7E3A5 \
    INCREMENT 5TF7_0005 vendrd 2004.12 permanent 1 F67CDBE7 \
    INCREMENT 5TF7_0006 vendrd 2008.02 permanent 9 88888002 \
    INCREMENT 5TF7_0007 vendrd 2008.02 permanent 9 3298C1EC \
    INCREMENT 5TF7_0008 vendrd 2008.02 permanent 25 EA2C72CE \


    Result should produce something like this: (the order of reporting may be different than what is shown here)
    oldfile: INCREMENT 5TF7_0008 vendrd 2008.02 permanent 25 EA2C72CE \ (Missing in newfile)
    newfile: INCREMENTnu 106D_A008 vendrd 2004.08 permanent 5 9E997BAC \ (New entry in newfile: #1)
    newfile: INCREMENTnu 5TF8_00A4 vendrd 2005.09 permanent 1 530EF3DE \ (New entry in newfile: #1)
    newfile: INCREMENTnu 5TF8_00A7 vendrd 2007.02 permanent 1 46552FD7 \ (New entry in newfile: #1)
    newfile: FEATUREnu 106D_A008 vendrd 2004.08 permanent 5 00015DDD \ (New entry in newfile: #2)
    newfile: FEATUREnu 5TF8_00A4 vendrd 2005.09 permanent 1 530E43DE \ (New entry in newfile: #1)
    newfile: FEATUREnu 5TF8_00A8 vendrd 2007.02 permanent 1 CD37E410 \ (New entry in newfile: #1)


    Here you will also notice that 106D_A008 appears in the newfile twice, once as a FEATURE and once as an INCREMENT. The count (#1, #2, ...) in the bracketed comment indicates how many times the key has been found in the file so far.

    Any help that you can provide will be greatly appreciated. Thank you kindly for your time.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    Code:
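    awk ' 
    # Load oldfile into arr, keyed on field 2 (runs once, before newfile is read).
    BEGIN { 
    while ( (getline < "oldfile") > 0 ) { arr[$2] = $0 } 
    } 
    # For each newfile line: report keys absent from oldfile and number the repeats.
    { if (length( arr[$2] ) == 0 ){ ky[$2]++;  print FILENAME ":" $0 " (New entry in " FILENAME " : #" ky[$2] ")" }
      else delete arr[$2]; 
    }
    # Whatever is left in arr was never matched, so it is missing from newfile.
    END { 
    for( key in arr ) 
    if ( length(arr[key]) ) print "oldfile:" arr[key] " (Missing in " FILENAME ")"
    } ' newfile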
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Thank you again. I think we almost have it. I see one minor problem, which I have seen in other awk programs as well. When the output is printed on screen, it is showing up as follows:

    (New entry in newfile : #1)vendrd 2005.09 permanent 1 530E43DE \
    (New entry in newfile : #1)vendrd 2007.02 permanent 1 CD37E410 \
    (New entry in newfile : #1)8 vendrd 2004.08 permanent 5 9E997BAC \



    As you can see, the first part of the record is being replaced with the comment text. I tried placing a "\n" at the beginning, at the end, and in a separate print statement, but that did not seem to help.

    Thank you.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    Code:
    awk ' 
    BEGIN { 
    while ( (getline < "oldfile") > 0 ) { arr[$2] = $0 } 
    } 
    # The comment text is now printed before the record instead of after it.
    { if (length( arr[$2] ) == 0 ){ ky[$2]++;  print "(New entry in " FILENAME " : #" ky[$2] " ) " $0  }
      else delete arr[$2]; 
    }
    END { 
    for( key in arr ) 
    if ( length(arr[key]) ) print "(Missing in " FILENAME " ) " arr[key] 
    } ' newfile
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Interesting! Is it working for you? I am still seeing the same problem.

    I wonder whether I am doing something wrong here. I saved the code as a file named filecomp.awk, changed its permissions to 755, and executed it from the command line by typing the filename:
    %> filecomp.awk

    I am running this on a Linux machine and my Xterm is using csh.

    (New entry in newfile : #1)vendrd 2005.09 permanent 1 530EF1F243DE \
    (New entry in newfile : #1)vendrd 2007.02 permanent 1 CD37C7FDE410 \
    (New entry in newfile : #1)8 vendrd 2004.08 permanent 5 9E9930A27BAC \


    For some reason the comment text in parentheses is being written over the start of $0.

    I am new to AWK.
    As I understand it, the BEGIN section reads each line of oldfile into the array. The main block then compares $2 of every newfile line against the stored keys, prints the line if there is no match, deletes the array entry if there is a match, and moves on to the next record.

    I am a bit confused about the END section. In the
    Code:
    for( key in arr )
    statement, where is key defined, and what is arr holding: a line from oldfile or a line from newfile? How does the system know which field to use for the variable key?
    And why do you need that for statement?
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    The code works fine for me.
    Code:
    for( key in arr )
    In arr we store the records, using the second field as the index. Arrays with arbitrary indices are called associative arrays because they associate names with values.
    Example:
    Code:
    arr["qu2"] = "FEATURE qu2 vendr 2008.02 permanent 20 7A7CDAF65 SIGN=\"15D6 \\"
    The for statement above is the syntax for walking through the associative array arr: on each pass, key is set to the next index stored in the array.

    "What is arr holding - a line from oldfile or a line from newfile?"
    In the BEGIN section we store the records from oldfile in the array arr.

    "And why do you need that for statement?"
    To print the records remaining in arr, i.e. the records of oldfile that are missing from newfile.
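The behaviour of an associative array and of the for-in loop can be tried in isolation with a small sketch. The helper name list_keys and the two sample lines are assumptions made for this example. The output is piped through sort only because awk does not specify the order in which for-in visits the indices.

```shell
#!/bin/sh
# list_keys (hypothetical helper): prints "key -> stored line" pairs.
list_keys() {
    awk '
    { arr[$2] = $0 }        # index each record by its second field
    END {
        for (key in arr)    # key takes each stored index in turn
            print key " -> " arr[key]
    }' "$@" | LC_ALL=C sort
}

printf 'FEATURE qu2 vendr 2008.02\nFEATURE qu2_manu vendr 2008.02\n' | list_keys
```

Because each index holds a single value, a later line with the same second field replaces the earlier one. That is why the scripts in this thread keep only one line per key.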
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Well, there must be something wrong with my awk or the system. Are you running it on a Linux machine?

    Take a look at these snippets, simplified to highlight just the problem with printing on screen:
    Code:
    awk '{
      while (getline < "oldfile" ) { arr[$2] = $0; 
       if (length( arr[$2] ) != 0 ) {
        ky[$2]++;  
        print FILENAME": " $0 " (New entry in " FILENAME " : #" ky[$2] ")" 
        }
      }
     
    }' newfile
    and the result I get is:

    newfile: FEATURE 5TF7_0001 vendrd 2008.02 permanent 10 34D863BE5DAE OLD-01 (New entry in newfile : #1)
    newfile: FEATURE 5TF7_0002 vendrd 2008.02 permanent 10 43A1F8A1053D OLD-02 (New entry in newfile : #1)
    newfile: FEATURE 5TF7_0003 vendrd 2008.02 permanent 10 1088505F77D7 OLD-03 (New entry in newfile : #1)
    newfile: FEATURE 5TF7_0004 vendrd 2008.02 permanent 10 B5C7D60DE3A5 OLD-04 (New entry in newfile : #1)
    newfile: INCREMENT 5TF7_0008 vendrd 2008.02 permanent 25 EA28E37C72CE OLD-08 (New entry in newfile : #1)

    Ignore the wording of the comment in parentheses. The important point is that the comment appears at the end of the line, as expected. OLD-0# just indicates that the line is from the oldfile and that it is line number 0#.

    Now look at this code, same code with minor changes:
    Code:
    awk '{
      while (getline < "oldfile" ) { arr[$2] = $0;}  # added this closing braces 
       if (length( arr[$2] ) != 0 ) {
        ky[$2]++;  
        print FILENAME": " $0 " (New entry in " FILENAME " : #" ky[$2] ")" 
      }  # removed one closing brace from next line
    }' newfile
    and the results I get:
    newfile: INCREMENT 5TF7_0008 vendrd 2008.02 permanent 25 EA28E37C72CE OLD-08 (New entry in newfile : #1)
    (New entry in newfile : #1)endrd 2008.02 permanent 40 43A1F8A1053D NEW-02
    (New entry in newfile : #1)endrd 2008.02 permanent 40 1088505F77D7 NEW-03
    (New entry in newfile : #1)endrd 2008.02 permanent 40 B5C7D60DE3A5 NEW-04
    (New entry in newfile : #1) vendrd 2004.12 permanent 1 F67C6C94DBE7 NEW-07
    (New entry in newfile : #1) vendrd 2008.02 permanent 9 8888EEFB8002 NEW-08
    (New entry in newfile : #1) vendrd 2008.02 permanent 9 329E91D8C1EC NEW-09


    Here you can see that the comment in parentheses is being written over the $0 from newfile. The first line came out okay. Any idea what causes this problem?

    Here is a one liner:
    Clean Print Code:
    Code:
     awk 'BEGIN {while (getline < "oldfile" ) { arr[$2] = $0;}}{if (length( arr[$3] ) == 0 ) {ky[$3]++;print FILENAME ": " $0}}' newfile
    Result:
    newfile: NEW FEATURE 5TF7_0001 vendrd 2008.02 permanent 40 34D863BE5DAE NEW-01
    newfile: NEW FEATURE 5TF7_0002 vendrd 2008.02 permanent 40 43A1F8A1053D NEW-02


    Problem with Print: (I just added "Hello" after $0)
    Code:
     awk 'BEGIN {while (getline < "oldfile" ) { arr[$2] = $0;}}{if (length( arr[$3] ) == 0 ) {ky[$3]++; print FILENAME ": " $0 "Hello"}}' newfile
    Result:
    Hellole: NEW FEATURE 5TF7_0001 vendrd 2008.02 permanent 40 34D863BE5DAE NEW-01
    Hellole: NEW FEATURE 5TF7_0002 vendrd 2008.02 permanent 40 43A1F8A1053D NEW-02
  22. #12
  23. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    try this
    Code:
    awk ' 
    BEGIN { 
    # "> 0" ends the loop at end of file and also if oldfile cannot be read
    while ( (getline < "oldfile") > 0 ) { arr[$2] = $0 } 
    } 
    { if (length( arr[$2] ) == 0 ){ ky[$2]++;  printf("%s:%s (New entry in %s : # %d )\n", FILENAME, $0 , FILENAME, ky[$2]) }
      else delete arr[$2]; 
    }
    END { 
    for( key in arr ) 
    if ( length(arr[key]) ) printf("oldfile:%s (Missing in %s )\n", arr[key] , FILENAME)
    } ' newfile > file
    I have never worked with csh or Linux. I think it might be due to some terminal setting that affects the output. Try the above code and send the output to a file instead of the screen.
    Do not put the while loop in the main block as you did in your previous post. There it is executed again for every line of newfile, which is unnecessary. If you keep the while loop in the BEGIN section, it is executed only once, before newfile is read.
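The difference between the two placements can be seen in a small sketch. The function names in_main and in_begin and the two-line sample files are invented for this example. A plain `getline < file` assigns each line it reads to $0, so when the loop sits in the main block, the first record of newfile is replaced by the last line of oldfile.

```shell
#!/bin/sh
dir=$(mktemp -d)   # scratch directory; the file contents are invented
printf 'A old1\nB old2\n' > "$dir/oldfile"
printf 'X new1\nY new2\n' > "$dir/newfile"

# Loop in the main block: runs for every newfile record and
# overwrites $0 with oldfile lines on the first record.
in_main() {
    awk -v old="$dir/oldfile" '
    { while ((getline < old) > 0) n++
      print NR ": " $0 }' "$dir/newfile"
}

# Loop in BEGIN: oldfile is read once, before any newfile record exists.
in_begin() {
    awk -v old="$dir/oldfile" '
    BEGIN { while ((getline < old) > 0) arr[$1] = $0 }
    { print NR ": " $0 }' "$dir/newfile"
}

in_main    # prints "1: B old2" then "2: Y new2"
in_begin   # prints "1: X new1" then "2: Y new2"
```

This matches the earlier output in which the first printed line came from oldfile even though FILENAME said newfile.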
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    I think I can see where the problem is now.

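    When we print $0, the line it holds ends with a ^M (carriage return) character, so the terminal returns to the start of the line and whatever is printed after $0 overwrites its beginning.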

    Look at these outputs:
    newfile:NEW INCREMENT 5TF8_00A7 vendrd 2007.02 permanent 1 46552D2F3FD7 NEW-12^M (New entry in newfile : # 6 )
    newfile:NEW FEATURE 106D_A008 vendrd 2004.08 permanent 5 0001DA5T5DDD NEW-13^M (New entry in newfile : # 7 )
    oldfile:FEATURE 5OF7_0001 vendrd 2008.02 permanent 10 34D863BE5DAE OLD-01 (Missing in newfile)
    oldfile:FEATURE 5OF7_0002 vendrd 2008.02 permanent 10 43A1F8A1053D OLD-02 (Missing in newfile)
    As you can see, there is a ^M character after each $0 in the first print statement. You don't see that for the second print statement. Is there any way to prevent that ^M from appearing at the end of the line? Can we read the lines into the array using another method, or is there another command or control character to print the current record?
  26. #14
  27. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    47
    Rep Power
    21
    You can use dos2unix to remove the ^M characters from a file and unix2dos to add them back.
  28. #15
  29. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2007
    Posts
    9
    Rep Power
    0
    Of course, but that is done from the command line. I was wondering whether there is anything we can use in the "print" statement, or in the way we store data in the array "arr", to remove those ^M characters.
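One possible way to do this inside awk itself is to strip a trailing carriage return with sub() before a line is stored or printed. The following is a sketch under assumptions, not a reply from this thread: the helper name strip_cr_compare and the sample files are invented, and it reports only new entries to keep the example short.

```shell
#!/bin/sh
# strip_cr_compare OLD NEW (hypothetical helper): reports NEW lines whose
# field-2 key is absent from OLD, after removing a trailing ^M from each line.
strip_cr_compare() {
    awk -v old="$1" '
    BEGIN {
        while ((getline line < old) > 0) {
            sub(/\r$/, "", line)     # drop the ^M, if any
            split(line, f, " ")
            arr[f[2]] = line
        }
    }
    {
        sub(/\r$/, "")               # same clean-up on the current record ($0)
        if (!($2 in arr))
            print FILENAME ":" $0 " (New entry in " FILENAME ")"
    }' "$2"
}

# Sample files with DOS line endings (CR LF), invented for this example.
dir=$(mktemp -d)
printf 'FEATURE k1 v 10 \\\r\n' > "$dir/oldfile"
printf 'FEATURE k1 v 40 \\\r\nFEATURE k2 v 40 \\\r\n' > "$dir/newfile"
strip_cr_compare "$dir/oldfile" "$dir/newfile"
```

With the carriage return removed, the comment text stays at the end of the line instead of overwriting its beginning. Lines that have no ^M pass through unchanged, because sub() does nothing when the pattern does not match.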