#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    23
    Rep Power
    0

    [BASH]Ignore certain fields when comparing two csv files


    Hi,

    I am working on creating a csv files comparison tool. using shell script.

    Basically, I have two files ( the column headers are just examples and taken dynamically from file):

    File1:

    Code:
    M_DEALNUM,M_PTFOLIO,M_TERN_FMLY,M_TRN_GRP,M_TRN_TYPE,UW_VEGA1,UW_VEGA2,VOL_SKEW_UP
    ,,,,,,,,,
    4120,XAG,CURR,OPT,SMP,COM,0,21,-120334
    4121,XAG,CURR,OPT,SMP,COM,0,35,-120335
    File2:

    Code:
    M_DEALNUM,M_PTFOLIO,M_TERN_FMLY,M_TRN_GRP,M_TRN_TYPE,UW_VEGA1,UW_VEGA2,VOL_SKEW_UP
    ,,,,,,,,,
    4120,XAG,CURR,OPT,SMP,ORG,0,21,-120334
    4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
    4122,XAG,CURR,OPT,SMP,COM,0,45,-120338
    I would ask the user to input which columns header(s) (ex. M_DEALNUM, VOL_SKEW_UP) to use as the key for each row.

    I would also ask the user to provide the columns that should not be compared for differences (ex. M_TRN_TYPE).

    So lets say the keys are: M_DEALNUM, VOL_SKEW_UP
    Columns to ignore: M_TRN_TYPE

    The output of the above file should be:

    file3:

    Code:
    4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
    4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
    
    +4122,XAG,CURR,OPT,SMP,COM,0,45,-120338
    Since roow 2 is different in file 1 and file 2, the same row in corresponding files are ouputted to file 3. Row 3 is only in file 2 so append it to file 3 with a plus sign ( just for illustration, doesnt have to be a plus sign).

    I created a sort command to sort file1 and file2 based on the user inputted keys.

    [code]

    RPT=${RPT_before/.csv_before/_comparison_output.csv}
    RPT_after=${RPT_before/before/after}

    #################################################################################################### ####################################################

    #TODO: Wprking on sorting and then comparing reports throuhg unix:

    ##Sort report based on keys given.

    declare -a CoL_INDEX

    while read rpt_line
    do
    IFS=',' read -ra COL <<< "$rpt_line"
    break;

    done < $OUTPUT_DIR$RPT_before
    #echo $COL[*]
    z=0
    for x in "${KEYS[@]}";do

    echo $x
    index=0

    while [ "$index" -lt "${#COL[@]}" ]; do

    if [[ "${COL[$index]}" == "$x" ]]
    then
    COL_INDEX[$z]=$((index+1))
    echo ${COL_INDEX[${z}]}
    let "z++"
    break;
    fi
    let "index++"

    done
    done

    #build the position string to append to sort
    pos=""
    y=0
    ind=0
    while [ "$ind" -lt "${#COL_INDEX[@]}" ]; do

    pos="${pos}-k ${COL_INDEX[${ind}]},${COL_INDEX[${ind}]} "
    let "ind++"

    done

    #create a tmp file to intake sort result
    echo $pos
    TCH=`touch $OUTPUT_DIR${RPT_before}_tmp`
    "sort -n -b -t ',' $pos $OUTPUT_DIR$RPT_before -o $OUTPUT_DIR${RPT_before}_tmp"

    MV=`$OUTPUT_DIR$RPT_before.tmp $OUTPUT_DIR$RPT_before`

    [code]

    Basically, I need to do the compasion now with considering the columns to ignore. I know there is a diff command but am not sure if I can ignore certain fields to compare.

    Can you please suggest a way to do this?

    Thanks!

    regards,
    akaballa123
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,851
    Rep Power
    481

    A bash command to return matching line numbers


    Install j from www.jsoftware.com
    Save the code block as file j.ijs
    Hmm, a logic error is that CSV means comma separated and I've made you specify the comma. Oh well. Chalk it up as another "PIN" number. file.[12] are the files following your keyed sort. (For different lines change -: to -.@-: )

    matching_lines=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' file.1 file.2 )


    Code:
    exit@(0: smoutput)^:(6~:#ARGV) (0 : 0)
    Arguments: InputFieldSeparator RejectColumnNamesSeparatedByIFS File1 File2
    
    Output: Counting from 0, line numbers of matching lines
    
    Command line example:
       $ jconsole j.ijs A B C D
       Trouble in River City
       $ jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2
       0 1 2
    )
    
    remove=: &(] #"1~ [ -.@e.~ [: {. ])
    read=: 1!:1@<
    
    split_rows=: ];._2                  NB. cut at frets determined by last character, a line feed.
    split_columns=: [: <@deb;._2"1 ,.~  NB. use: IFS split csv_lines
    parse=: split_columns split_rows    NB. IFS (parse read) FILENAME
    prepare=: &(ignore@(parse read))    NB. IFS prepare FILENAME
    matches=: I.@(-:"1/)@,:             NB. return the index origin row numbers of matches
    
    
    same=: 2 : 0  NB. use   FILE1 IFS same IGNORABLES FILE2
    :
    IFS=. m
    ignorable=. <@deb;._2 n,IFS
    ignore=. ignorable remove
    x matches&(IFS prepare) y
    )
     
    'IFS ignorables FILE1 FILE2'=: 2}.ARGV
    
    smoutput FILE1 ((IFS same ignorables) :: ('Trouble in River City'"_)) FILE2
    
    exit 0
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    23
    Rep Power
    0
    Originally Posted by b49P23TIvg
    Install j from www.jsoftware.com
    Save the code block as file j.ijs
    Hmm, a logic error is that CSV means comma separated and I've made you specify the comma. Oh well. Chalk it up as another "PIN" number. file.[12] are the files following your keyed sort. (For different lines change -: to -.@-: )

    matching_lines=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' file.1 file.2 )


    Code:
    exit@(0: smoutput)^:(6~:#ARGV) (0 : 0)
    Arguments: InputFieldSeparator RejectColumnNamesSeparatedByIFS File1 File2
    
    Output: Counting from 0, line numbers of matching lines
    
    Command line example:
       $ jconsole j.ijs A B C D
       Trouble in River City
       $ jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2
       0 1 2
    )
    
    remove=: &(] #"1~ [ -.@e.~ [: {. ])
    read=: 1!:1@<
    
    split_rows=: ];._2                  NB. cut at frets determined by last character, a line feed.
    split_columns=: [: <@deb;._2"1 ,.~  NB. use: IFS split csv_lines
    parse=: split_columns split_rows    NB. IFS (parse read) FILENAME
    prepare=: &(ignore@(parse read))    NB. IFS prepare FILENAME
    matches=: I.@(-:"1/)@,:             NB. return the index origin row numbers of matches
    
    
    same=: 2 : 0  NB. use   FILE1 IFS same IGNORABLES FILE2
    :
    IFS=. m
    ignorable=. <@deb;._2 n,IFS
    ignore=. ignorable remove
    x matches&(IFS prepare) y
    )
     
    'IFS ignorables FILE1 FILE2'=: 2}.ARGV
    
    smoutput FILE1 ((IFS same ignorables) :: ('Trouble in River City'"_)) FILE2
    
    exit 0
    Thanks for the reply! I will try this out but I would lke to limit the usage of external software as mush as possible since I would need to run this scripts on different physicals devices. Therefore, I would like to stick with the in-built unix commands if that is possible :S. However, I will try to integrate this. Also, the js output is only the matching rows' line numbers. I would only like to show the rows that differe, and all of their columns. Is there a way to do that?

    Thanks!
  6. #4
  7. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,851
    Rep Power
    481

    differences


    For different lines change -: to -.@-:

    This comment is a parenthetical remark in the first paragraph of my first post. j is now open source and may become part of the linux distribution---at least as optional software. I tried to find a gawk solution but was unable to get split() to work. I'm sure it's easy in python, which would be part of the standard distribution, or in perl or ruby, maybe sed if you find someone who understands the stack. I mention python first because it would be easiest for me. I was trying to use awk split() to parse command line arguments. There are many alternatives.
    Code:
    bash$ a=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2 )
    bash$ echo $a
    3 4
    bash$ for i in $a;do echo $i ; done
    3
    4
    bash$
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2011
    Posts
    23
    Rep Power
    0
    Originally Posted by b49P23TIvg
    For different lines change -: to -.@-:

    This comment is a parenthetical remark in the first paragraph of my first post. j is now open source and may become part of the linux distribution---at least as optional software. I tried to find a gawk solution but was unable to get split() to work. I'm sure it's easy in python, which would be part of the standard distribution, or in perl or ruby, maybe sed if you find someone who understands the stack. I mention python first because it would be easiest for me. I was trying to use awk split() to parse command line arguments. There are many alternatives.
    Code:
    bash$ a=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2 )
    bash$ echo $a
    3 4
    bash$ for i in $a;do echo $i ; done
    3
    4
    bash$
    Oh ok. I achieved this with perl. But the problem is that the algorithm takes up too much memory when running on big csv files.

    This is because, I go through each rows in each file twice ( i know this sounds sub-optimal but I am just a beginner is perl and needed to construct a basic solution :P). The search algorithm works wonder but only on smaller files. I am currently trying to find a way to make it much more optimal with little success. I will post the perl code ( i cant find an attachment link) so you can have a look and maybe comment on how you might make it more optimal. Thanks for the advise!

    Code:
    use warnings;
    use strict;
    use Getopt::Long;
    use File::Find;
    
    ## @cols_key -> Key columns of input arguments.
    ## @cols_compare -> Columns to compare between records of both input files.
    ## %key_file1 -> Saves content of second input file.
    my (@cols_key, @cols_ignore, @cols_compare, %cmp_file1, %cmp_file2, %key_file1, %key_file2, $cols_key, $cols_ignore, %columns, $ind, %cols_key, %cols_ignore);
    
    ## Check input data. Correct number of arguments and options.
    if ( @ARGV != 5 or 
            ! GetOptions(
                    q[keys=s] => \$cols_key,
                    q[ignore=s] => \$cols_ignore
            )
    ) {
            die <<EOF;
    Usage: perl $0 --keys=column1,column2,... --ignore=column3,column4,... input-file-1 input-file-2 datestamp
    EOF
    }
    
    ## Open input files. End if any error.
    my $if_1 = shift @ARGV or die qq[Cannot copy input file 1: $!\n];
    my $if_2 = shift @ARGV or die qq[Cannot copy input file 2: $!\n];
    ## Accept datestamp for list report
    #my $date_stamp = shift @ARGV or die qq[Cannot accept datestamp: $!\n];
    #my $list_file ="list_report_${date_stamp}.txt";
    
    
    
    
    #================================================================================================================================================#
    ##Compare File 1 to File 2##
    #================================================================================================================================================#
    open my $ifh1, "<", $if_1 or die qq[Cannot open input-file-1 for reading: $!\n];
    while ( my $line = <$ifh1> ){
          
          printf "Still running first!";
          ## First line is header, get name fields of it and save them in a hash.
            if ( $. == 1 ) {
                   chomp $line;
                    my $i = 0;
                    %columns = map { $_ => $i++ } split /\s*,\s*/, $line;
    
                    ## Get input options into arrays to process later.
                    %cols_key = map {$_ => 1} split /,/, $cols_key;
                    @cols_key = split /,/, $cols_key;
                    %cols_ignore = map { $_ => 1 } split /,/, $cols_ignore;
                    @cols_ignore = split /,/, $cols_ignore;
                    
                    #Extract comparison columns
                     @cols_compare = grep { (! (exists $cols_key{ $_ }) ) } keys %columns;
                     @cols_compare = grep { (! (exists $cols_ignore{ $_ }) )} @cols_compare;
    
                     next;
            }
          #  print "Keys:";
           # foreach my $key (@cols_key)
            #{
             #   print $key;
              #  print "\n";
         #   }
         #   foreach my $ignore (@cols_ignore)
          #  {
           #     print $ignore;
            #    print "\n";
          #  }
           # print "Compare: ";
           # foreach my $comp (@cols_compare)
            #{
             #   print $comp;
              #  print "\n";
          #  }
         #  last;
            ## Omit blank lines if exists, and header too.
            next if $line =~ /\A\s*\z/;
            
            ## Remove leading and trailing spaces.
            $line =~ s/\A\s*(.*)\s*\z/$1/;
            
            ## Get fields.
            my @f1 = split /,\s*/, $line;
            #printf @f1;
            ## Get positions of columns.
          #  my @columns_key_pos = map { $columns{ $_ } } @cols_key;
           # my @columns_comp_pos = map { $columns{ $_ } } @cols_compare;
            #my @compare_values;
            
          #  foreach $i @columns_comp_pos
           # {
            #    @compare_values[$i] = $f1[$i]; 
            #}
            #foreach my $pos (@columns_pos)
            #{
            #    print $pos;
             #   print "\n";
           # }
            
           # printf $key_file1{ "@f1[map { $columns{ $_ }} @cols_key]" };
           # printf @f1[ map { $columns{ $_ } } @cols_key ];
            
            ##save key columns, the key of the array is position key column
            ##TODO: There is still something wrong here!!!
            push @{ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }, "@f1[ map { $columns{ $_ } } @cols_key ]";
            #printf  ${ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }[0];
            ##Save entire line and columns to compare
            push @{ $cmp_file1{ "@f1[ map { $columns{ $_ } } @cols_compare ]" } }, $line, "@f1[ map { $columns{ $_ } } @cols_compare ]";
            
            open my $ifh2, "<", $if_2 or die qq[Cannot open input-file-1 for reading: $!\n];
            
            
             while ( my $line2 = <$ifh2> ) 
             {
                
                if ( $. == 1 ) {
                    next;
                } 
                ##keep a indicator for current row to mark if found or not
                $ind = '0';
    
                
                ## Omit blank lines if exists, and header too.
                next if $line2 =~ /\A\s*\z/;
                
                ## Get fields.
                 ## Remove leading and trailing spaces.
                $line2 =~ s/\A\s*(.*)\s*\z/$1/;
                
                #TODO: Wrong!!
                my @f2 = split /,\s*/, $line2;
                push @{ $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]" } }, "@f2[ map { $columns{ $_ } } @cols_key ]";
                push @{ $cmp_file2{ "@f2[ map { $columns{ $_ } } @cols_compare ]" } }, $line2, "@f2[ map { $columns{ $_ } } @cols_compare ]";
                
                
               # printf @f2[ map { $columns{ $_ } } @cols_key ];
    
                ## Check if the values in the key columns are the same  in both files.  If they are then compare the compare columns.
                ##If rows are different then output both the rows.
                
          
                
                if( exists $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]"})
                {
                     if( ${ $key_file1{"@f1[ map { $columns{ $_ } } @cols_key ]" } }[0] eq ${ $key_file2{"@f2[ map { $columns{ $_ } } @cols_key ]" } }[0] ) 
                     {
                           
                            if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] ne ${ $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" } }[1] )
                            {
                                
    
                              #  printf "\n";
                                #printf { $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]"}}[1];
                              #  printf @f2[ map { $columns{ $_ } } @cols_compare ];
                               # printf "\nnumber 1\n";
                                printf "\n"; 
                                printf  ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[0];
                                printf "\n"; 
                                printf $line2;
                                printf "\n";
                                $ind = "1";
    
                            }
                            
                            if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] eq "@f2[ map { $columns{ $_ } } @cols_compare ]" )
                            {
                                  $ind = "1";
                            }
                            
                     }
                 }
                 
                 if ( "$ind" == "1" )
                 {
                    delete $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" };
                    last;
                 }
                  
            }
            close $ifh2;
            
            ##TODO: Fix this!
            if ( "$ind" != "1" )
            {
                $f1[0] .= qq[+];
                ## Print rows not deleted of the hash. They only exists in one file, so print them alone with the "+" sign.
                for ( keys %cmp_file1 ) 
                {
                  ( my $reg = ${ $cmp_file1{ $_ } }[0] ) =~ s/\A([^,]*)/$1+/;
                  printf "%s\n\n", $reg;
                  delete $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" };
                }
            }
            
            
    
    }
    close $ifh1;
    
    %cmp_file1= ();
    %cmp_file2= ();
    %key_file1= ();
    %key_file2= ();
    
    #================================================================================================================================================#
    ##Compare File 2 to File 1##
    #================================================================================================================================================#
    open my $file2, "<", $if_2 or die qq[Cannot open input-file-2 for reading: $!\n];
    while ( my $line2 = <$file2> ){
          
          printf "Still running second!";
          ## First line is header, get name fields of it and save them in a hash.
            if ( $. == 1 ) {
                   chomp $line2;
                     next;
            }
    
            ## Omit blank lines if exists, and header too.
            next if $line2 =~ /\A\s*\z/;
            
            ## Remove leading and trailing spaces.
            $line2 =~ s/\A\s*(.*)\s*\z/$1/;
            
            ## Get fields.
            my @f2 = split /,\s*/, $line2;
    
            ##save key columns, the key of the array is position key column
            ##TODO: There is still something wrong here!!!
            push @{ $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]" } }, "@f2[ map { $columns{ $_ } } @cols_key ]";
            ##Save entire line and columns to compare
            push @{ $cmp_file2{ "@f2[ map { $columns{ $_ } } @cols_compare ]" } }, $line2, "@f2[ map { $columns{ $_ } } @cols_compare ]";
            
            open my $file1, "<", $if_1 or die qq[Cannot open input-file-1 for reading: $!\n];
            
            
             while ( my $line = <$file1> ) 
             {
                
                if ( $. == 1 ) {
                    next;
                } 
                ##keep a indicator for current row to mark if found or not
                $ind = '0';
    
                
                ## Omit blank lines if exists, and header too.
                next if $line =~ /\A\s*\z/;
                
                ## Get fields.
                 ## Remove leading and trailing spaces.
                $line =~ s/\A\s*(.*)\s*\z/$1/;
                
                #TODO: Wrong!!
                my @f1 = split /,\s*/, $line;
                push @{ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }, "@f1[ map { $columns{ $_ } } @cols_key ]";
                push @{ $cmp_file1{ "@f1[ map { $columns{ $_ } } @cols_compare ]" } }, $line, "@f1[ map { $columns{ $_ } } @cols_compare ]";
                
    
                ## Check if the values in the key columns are the same  in both files.  If they are then compare the compare columns.
                ##If rows are different then output both the rows.
                
          
                
                if( exists $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]"})
                {
                     if( ${ $key_file2{"@f2[ map { $columns{ $_ } } @cols_key ]" } }[0] eq ${ $key_file1{"@f1[ map { $columns{ $_ } } @cols_key ]" } }[0] ) 
                     {
                           
                            if ( ${ $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" } }[1] ne ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] )
                            {
                                $ind = "1";
                            }
                            
                            if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] eq "@f2[ map { $columns{ $_ } } @cols_compare ]" )
                            {
                                  $ind = "1";
                            }
                            
                     }
                 }
                 
                 if ( "$ind" == "1" )
                 {
                    delete $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" };
                 }
                  
            }
            close $file1;
            
            ##TODO: Fix this!
            if ( "$ind" != "1" )
            {
                $f2[0] .= qq[+];
                ## Print rows not deleted of the hash. They only exists in one file, so print them alone with the "+" sign.
                for ( keys %cmp_file2 ) 
                {
                  ( my $reg = ${ $cmp_file2{ $_ } }[0] ) =~ s/\A([^,]*)/$1+/;
                  printf "%s\n\n", $reg;
                  delete $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" };
                }
            }
    }
    close $file2;
  10. #6
  11. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,851
    Rep Power
    481

    Maybe submit to perl thread


    I'm not a perl programmer.
    This j solution loads both files together. No go.
    Need to process line by line.

IMN logo majestic logo threadwatch logo seochat tools logo