The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Operating Systems
> Linux Help
|
[BASH]Ignore certain fields when comparing two csv files
Discuss [BASH]Ignore certain fields when comparing two csv files in the Linux Help forum on Dev Shed. [BASH]Ignore certain fields when comparing two csv files Linux Help forum discussing topics including usage, troubleshooting, modules, and distributions. Linux is an open source OS, based on UNIX.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

October 4th, 2011, 01:31 PM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 23
Time spent in forums: 5 h 5 m 10 sec
Reputation Power: 0
|
|
|
[BASH]Ignore certain fields when comparing two csv files
Hi,
I am working on creating a csv files comparison tool. using shell script.
Basically, I have two files ( the column headers are just examples and taken dynamically from file):
File1:
Code:
M_DEALNUM,M_PTFOLIO,M_TERN_FMLY,M_TRN_GRP,M_TRN_TYPE,UW_VEGA1,UW_VEGA2,VOL_SKEW_UP
,,,,,,,,,
4120,XAG,CURR,OPT,SMP,COM,0,21,-120334
4121,XAG,CURR,OPT,SMP,COM,0,35,-120335
File2:
Code:
M_DEALNUM,M_PTFOLIO,M_TERN_FMLY,M_TRN_GRP,M_TRN_TYPE,UW_VEGA1,UW_VEGA2,VOL_SKEW_UP
,,,,,,,,,
4120,XAG,CURR,OPT,SMP,ORG,0,21,-120334
4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
4122,XAG,CURR,OPT,SMP,COM,0,45,-120338
I would ask the user to input which columns header(s) (ex. M_DEALNUM, VOL_SKEW_UP) to use as the key for each row.
I would also ask the user to provide the columns that should not be compared for differences (ex. M_TRN_TYPE).
So lets say the keys are: M_DEALNUM, VOL_SKEW_UP
Columns to ignore: M_TRN_TYPE
The output of the above file should be:
file3:
Code:
4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
4121,XAG,CURR,OPT,SMP,COM,0,36,-120335
+4122,XAG,CURR,OPT,SMP,COM,0,45,-120338
Since roow 2 is different in file 1 and file 2, the same row in corresponding files are ouputted to file 3. Row 3 is only in file 2 so append it to file 3 with a plus sign ( just for illustration, doesnt have to be a plus sign).
I created a sort command to sort file1 and file2 based on the user inputted keys.
[code]
RPT=${RPT_before/.csv_before/_comparison_output.csv}
RPT_after=${RPT_before/before/after}
#################################################################################################### ####################################################
#TODO: Wprking on sorting and then comparing reports throuhg unix:
##Sort report based on keys given.
declare -a CoL_INDEX
while read rpt_line
do
IFS=',' read -ra COL <<< "$rpt_line"
break;
done < $OUTPUT_DIR$RPT_before
#echo $COL[*]
z=0
for x in "${KEYS[@]}";do
echo $x
index=0
while [ "$index" -lt "${#COL[@]}" ]; do
if [[ "${COL[$index]}" == "$x" ]]
then
COL_INDEX[$z]=$((index+1))
echo ${COL_INDEX[${z}]}
let "z++"
break;
fi
let "index++"
done
done
#build the position string to append to sort
pos=""
y=0
ind=0
while [ "$ind" -lt "${#COL_INDEX[@]}" ]; do
pos="${pos}-k ${COL_INDEX[${ind}]},${COL_INDEX[${ind}]} "
let "ind++"
done
#create a tmp file to intake sort result
echo $pos
TCH=`touch $OUTPUT_DIR${RPT_before}_tmp`
"sort -n -b -t ',' $pos $OUTPUT_DIR$RPT_before -o $OUTPUT_DIR${RPT_before}_tmp"
MV=`$OUTPUT_DIR$RPT_before.tmp $OUTPUT_DIR$RPT_before`
[code]
Basically, I need to do the compasion now with considering the columns to ignore. I know there is a diff command but am not sure if I can ignore certain fields to compare.
Can you please suggest a way to do this?
Thanks!
regards,
akaballa123
|

October 6th, 2011, 12:19 AM
|
 |
Contributing User
|
|
|
|
|
A bash command to return matching line numbers
Install j from www.jsoftware.com
Save the code block as file j.ijs
Hmm, a logic error is that CSV means comma separated and I've made you specify the comma. Oh well. Chalk it up as another "PIN" number. file.[12] are the files following your keyed sort. (For different lines change -: to -.@-: )
matching_lines=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' file.1 file.2 )
Code:
exit@(0: smoutput)^:(6~:#ARGV) (0 : 0)
Arguments: InputFieldSeparator RejectColumnNamesSeparatedByIFS File1 File2
Output: Counting from 0, line numbers of matching lines
Command line example:
$ jconsole j.ijs A B C D
Trouble in River City
$ jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2
0 1 2
)
remove=: &(] #"1~ [ -.@e.~ [: {. ])
read=: 1!:1@<
split_rows=: ];._2 NB. cut at frets determined by last character, a line feed.
split_columns=: [: <@deb;._2"1 ,.~ NB. use: IFS split csv_lines
parse=: split_columns split_rows NB. IFS (parse read) FILENAME
prepare=: &(ignore@(parse read)) NB. IFS prepare FILENAME
matches=: I.@(-:"1/)@,: NB. return the index origin row numbers of matches
same=: 2 : 0 NB. use FILE1 IFS same IGNORABLES FILE2
:
IFS=. m
ignorable=. <@deb;._2 n,IFS
ignore=. ignorable remove
x matches&(IFS prepare) y
)
'IFS ignorables FILE1 FILE2'=: 2}.ARGV
smoutput FILE1 ((IFS same ignorables) :: ('Trouble in River City'"_)) FILE2
exit 0
|

October 6th, 2011, 08:08 AM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 23
Time spent in forums: 5 h 5 m 10 sec
Reputation Power: 0
|
|
Quote: | Originally Posted by b49P23TIvg Install j from www.jsoftware.com
Save the code block as file j.ijs
Hmm, a logic error is that CSV means comma separated and I've made you specify the comma. Oh well. Chalk it up as another "PIN" number. file.[12] are the files following your keyed sort. (For different lines change -: to -.@-: )
matching_lines=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' file.1 file.2 )
Code:
exit@(0: smoutput)^:(6~:#ARGV) (0 : 0)
Arguments: InputFieldSeparator RejectColumnNamesSeparatedByIFS File1 File2
Output: Counting from 0, line numbers of matching lines
Command line example:
$ jconsole j.ijs A B C D
Trouble in River City
$ jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2
0 1 2
)
remove=: &(] #"1~ [ -.@e.~ [: {. ])
read=: 1!:1@<
split_rows=: ];._2 NB. cut at frets determined by last character, a line feed.
split_columns=: [: <@deb;._2"1 ,.~ NB. use: IFS split csv_lines
parse=: split_columns split_rows NB. IFS (parse read) FILENAME
prepare=: &(ignore@(parse read)) NB. IFS prepare FILENAME
matches=: I.@(-:"1/)@,: NB. return the index origin row numbers of matches
same=: 2 : 0 NB. use FILE1 IFS same IGNORABLES FILE2
:
IFS=. m
ignorable=. <@deb;._2 n,IFS
ignore=. ignorable remove
x matches&(IFS prepare) y
)
'IFS ignorables FILE1 FILE2'=: 2}.ARGV
smoutput FILE1 ((IFS same ignorables) :: ('Trouble in River City'"_)) FILE2
exit 0
|
Thanks for the reply! I will try this out but I would lke to limit the usage of external software as mush as possible since I would need to run this scripts on different physicals devices. Therefore, I would like to stick with the in-built unix commands if that is possible :S. However, I will try to integrate this. Also, the js output is only the matching rows' line numbers. I would only like to show the rows that differe, and all of their columns. Is there a way to do that?
Thanks!
|

October 6th, 2011, 09:06 AM
|
 |
Contributing User
|
|
|
|
|
differences
For different lines change -: to -.@-:
This comment is a parenthetical remark in the first paragraph of my first post. j is now open source and may become part of the linux distribution---at least as optional software. I tried to find a gawk solution but was unable to get split() to work. I'm sure it's easy in python, which would be part of the standard distribution, or in perl or ruby, maybe sed if you find someone who understands the stack. I mention python first because it would be easiest for me. I was trying to use awk split() to parse command line arguments. There are many alternatives.
Code:
bash$ a=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2 )
bash$ echo $a
3 4
bash$ for i in $a;do echo $i ; done
3
4
bash$
|

October 6th, 2011, 09:22 AM
|
|
Registered User
|
|
Join Date: Sep 2011
Posts: 23
Time spent in forums: 5 h 5 m 10 sec
Reputation Power: 0
|
|
Quote: | Originally Posted by b49P23TIvg For different lines change -: to -.@-:
This comment is a parenthetical remark in the first paragraph of my first post. j is now open source and may become part of the linux distribution---at least as optional software. I tried to find a gawk solution but was unable to get split() to work. I'm sure it's easy in python, which would be part of the standard distribution, or in perl or ruby, maybe sed if you find someone who understands the stack. I mention python first because it would be easiest for me. I was trying to use awk split() to parse command line arguments. There are many alternatives.
Code:
bash$ a=$( jconsole j.ijs , 'M_TRN_TYPE,UW_VEGA1' /tmp/1 /tmp/2 )
bash$ echo $a
3 4
bash$ for i in $a;do echo $i ; done
3
4
bash$
|
Oh ok. I achieved this with perl. But the problem is that the algorithm takes up too much memory when running on big csv files.
This is because, I go through each rows in each file twice ( i know this sounds sub-optimal but I am just a beginner is perl and needed to construct a basic solution :P). The search algorithm works wonder but only on smaller files. I am currently trying to find a way to make it much more optimal with little success. I will post the perl code ( i cant find an attachment link) so you can have a look and maybe comment on how you might make it more optimal. Thanks for the advise!
Code:
use warnings;
use strict;
use Getopt::Long;
use File::Find;
## @cols_key -> Key columns of input arguments.
## @cols_compare -> Columns to compare between records of both input files.
## %key_file1 -> Saves content of second input file.
my (@cols_key, @cols_ignore, @cols_compare, %cmp_file1, %cmp_file2, %key_file1, %key_file2, $cols_key, $cols_ignore, %columns, $ind, %cols_key, %cols_ignore);
## Check input data. Correct number of arguments and options.
if ( @ARGV != 5 or
! GetOptions(
q[keys=s] => \$cols_key,
q[ignore=s] => \$cols_ignore
)
) {
die <<EOF;
Usage: perl $0 --keys=column1,column2,... --ignore=column3,column4,... input-file-1 input-file-2 datestamp
EOF
}
## Open input files. End if any error.
my $if_1 = shift @ARGV or die qq[Cannot copy input file 1: $!\n];
my $if_2 = shift @ARGV or die qq[Cannot copy input file 2: $!\n];
## Accept datestamp for list report
#my $date_stamp = shift @ARGV or die qq[Cannot accept datestamp: $!\n];
#my $list_file ="list_report_${date_stamp}.txt";
#=================================================================================================== =============================================#
##Compare File 1 to File 2##
#=================================================================================================== =============================================#
open my $ifh1, "<", $if_1 or die qq[Cannot open input-file-1 for reading: $!\n];
while ( my $line = <$ifh1> ){
printf "Still running first!";
## First line is header, get name fields of it and save them in a hash.
if ( $. == 1 ) {
chomp $line;
my $i = 0;
%columns = map { $_ => $i++ } split /\s*,\s*/, $line;
## Get input options into arrays to process later.
%cols_key = map {$_ => 1} split /,/, $cols_key;
@cols_key = split /,/, $cols_key;
%cols_ignore = map { $_ => 1 } split /,/, $cols_ignore;
@cols_ignore = split /,/, $cols_ignore;
#Extract comparison columns
@cols_compare = grep { (! (exists $cols_key{ $_ }) ) } keys %columns;
@cols_compare = grep { (! (exists $cols_ignore{ $_ }) )} @cols_compare;
next;
}
# print "Keys:";
# foreach my $key (@cols_key)
#{
# print $key;
# print "\n";
# }
# foreach my $ignore (@cols_ignore)
# {
# print $ignore;
# print "\n";
# }
# print "Compare: ";
# foreach my $comp (@cols_compare)
#{
# print $comp;
# print "\n";
# }
# last;
## Omit blank lines if exists, and header too.
next if $line =~ /\A\s*\z/;
## Remove leading and trailing spaces.
$line =~ s/\A\s*(.*)\s*\z/$1/;
## Get fields.
my @f1 = split /,\s*/, $line;
#printf @f1;
## Get positions of columns.
# my @columns_key_pos = map { $columns{ $_ } } @cols_key;
# my @columns_comp_pos = map { $columns{ $_ } } @cols_compare;
#my @compare_values;
# foreach $i @columns_comp_pos
# {
# @compare_values[$i] = $f1[$i];
#}
#foreach my $pos (@columns_pos)
#{
# print $pos;
# print "\n";
# }
# printf $key_file1{ "@f1[map { $columns{ $_ }} @cols_key]" };
# printf @f1[ map { $columns{ $_ } } @cols_key ];
##save key columns, the key of the array is position key column
##TODO: There is still something wrong here!!!
push @{ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }, "@f1[ map { $columns{ $_ } } @cols_key ]";
#printf ${ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }[0];
##Save entire line and columns to compare
push @{ $cmp_file1{ "@f1[ map { $columns{ $_ } } @cols_compare ]" } }, $line, "@f1[ map { $columns{ $_ } } @cols_compare ]";
open my $ifh2, "<", $if_2 or die qq[Cannot open input-file-1 for reading: $!\n];
while ( my $line2 = <$ifh2> )
{
if ( $. == 1 ) {
next;
}
##keep a indicator for current row to mark if found or not
$ind = '0';
## Omit blank lines if exists, and header too.
next if $line2 =~ /\A\s*\z/;
## Get fields.
## Remove leading and trailing spaces.
$line2 =~ s/\A\s*(.*)\s*\z/$1/;
#TODO: Wrong!!
my @f2 = split /,\s*/, $line2;
push @{ $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]" } }, "@f2[ map { $columns{ $_ } } @cols_key ]";
push @{ $cmp_file2{ "@f2[ map { $columns{ $_ } } @cols_compare ]" } }, $line2, "@f2[ map { $columns{ $_ } } @cols_compare ]";
# printf @f2[ map { $columns{ $_ } } @cols_key ];
## Check if the values in the key columns are the same in both files. If they are then compare the compare columns.
##If rows are different then output both the rows.
if( exists $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]"})
{
if( ${ $key_file1{"@f1[ map { $columns{ $_ } } @cols_key ]" } }[0] eq ${ $key_file2{"@f2[ map { $columns{ $_ } } @cols_key ]" } }[0] )
{
if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] ne ${ $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" } }[1] )
{
# printf "\n";
#printf { $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]"}}[1];
# printf @f2[ map { $columns{ $_ } } @cols_compare ];
# printf "\nnumber 1\n";
printf "\n";
printf ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[0];
printf "\n";
printf $line2;
printf "\n";
$ind = "1";
}
if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] eq "@f2[ map { $columns{ $_ } } @cols_compare ]" )
{
$ind = "1";
}
}
}
if ( "$ind" == "1" )
{
delete $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" };
last;
}
}
close $ifh2;
##TODO: Fix this!
if ( "$ind" != "1" )
{
$f1[0] .= qq[+];
## Print rows not deleted of the hash. They only exists in one file, so print them alone with the "+" sign.
for ( keys %cmp_file1 )
{
( my $reg = ${ $cmp_file1{ $_ } }[0] ) =~ s/\A([^,]*)/$1+/;
printf "%s\n\n", $reg;
delete $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" };
}
}
}
close $ifh1;
%cmp_file1= ();
%cmp_file2= ();
%key_file1= ();
%key_file2= ();
#=================================================================================================== =============================================#
##Compare File 2 to File 1##
#=================================================================================================== =============================================#
open my $file2, "<", $if_2 or die qq[Cannot open input-file-2 for reading: $!\n];
while ( my $line2 = <$file2> ){
printf "Still running second!";
## First line is header, get name fields of it and save them in a hash.
if ( $. == 1 ) {
chomp $line2;
next;
}
## Omit blank lines if exists, and header too.
next if $line2 =~ /\A\s*\z/;
## Remove leading and trailing spaces.
$line2 =~ s/\A\s*(.*)\s*\z/$1/;
## Get fields.
my @f2 = split /,\s*/, $line2;
##save key columns, the key of the array is position key column
##TODO: There is still something wrong here!!!
push @{ $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]" } }, "@f2[ map { $columns{ $_ } } @cols_key ]";
##Save entire line and columns to compare
push @{ $cmp_file2{ "@f2[ map { $columns{ $_ } } @cols_compare ]" } }, $line2, "@f2[ map { $columns{ $_ } } @cols_compare ]";
open my $file1, "<", $if_1 or die qq[Cannot open input-file-1 for reading: $!\n];
while ( my $line = <$file1> )
{
if ( $. == 1 ) {
next;
}
##keep a indicator for current row to mark if found or not
$ind = '0';
## Omit blank lines if exists, and header too.
next if $line =~ /\A\s*\z/;
## Get fields.
## Remove leading and trailing spaces.
$line =~ s/\A\s*(.*)\s*\z/$1/;
#TODO: Wrong!!
my @f1 = split /,\s*/, $line;
push @{ $key_file1{ "@f1[ map { $columns{ $_ } } @cols_key ]" } }, "@f1[ map { $columns{ $_ } } @cols_key ]";
push @{ $cmp_file1{ "@f1[ map { $columns{ $_ } } @cols_compare ]" } }, $line, "@f1[ map { $columns{ $_ } } @cols_compare ]";
## Check if the values in the key columns are the same in both files. If they are then compare the compare columns.
##If rows are different then output both the rows.
if( exists $key_file2{ "@f2[ map { $columns{ $_ } } @cols_key ]"})
{
if( ${ $key_file2{"@f2[ map { $columns{ $_ } } @cols_key ]" } }[0] eq ${ $key_file1{"@f1[ map { $columns{ $_ } } @cols_key ]" } }[0] )
{
if ( ${ $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" } }[1] ne ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] )
{
$ind = "1";
}
if ( ${ $cmp_file1{ "@f1[map { $columns{ $_ } } @cols_compare]" } }[1] eq "@f2[ map { $columns{ $_ } } @cols_compare ]" )
{
$ind = "1";
}
}
}
if ( "$ind" == "1" )
{
delete $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" };
}
}
close $file1;
##TODO: Fix this!
if ( "$ind" != "1" )
{
$f2[0] .= qq[+];
## Print rows not deleted of the hash. They only exists in one file, so print them alone with the "+" sign.
for ( keys %cmp_file2 )
{
( my $reg = ${ $cmp_file2{ $_ } }[0] ) =~ s/\A([^,]*)/$1+/;
printf "%s\n\n", $reg;
delete $cmp_file2{ "@f2[map { $columns{ $_ } } @cols_compare]" };
}
}
}
close $file2;
|

October 6th, 2011, 10:19 AM
|
 |
Contributing User
|
|
|
|
|
Maybe submit to perl thread
I'm not a perl programmer.
This j solution loads both files together. No go.
Need to process line by line.
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|