### Thread: How to compare and match two rows in a file with PERL

1. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0

#### How to compare and match two rows in a file with PERL

Hi to all,
I have data like this
1.First I want to find how many columns (from AB1 to AB5) are different for P1 and P2 ,
Eq means: Both P1 and P2 should contain same letters (alleles) or if any one of P1 or P2 contains Z/Z or -/- I should consider them as eq only.
2.I will compare lines column values from 1 with P2 across all the columns (from AB1 to AB5) in horizontal way and continue for remaining lines from 2 to 5. if they match I would like to give 1 else 0 and I would like to continue this till my programme encounters second set of P1 and P2.
3.I will make sum for lines 1 to 5 across all the columns from columns AB1 to AB5, but I will include only columns showing different for P1 and P2 in my sum count. Now I am working on this with sumif formula.
4.I will calculate percentage of matching lines 1 to 5 with P2 by dividing sum came from SUMIF with number of different markers between P1 and P2.
5.I want to repeat this for remaining set of P1 and P2.
I am expecting like this
LINES XY1 XY2 XY3 XY4 XY5
P1 EQ NE EQ EQ EQ SUM %
P2 1
1 0 0 1 0 0 0 0
2 0 1 0 0 1 1 100
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 1 1 0 0 0 1 100

I am using this code to solve this problem
Code:
```use strict;
use warnings;
sub compare_alleles {
return 1 if grep {\$_ eq '-/-' or \$_ eq 'Z/Z' } @_;
return \$_[0] eq \$_[1] ? 1 : 0;
}
my \$format;
my (@p1, @p2);
my @unequal;
while (<>) {
unless (/^(P?\d)/) {
my @widths;
push @widths, \$+[1] - \$-[1] while /(\S+\s*)/g;
pop @widths;
push @widths, \$widths[-1], \$widths[-1];
\$format = join '', map("%-\${_}s", @widths, ''), "\n";
print;
next;
}
my @fields = split;
if (\$fields[0] eq 'P1') {
@p1 = @fields;
}
elsif (\$fields[0] eq 'P2') {
@p2 = @fields;
printf \$format, 'P1', map (compare_alleles(\$p1[\$_], \$p2[\$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
printf \$format, 'P2', map('', 1..5), '', '1';
@unequal = grep { not compare_alleles(\$p1[\$_], \$p2[\$_]) } 1..5;
}
else {
my @columns = (\$fields[0], map { \$fields[\$_] eq \$p2[\$_] ? 1 : 0 } 1..5);
my \$sum = 0;
\$sum += \$_ for @columns[@unequal];
my \$percent = \$sum == 0 ? 0 : \$sum * 100 / @unequal;
printf \$format, @columns, \$sum, \$percent;
}
}```
I am getting error like use of unintialized vlaue \$_ in concatenation at line19 after using this code
If anyone help me to fix this error would be highly appreciated.
Regards,
Genetist.
2. Is this input or desired output?

LINES XY1 XY2 XY3 XY4 XY5
P1 EQ NE EQ EQ EQ SUM %
P2 1
1 0 0 1 0 0 0 0
2 0 1 0 0 1 1 100
3 0 0 0 0 1 0 0
4 1 0 0 0 0 0 0
5 1 1 0 0 0 1 100
Edit:

I'll cut to the chase since I'm heading to work:

That looks like your desired output, but the error you cited (just a warning) is that some of the input data is undefined.

There may be empty rows, the field separator may be different than you expect in those cases, etc. We can't guess as to why you receive the warning if you don't provide the input file to test with.
Last edited by keath; October 17th, 2013 at 07:00 AM.
3. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0
Dear Keath,
Thank you very much for your reply to my post. yes that is my expected results. Here i am posting some more data like how my data looks like and how i am expecting.
DATA HAVE
Lines XY1 XY2 XY3 XY4 XY5
P1 A/A A/A G/G C/C A/A
P2 T/T Z/Z -/- G/G A/A
1 T/T Z/Z G/G G/G T/T
2 G/G C/C C/C -/- A/A
3 T/T T/T T/T G/G -/-
4 C/C G/G -/- G/G C/C
5 T/T Z/Z A/A G/G G/G
P1 C/C G/G -/- G/G C/C
P2 T/T Z/Z A/A G/G G/G
1 T/T Z/Z G/G G/G T/T
2 G/G C/C C/C -/- A/A
3 T/T T/T T/T G/G -/-
4 C/C G/G -/- G/G C/C
5 T/T Z/Z A/A G/G G/G

EXPECTED RESULTS for 1st parente set
XY1 XY2 XY3 XY4 XY5 SUM OF NE
NE EQ EQ NE EQ 2

1 1 0 1 0 2 100
0 0 0 0 1 0 0
1 0 0 1 0 2 100
0 0 1 1 0 1 50
1 1 0 1 0 2 100
EXPECTED RESULTS for 1st parente set
NE EQ EQ EQ NE 2

1 1 0 1 0 1 50
0 0 0 0 0 0 0
1 0 0 1 0 1 50
0 0 0 1 0 0 0
1 1 1 1 1 2 100
I wish to upload excel file to help you to understand more better but do not know how to upload file here.
Thank you very much once again
with kind regards,
Genetist
4. No Profile Picture
Contributing User
Devshed Novice (500 - 999 posts)

Join Date
Jun 2012
Location
Paris area, France
Posts
846
Rep Power
499
The problem is that line 19 of the code your posted (where you presumably got the warning) is the closing curly brace of the first unless conditional expression. It is highly unlikely that this line is responsible for this warning. So the warning is probably occurring earlier (because some spaces have probably been eathen up), so that I would think that the warning is probably happening on this line:

Perl Code:
`\$format = join '', map("%-\${_}s", @widths, ''), "\n";`

So it looks like the join is getting non initialized values, which would tend to indicate that @widths has some undef values.

I was just trying to give some possible clues, I can't go further for the time being, because I am not able at this point to run your code on your data.
5. I tried your code using the data provided, and I had no problems with the output at all. There were no warnings.

Perhaps there is a problem with how you are exporting data from Excel?
6. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0
Dear Keath,

Thank you very much for your help and spending your time for my problem. I request you to please explain me how you did that process. Actually i have data in excel and i am transfering this data into text file. when i am running my code, it is asking file path and after that i am getting above said. I think may be you are right problem in data transfer. if you can explain me how you did that process it would be helpful and is it possible to get that output onto separate file?
Thanking you very much,

With kind regards,
Genetist
7. Sure. I just did this:

Code:
```#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my \$format;
my (@p1, @p2);
my @unequal;

while (<DATA>) {
unless (/^(P?\d)/) {
my @widths;
push @widths, \$+[1] - \$-[1] while /(\S+\s*)/g;
pop @widths;
push @widths, \$widths[-1], \$widths[-1];
\$format = join '', map("%-\${_}s", @widths, ''), "\n";
print;
next;
}

my @fields = split;
#print Dumper \@fields;

if (\$fields[0] eq 'P1') {
@p1 = @fields;
}

elsif (\$fields[0] eq 'P2') {
@p2 = @fields;
printf \$format, 'P1', map (compare_alleles(\$p1[\$_], \$p2[\$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
printf \$format, 'P2', map('', 1..5), '', '1';
@unequal = grep { not compare_alleles(\$p1[\$_], \$p2[\$_]) } 1..5;
}

else {
my @columns = (\$fields[0], map { \$fields[\$_] eq \$p2[\$_] ? 1 : 0 } 1..5);
my \$sum = 0;
\$sum += \$_ for @columns[@unequal];
my \$percent = \$sum == 0 ? 0 : \$sum * 100 / @unequal;
printf \$format, @columns, \$sum, \$percent;
}
}

sub compare_alleles {
return 1 if grep {\$_ eq '-/-' or \$_ eq 'Z/Z' } @_;
return \$_[0] eq \$_[1] ? 1 : 0;
}

__DATA__
Lines	XY1	XY2	XY3	XY4	XY5
P1	A/A	A/A	G/G	C/C	A/A
P2	T/T	Z/Z	-/-	G/G	A/A
1	T/T	Z/Z	G/G	G/G	T/T
2	G/G	C/C	C/C	-/-	A/A
3	T/T	T/T	T/T	G/G	-/-
4	C/C	G/G	-/-	G/G	C/C
5	T/T	Z/Z	A/A	G/G	G/G
P1	C/C	G/G	-/-	G/G	C/C
P2	T/T	Z/Z	A/A	G/G	G/G
1	T/T	Z/Z	G/G	G/G	T/T
2	G/G	C/C	C/C	-/-	A/A
3	T/T	T/T	T/T	G/G	-/-
4	C/C	G/G	-/-	G/G	C/C
5	T/T	Z/Z	A/A	G/G	G/G```
I used the perl __DATA__ construct, which allows me to end the program with the data to process. I just pasted the provided data there. The only line I changed then was to tell the while loop to read from that area rather than an external file:
Code:
`while (<DATA>) {`
This is just a simple testing method. There would be no difference if I put the data in an external file.

Your script is not asking for the file's location unless you have modified it in some way from what was provided here.
8. And now I've just tested it with your data in an external file. I named the script 'tst.pl', and the data file 'tst.txt'.

From the command line:
./tst.pl tst.txt
Result:

Code:
```Lines	XY1	XY2	XY3	XY4	XY5
P1    nq  eq  eq  nq  eq  SUM %
P2                            1
1     1   1   0   1   0   2   100
2     0   0   0   0   1   0   0
3     1   0   0   1   0   2   100
4     0   0   1   1   0   1   50
5     1   1   0   1   0   2   100
P1    nq  eq  eq  eq  nq  SUM %
P2                            1
1     1   1   0   1   0   1   50
2     0   0   0   0   0   0   0
3     1   0   0   1   0   1   50
4     0   0   0   1   0   0   0
5     1   1   1   1   1   2   100```
9. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0
Hi Dear Keath,

Thanks lot for your help code is working like miracle and i am so happy for that. Eeverything is perfect it will be good if i get my results out in separate file instead of on commond line because i have to work even more on these results to complete my job 100% perfect.
thanking you very much,

with kind regards,
10. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0
Hi Dear Keath,

I solved that issue getting results into output file and once again thanks very much for spending your valueable time to solve my problem.

With kind Regards,
11. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0

#### need help in this code

Dear Keath,
Thank you very much for your reply and sorry for my late reply.
After using that code on my realtime data with 253 columns i am getting output like this
Lines 1 2
P1 eqeqeqeqeqeqeqeqeqeq eq
what i want to say is how all these EQ (almost 10 EQs) are sitting in column 2 instead of spreading in their respective columns according to our code but i am getting SUM and % are exactly in the way i am expecting. this is the code i changed according to my data
code before
Code:
``` printf \$format, 'P1', map (compare_alleles(\$p1[\$_], \$p2[\$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
printf \$format, 'P2', map('', 1..5), '', '1';
@unequal = grep { not compare_alleles(\$p1[\$_], \$p2[\$_]) } 1..5;```
Code:
``` printf \$format, 'P1', map (compare_alleles(\$p1[\$_], \$p2[\$_]) ? 'eq' : 'nq', 1..253), 'SUM', '%';
printf \$format, 'P2', map('', 1..253), '', '1';
@unequal = grep { not compare_alleles(\$p1[\$_], \$p2[\$_]) } 1..253;```
i changed column number to 253 because i have 253 columns.
I just want to know where it went wrong because i just changed only number of columns.
Thanking you very much for your help.
Regards,
Genetist
12. That didn't happen for me. The format stayed the same.

Double check to see if you made some other typo.
13. No Profile Picture
Registered User
Devshed Newbie (0 - 499 posts)

Join Date
Jul 2013
Posts
17
Rep Power
0
Hi Dear Keath,
Thank you very much for your reply. I am also not getting this type of error when i used 5 columns data but if i am using 253 columns data i am getting this above said error i just changed 5 to 253 (before it is 5 and in my real time data i have 253 columns) in this code
Code:
```printf \$format, 'P1', map (compare_alleles(\$p1[\$_], \$p2[\$_]) ? 'eq' : 'nq', 1..253), 'SUM', '%';
printf \$format, 'P2', map('', 1..253), '', '1';
@unequal = grep { not compare_alleles(\$p1[\$_], \$p2[\$_]) } 1..253;```
Thanking you very much,
Regards,
Genetist