#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0

    How to compare and match two rows in a file with PERL


    Hi to all,
    I have data like this
    1.First I want to find how many columns (from AB1 to AB5) are different for P1 and P2 ,
    Eq means: Both P1 and P2 should contain same letters (alleles) or if any one of P1 or P2 contains Z/Z or -/- I should consider them as eq only.
    2.I will compare lines column values from 1 with P2 across all the columns (from AB1 to AB5) in horizontal way and continue for remaining lines from 2 to 5. if they match I would like to give 1 else 0 and I would like to continue this till my programme encounters second set of P1 and P2.
    3.I will make sum for lines 1 to 5 across all the columns from columns AB1 to AB5, but I will include only columns showing different for P1 and P2 in my sum count. Now I am working on this with sumif formula.
    4.I will calculate percentage of matching lines 1 to 5 with P2 by dividing sum came from SUMIF with number of different markers between P1 and P2.
    5.I want to repeat this for remaining set of P1 and P2.
    I am expecting like this
    LINES XY1 XY2 XY3 XY4 XY5
    P1 EQ NE EQ EQ EQ SUM %
    P2 1
    1 0 0 1 0 0 0 0
    2 0 1 0 0 1 1 100
    3 0 0 0 0 1 0 0
    4 1 0 0 0 0 0 0
    5 1 1 0 0 0 1 100

    I am using this code to solve this problem
    Code:
    use strict;
    use warnings;
    sub compare_alleles {
      return 1 if grep {$_ eq '-/-' or $_ eq 'Z/Z' } @_;
      return $_[0] eq $_[1] ? 1 : 0;
    }
    my $format;
    my (@p1, @p2);
    my @unequal;
    while (<>) {
      unless (/^(P?\d)/) {
        my @widths;
        push @widths, $+[1] - $-[1] while /(\S+\s*)/g;
        pop @widths;
        push @widths, $widths[-1], $widths[-1];
        $format = join '', map("%-${_}s", @widths, ''), "\n";
        print;
        next;
      }
      my @fields = split;
      if ($fields[0] eq 'P1') {
        @p1 = @fields;
      }
      elsif ($fields[0] eq 'P2') {
        @p2 = @fields;
        printf $format, 'P1', map (compare_alleles($p1[$_], $p2[$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
        printf $format, 'P2', map('', 1..5), '', '1';
        @unequal = grep { not compare_alleles($p1[$_], $p2[$_]) } 1..5;
      }
      else {
        my @columns = ($fields[0], map { $fields[$_] eq $p2[$_] ? 1 : 0 } 1..5);
        my $sum = 0;
        $sum += $_ for @columns[@unequal];
        my $percent = $sum == 0 ? 0 : $sum * 100 / @unequal;
        printf $format, @columns, $sum, $percent;
      }
    }
    I am getting error like use of unintialized vlaue $_ in concatenation at line19 after using this code
    If anyone help me to fix this error would be highly appreciated.
    Thanks in advance
    Regards,
    Genetist.
  2. #2
  3. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,221
    Rep Power
    1809
    Is this input or desired output?

    LINES XY1 XY2 XY3 XY4 XY5
    P1 EQ NE EQ EQ EQ SUM %
    P2 1
    1 0 0 1 0 0 0 0
    2 0 1 0 0 1 1 100
    3 0 0 0 0 1 0 0
    4 1 0 0 0 0 0 0
    5 1 1 0 0 0 1 100
    Edit:

    I'll cut to the chase since I'm heading to work:

    That looks like your desired output, but the error you cited (just a warning) is that some of the input data is undefined.

    There may be empty rows, the field separator may be different than you expect in those cases, etc. We can't guess as to why you receive the warning if you don't provide the input file to test with.
    Last edited by keath; October 17th, 2013 at 07:00 AM.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Dear Keath,
    Thank you very much for your reply to my post. yes that is my expected results. Here i am posting some more data like how my data looks like and how i am expecting.
    DATA HAVE
    Lines XY1 XY2 XY3 XY4 XY5
    P1 A/A A/A G/G C/C A/A
    P2 T/T Z/Z -/- G/G A/A
    1 T/T Z/Z G/G G/G T/T
    2 G/G C/C C/C -/- A/A
    3 T/T T/T T/T G/G -/-
    4 C/C G/G -/- G/G C/C
    5 T/T Z/Z A/A G/G G/G
    P1 C/C G/G -/- G/G C/C
    P2 T/T Z/Z A/A G/G G/G
    1 T/T Z/Z G/G G/G T/T
    2 G/G C/C C/C -/- A/A
    3 T/T T/T T/T G/G -/-
    4 C/C G/G -/- G/G C/C
    5 T/T Z/Z A/A G/G G/G

    EXPECTED RESULTS for 1st parente set
    XY1 XY2 XY3 XY4 XY5 SUM OF NE
    NE EQ EQ NE EQ 2

    1 1 0 1 0 2 100
    0 0 0 0 1 0 0
    1 0 0 1 0 2 100
    0 0 1 1 0 1 50
    1 1 0 1 0 2 100
    EXPECTED RESULTS for 1st parente set
    NE EQ EQ EQ NE 2

    1 1 0 1 0 1 50
    0 0 0 0 0 0 0
    1 0 0 1 0 1 50
    0 0 0 1 0 0 0
    1 1 1 1 1 2 100
    I wish to upload excel file to help you to understand more better but do not know how to upload file here.
    Thank you very much once again
    with kind regards,
    Genetist
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    The problem is that line 19 of the code your posted (where you presumably got the warning) is the closing curly brace of the first unless conditional expression. It is highly unlikely that this line is responsible for this warning. So the warning is probably occurring earlier (because some spaces have probably been eathen up), so that I would think that the warning is probably happening on this line:

    Perl Code:
    $format = join '', map("%-${_}s", @widths, ''), "\n";


    So it looks like the join is getting non initialized values, which would tend to indicate that @widths has some undef values.

    I was just trying to give some possible clues, I can't go further for the time being, because I am not able at this point to run your code on your data.
  8. #5
  9. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,221
    Rep Power
    1809
    I tried your code using the data provided, and I had no problems with the output at all. There were no warnings.

    Perhaps there is a problem with how you are exporting data from Excel?
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Dear Keath,

    Thank you very much for your help and spending your time for my problem. I request you to please explain me how you did that process. Actually i have data in excel and i am transfering this data into text file. when i am running my code, it is asking file path and after that i am getting above said. I think may be you are right problem in data transfer. if you can explain me how you did that process it would be helpful and is it possible to get that output onto separate file?
    Thanking you very much,

    With kind regards,
    Genetist
  12. #7
  13. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,221
    Rep Power
    1809
    Sure. I just did this:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Data::Dumper;
    
    my $format;
    my (@p1, @p2);
    my @unequal;
    
    while (<DATA>) {
      unless (/^(P?\d)/) {
        my @widths;
        push @widths, $+[1] - $-[1] while /(\S+\s*)/g;
        pop @widths;
        push @widths, $widths[-1], $widths[-1];
        $format = join '', map("%-${_}s", @widths, ''), "\n";
        print;
        next;
      }
      
      my @fields = split;
      #print Dumper \@fields;
      
      if ($fields[0] eq 'P1') {
        @p1 = @fields;
      }
      
      elsif ($fields[0] eq 'P2') {
        @p2 = @fields;
        printf $format, 'P1', map (compare_alleles($p1[$_], $p2[$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
        printf $format, 'P2', map('', 1..5), '', '1';
        @unequal = grep { not compare_alleles($p1[$_], $p2[$_]) } 1..5;
      }
      
      else {
        my @columns = ($fields[0], map { $fields[$_] eq $p2[$_] ? 1 : 0 } 1..5);
        my $sum = 0;
        $sum += $_ for @columns[@unequal];
        my $percent = $sum == 0 ? 0 : $sum * 100 / @unequal;
        printf $format, @columns, $sum, $percent;
      }
    }
    
    sub compare_alleles {
      return 1 if grep {$_ eq '-/-' or $_ eq 'Z/Z' } @_;
      return $_[0] eq $_[1] ? 1 : 0;
    }
    
    __DATA__
    Lines	XY1	XY2	XY3	XY4	XY5
    P1	A/A	A/A	G/G	C/C	A/A
    P2	T/T	Z/Z	-/-	G/G	A/A
    1	T/T	Z/Z	G/G	G/G	T/T
    2	G/G	C/C	C/C	-/-	A/A
    3	T/T	T/T	T/T	G/G	-/-
    4	C/C	G/G	-/-	G/G	C/C
    5	T/T	Z/Z	A/A	G/G	G/G
    P1	C/C	G/G	-/-	G/G	C/C
    P2	T/T	Z/Z	A/A	G/G	G/G
    1	T/T	Z/Z	G/G	G/G	T/T
    2	G/G	C/C	C/C	-/-	A/A
    3	T/T	T/T	T/T	G/G	-/-
    4	C/C	G/G	-/-	G/G	C/C
    5	T/T	Z/Z	A/A	G/G	G/G
    I used the perl __DATA__ construct, which allows me to end the program with the data to process. I just pasted the provided data there. The only line I changed then was to tell the while loop to read from that area rather than an external file:
    Code:
    while (<DATA>) {
    This is just a simple testing method. There would be no difference if I put the data in an external file.

    Your script is not asking for the file's location unless you have modified it in some way from what was provided here.
  14. #8
  15. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,221
    Rep Power
    1809
    And now I've just tested it with your data in an external file. I named the script 'tst.pl', and the data file 'tst.txt'.

    From the command line:
    ./tst.pl tst.txt
    Result:

    Code:
    Lines	XY1	XY2	XY3	XY4	XY5
    P1    nq  eq  eq  nq  eq  SUM %
    P2                            1
    1     1   1   0   1   0   2   100
    2     0   0   0   0   1   0   0
    3     1   0   0   1   0   2   100
    4     0   0   1   1   0   1   50
    5     1   1   0   1   0   2   100
    P1    nq  eq  eq  eq  nq  SUM %
    P2                            1
    1     1   1   0   1   0   1   50
    2     0   0   0   0   0   0   0
    3     1   0   0   1   0   1   50
    4     0   0   0   1   0   0   0
    5     1   1   1   1   1   2   100
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Hi Dear Keath,

    Thanks lot for your help code is working like miracle and i am so happy for that. Eeverything is perfect it will be good if i get my results out in separate file instead of on commond line because i have to work even more on these results to complete my job 100% perfect.
    thanking you very much,

    with kind regards,
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Hi Dear Keath,

    I solved that issue getting results into output file and once again thanks very much for spending your valueable time to solve my problem.

    With kind Regards,
  20. #11
  21. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0

    need help in this code


    Dear Keath,
    Thank you very much for your reply and sorry for my late reply.
    After using that code on my realtime data with 253 columns i am getting output like this
    Lines 1 2
    P1 eqeqeqeqeqeqeqeqeqeq eq
    what i want to say is how all these EQ (almost 10 EQs) are sitting in column 2 instead of spreading in their respective columns according to our code but i am getting SUM and % are exactly in the way i am expecting. this is the code i changed according to my data
    code before
    Code:
     printf $format, 'P1', map (compare_alleles($p1[$_], $p2[$_]) ? 'eq' : 'nq', 1..5), 'SUM', '%';
        printf $format, 'P2', map('', 1..5), '', '1';
        @unequal = grep { not compare_alleles($p1[$_], $p2[$_]) } 1..5;
    Code:
     printf $format, 'P1', map (compare_alleles($p1[$_], $p2[$_]) ? 'eq' : 'nq', 1..253), 'SUM', '%';
        printf $format, 'P2', map('', 1..253), '', '1';
        @unequal = grep { not compare_alleles($p1[$_], $p2[$_]) } 1..253;
    i changed column number to 253 because i have 253 columns.
    I just want to know where it went wrong because i just changed only number of columns.
    Thanking you very much for your help.
    Regards,
    Genetist
  22. #12
  23. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,221
    Rep Power
    1809
    That didn't happen for me. The format stayed the same.

    Double check to see if you made some other typo.
  24. #13
  25. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Hi Dear Keath,
    Thank you very much for your reply. I am also not getting this type of error when i used 5 columns data but if i am using 253 columns data i am getting this above said error i just changed 5 to 253 (before it is 5 and in my real time data i have 253 columns) in this code
    Code:
    printf $format, 'P1', map (compare_alleles($p1[$_], $p2[$_]) ? 'eq' : 'nq', 1..253), 'SUM', '%';
        printf $format, 'P2', map('', 1..253), '', '1';
        @unequal = grep { not compare_alleles($p1[$_], $p2[$_]) } 1..253;
    Thanking you very much,
    Regards,
    Genetist

IMN logo majestic logo threadwatch logo seochat tools logo