#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0

    Comparing and matching rows in a column with perl


    Hi to all,
    I have data like this
    DATA HAVE
    pop A B C D E
    P1 T/T C/C C/C T/T C/C
    P2 A/A G/G C/C T/T C/C
    1 A/A G/G C/C T/T C/C
    2 A/A G/G C/C T/T C/C
    3 A/T A/C A/G A/T A/C
    4 T/A T/G T/C T/A T/G
    5 G/A G/T G/C G/A G/T
    6 C/A C/T C/G C/A C/T
    pop A B C D E
    P1 T/T C/C C/C T/T C/C
    P2 A/A G/G C/C T/T C/C
    1 A/A G/G C/C T/T C/C
    2 A/A G/G C/C T/T C/C
    3 A/T A/C A/G A/T A/C
    4 T/A T/G T/C T/A T/G
    5 G/A G/T G/C G/A G/T
    6 C/A C/T C/G C/A C/T

    guidelines to work:
    1. first I want to convert all A/A to A, T/T to T, C/C to C, G/G to G, Z/Z to - and -/- to - and remaining characters with combination of A,T,G,C like A/T,G/T,C/G,T/C etc to H
    2. Now I want to know status from A to E by comparing P1 with P2, if P1=P2 then status from A to E is mono or any one of P1 or P2 contains Z/Z or -/- then status from A to E is mono else status from A to E is poly
    3. I want to match 1 in pop column with P2 in pop column for A to E, if 1 in pop column matches to p2 in pop column and its status is poly only then I would like to give 1 otherwise as such, if it is mono I do not want to do anything.
    4. Now I will calculate # 1s and # H's and finally I will calculate %sim with this formula =((#1*2+#H)/((#1+#H)*2))*100.
    5. I want to repeat the same procedure for second set of parents P1 and P2
    I tried this code for the first guideline:
    Code:
    #!/usr/bin/perl -w
    use strict;
    open(FILE, "<input.txt") || die "File not found";
    my @lines = <FILE>;
    my @newlines;
    foreach(@lines) {
       # The data uses a slash between alleles (A/A, not AA), so match it.
       $_ =~ s{A/A}{A}g;
       $_ =~ s{T/T}{T}g;
       $_ =~ s{G/G}{G}g;
       $_ =~ s{C/C}{C}g;
       $_ =~ s{A/T}{H}g;
       $_ =~ s{A/G}{H}g;
       $_ =~ s{A/C}{H}g;
       $_ =~ s{T/A}{H}g;
       $_ =~ s{T/G}{H}g;
       $_ =~ s{T/C}{H}g;
       $_ =~ s{G/A}{H}g;
       $_ =~ s{G/T}{H}g;
       $_ =~ s{G/C}{H}g;
       $_ =~ s{C/A}{H}g;
       $_ =~ s{C/T}{H}g;
       $_ =~ s{C/G}{H}g;
       $_ =~ s{Z/Z}{-}g;
       $_ =~ s{-/-}{-}g;
         
       push(@newlines,$_);
    }
    open(FILE, ">input1.txt") || die "File not found";
    print FILE @newlines;
    close(FILE);
    Code for 2nd guideline
    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    
    use Data::Dumper;
    
    
    *ARGV = *DATA{IO} unless @ARGV;
    
    my (@parents, @rows);
    
    sub {
        my $header = <>;
        push @parents, map [ split ' ', <> ], 1, 2;
        push @rows,    map [ split ' ', <> ], 1 .. 6; 
    }->() for 1, 2;
    
    for (map @$_, @parents, @rows) {
        s= ([ACTG]) / \1 =$1=x;
        s= ([-Z])   / \1 =-=x;
        s= .        / .  =H=x;
    }
    
    say join "\t", 'pop', ('A' .. 'E') x 2;
    
    print 'P1';
    for my $parent (0, 1) {
        print join "\t", q(), map {
            my $p1 = $parents[ $parent * 2 ][$_];
            my $p2 = $parents[ 1 + $parent * 2 ][$_];
            ($p1 eq $p2 or '-' eq $p1 or '-' eq $p2) ? 'mono' : 'poly';
        } 1 .. 5;
    }
    print "\n";
    
    
    __DATA__
    pop A   B   C   D   E
    P1  T/T C/C C/C T/T C/C
    P2  A/A G/G C/C T/T C/C
    1   A/A G/G C/C T/T C/C
    2   A/A G/G C/C T/T C/C
    3   A/T A/C A/G A/T A/C
    4   T/A T/G T/C T/A T/G
    5   G/A G/T G/C G/A G/T
    6   C/A C/T C/G C/A C/T
    pop A   B   C   D   E
    P1  T/T C/C C/C T/T C/C
    P2  A/A G/G C/C T/T C/C
    1   A/A G/G C/C T/T C/C
    2   A/A G/G C/C T/T C/C
    3   A/T A/C A/G A/T A/C
    4   T/A T/G T/C T/A T/G
    5   G/A G/T G/C G/A G/T
    6   C/A C/T C/G C/A C/T
    Can anyone help me proceed with the remaining guidelines? I know these codes could be combined into one single program, but as a newbie in Perl it looks difficult for me. Any help would be highly appreciated.
  2. #2
  3. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,258
    Rep Power
    1810
    I started to create a script to meet your requirements, but it isn't clear to me what you mean.

    Code:
    3.	I want to match 1 in pop column with P2 in pop column for A to E,
    	if 1 in pop column matches to p2 in pop column
    	and its status is poly only then I would like to give 1
    	otherwise as such, if it is mono I do not want to do anything.
    For one set of data, you want to only compare row 1 with row P2? And you want to give 1 where? In each matching field? Do you not want to do this for rows 2 through 6?

    Code:
    4.	Now I will calculate # 1s and # H's 
    	and finally I will calculate %sim with this formula
    	
    	=((#1*2+#H)/((#1+#H)*2))*100
    Number of 1s (that we just calculated by comparing row 1 and row P2) and number of H's where? H's in all rows? H's in just row 1?
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Dear Keath,
    Thank you very much for your reply. About the 3rd step: in the pop column, P1 and P2 are the parents, and the numbers from 1 to 6 are their progeny.
    pop A B C D E
    P1 POLY POLY MONO MONO MONO
    P2 A G C T C
    1 A G C T C
    2 A G C T C
    3 A G C - C
    4 H H H H H
    5 H H H H H
    6 H H H H H
    Now i tested parents and progeny with markers from A to E. Now i want to compare these progeny from 1 to 6 with the P2 (parent 2) across A to E. if result in my second step is Poly at P1 and my 1st progeny letter matches to the letter of P2 then i would like to give 1 otherwise P2 parent letter only, if result at P1 is mono i do not want to do anything.,
    Now i want like this
    pop A B C D E
    P1 POLY POLY MONO MONO MONO
    P2
    1 1 1 C/C T/T C/C
    2 1 1 C/C T/T C/C
    3 H H H H H
    4 H H H H H
    5 H H H H H
    6 H H H H H
    Here I have given 1 at A for the 1st progeny in the pop column: it means the status is poly at P1 and this allele (A) matches the P2 allele (A) at A. Likewise, 1 at B means the status is poly and its allele (G) matches the P2 allele (G). For columns C to E of the same (1st) progeny the status is mono at P1, so I do not want to do anything.
    Then I will count the total number of 1's and H's, and then I will apply this formula to get % SIM:

    =((No.Of.1's*2+No.Of.H's)/((No.Of.1's+No.Of.H's)*2))*100

    I hope I have explained it well. I could upload a sample file, but I do not know how to do that here. I really need to solve this problem, so please post here if anything is unclear.
    Once again thank you very much
    Kind Regards,
    Genetist
  6. #4
  7. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,258
    Rep Power
    1810
    Now I'm not sure I even got requirement 2 right.

    Based on this sample:
    Code:
    pop	A	B	C	D	E
    P1	POLY	POLY	MONO	MONO	MONO
    P2
    I guess that requirement 2 is a field by field comparison. Where you said "if P1=P2 then status from A to E is mono", you mean if P1:A = P2:A then status A is mono. If P1:B = P2:B then B is mono. Etc.

    And you didn't answer my question about the number of H's and number of 1s. Is the formula to be applied to each child, 1 through 6, or is this a grand total for the whole data set?

    Such a uniform number of children.

    Anyway, I'll try my best to interpret this.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Hi Keath,
    I am sorry for that.
    Yes, you are right, it is a field-by-field comparison:
    if P1=P2 or any one of P1 or P2 contains Z/Z or -/- across A to E then status from A to E is mono, else status from A to E is poly.

    2. After comparing the progeny in pop (1 to 5) with P2, I will get letters like 1, H and other characters across A to E. Now I want to count the total 1's and H's for pop (1 to 5) across A to E, and then I will apply that formula for the %sim calculation.
    If it is no problem for you, I can send you an Excel sheet explaining everything clearly step by step, if you provide me your ID or other contact.
    Thank you very much

    Regards,
    Genetist
  10. #6
  11. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,258
    Rep Power
    1810
    Using your sample data:

    Code:
    pop A   B   C   D   E
    P1  T/T C/C C/C T/T C/C
    P2  A/A G/G C/C T/T C/C
    1   A/A G/G C/C T/T C/C
    2   A/A G/G C/C T/T C/C
    3   A/T A/C A/G A/T A/C
    4   T/A T/G T/C T/A T/G
    5   G/A G/T G/C G/A G/T
    6   C/A C/T C/G C/A C/T
    Translates to:
    Code:
    pop	A	B	C	D	E
    P1	POLY	POLY	MONO	MONO	MONO
    P2	A	G	C	T	C
    1	A	G	C	T	C
    2	A	G	C	T	C
    3	H	H	H	H	H
    4	H	H	H	H	H
    5	H	H	H	H	H
    6	H	H	H	H	H
    I believe.

    Then you provided this:
    Now i tested parents and progeny with markers from A to E. Now i want to compare these progeny from 1 to 6 with the P2 (parent 2) across A to E. if result in my second step is Poly at P1 and my 1st progeny letter matches to the letter of P2 then i would like to give 1 otherwise P2 parent letter only, if result at P1 is mono i do not want to do anything.,
    Now i want like this
    Code:
    pop	A	B	C	D	E
    P1	POLY	POLY	MONO	MONO	MONO
    P2	
    1	1	1	C/C	T/T	C/C
    2	1	1	C/C	T/T	C/C
    3	H	H	H	H	H
    4	H	H	H	H	H
    5	H	H	H	H	H
    6	H	H	H	H	H
    Since columns A and B are poly, why did you not set the values in rows 3-6 to the values of P2 for A and B?

    Is this not what is wanted:
    Code:
    pop	A	B	C	D	E
    P1	POLY	POLY	MONO	MONO	MONO
    P2	
    1	1	1	C/C	T/T	C/C
    2	1	1	C/C	T/T	C/C
    3	A	G	H	H	H
    4	A	G	H	H	H
    5	A	G	H	H	H
    6	A	G	H	H	H
  12. #7
  13. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,258
    Rep Power
    1810
    Well anyway, here's what I hacked together as best I could understand your requirements.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Array::Utils qw/array_minus/;
    
    my  %enc = (
    	'A/A' => 'A',
    	'T/T' => 'T',
    	'C/C' => 'C',
    	'G/G' => 'G',
    	'Z/Z' => '-',
    	'-/-' => '-'
    );
    
    my @valid = qw/A T G C/;
    
    my (@group, $ref);
    
    while (<DATA>) {
    	chomp;
    	my ($pop, @data) = split /\s+/;
    	
    	if ($pop eq 'pop') {
    		# reset group here
    		push @group, {};
    		$ref = $group[$#group];
    		next;
    	} else {
    		# convert the fields
    		foreach my $field (@data) {
    			my $valid = valid_pair($field);
    			die "unrecognized field: $field" unless $valid;
    			$field = $valid;
    		}
    	}
    	
    	$ref->{$pop} = \@data;
    }
    
    
    # 5. I want to repeat the same procedure for second set of parents P1 and P2
    
    foreach my $set (@group) {
    	set_status($set);
    	progeny_to_p2_comparison($set);
    	present_data($set);
    }
    
    =pod
    
    2.	I want to know status from A to E by comparing P1 with P2,
    	if P1=P2 then status from A to E is mono 
    	or any one of P1 or P2 contains Z/Z or -/- then status from A to E is mono
    	else status from A to E is poly
    
    =cut
    
    
    sub set_status {
    	my ($s1) = @_;
    	
    	my $p1 = $s1->{'P1'};
    	my $p2 = $s1->{'P2'};
    	
    	my $n = @$p1;
    	for (my $i=0; $i<$n; $i++) {
    		if ($p1->[$i] ne $p2->[$i] and $p1->[$i] ne '-' and $p2->[$i] ne '-') {
    			$s1->{'status'}[$i] = 'poly';
    		} else {
    			$s1->{'status'}[$i] = 'mono';
    		}
    	}
    }
    
    =pod
    
    3.	I want to match 1 in pop column with P2 in pop column for A to E,
    	if 1 in pop column matches to p2 in pop column
    	and its status is poly only then I would like to give 1
    	otherwise as such, if it is mono I do not want to do anything.
    
    	i want to compare these progeny from 1 to 6 with the P2 (parent 2)
    	across A to E.
    	
    	if result in my second step is Poly at P1
    	and my 1st progeny letter matches to the letter of P2
    	then i would like to give 1
    	otherwise P2 parent letter only
    	
    	if result at P1 is mono i do not want to do anything.
    	
    =cut
    
    
    sub progeny_to_p2_comparison {
    	my ($s1) = @_;
    	
    	die "status not yet determined for data set" unless exists $s1->{'status'};
    	
    	my $p2 = $s1->{'P2'};
    	my $status = $s1->{'status'};
    	my $n = @$p2;
    	
    	foreach my $c (1..6) {
    		my $child = $s1->{$c};
    		
    		for (my $i=0; $i<$n; $i++) {
    			if ($status->[$i] eq 'mono') {
    				$s1->{'compare'}[$c-1][$i] = $child->[$i];
    			} else {
    				$s1->{'compare'}[$c-1][$i] = ($p2->[$i] eq $child->[$i]) ? 1 : $p2->[$i];
    			}
    		}
    	}
    }
    
    
    =pod
    
    4.	Now I will calculate # 1s and # H's 
    	and finally I will calculate %sim with this formula
    	
    	=((#1*2+#H)/((#1+#H)*2))*100
    
    	Now i want like this
    
    	pop	A		B		C		D		E
    	P1	POLY	POLY	MONO	MONO	MONO
    	P2	
    	1	1		1		C/C		T/T		C/C
    	2	1		1		C/C		T/T		C/C
    	3	H		H		H		H		H
    	4	H		H		H		H		H
    	5	H		H		H		H		H
    	6	H		H		H		H		H
    
    	Here i given 1 at A for 1st progeny in POP column 
    	it means 1st progeny status is poly at P1 and this 
    	allele (A) is matching to P2 allele (A) at A like this 1 at B
    	it means 1st progeny status is poly and its allele(G)
    	matching to P2 (G) allele but for the same (1st progeny) status is mono
    	at P1 so i do not want to do anything.
    	
    	Then i will count total number of 1's and H's 
    	and then i will apply this formula to get % SIM
    
    =cut
    
    
    sub calculate_sim {
    	my ($s1) = @_;
    	
    	die "data not yet available for data set" unless exists $s1->{'compare'};
    	
    	my $ones = number_of_ones($s1);
    	my $h 	 = number_of_h($s1);
    	
    	my $sim =(($ones*2+$h)/(($ones+$h)*2))*100;
    	return $sim;
    }
    
    sub number_of_ones {
    	my ($s1) = @_;
    	
    	my $compare = $s1->{'compare'};
    	my $number;
    	
    	foreach my $c (@$compare) {
    		$number += grep {$_ eq '1'} @$c;
    	}
    	
    	return $number;
    }
    
    sub number_of_h {
    	my ($s1) = @_;
    		
    	my $compare = $s1->{'compare'};
    	my $number;
    	
    	foreach my $c (@$compare) {
    		$number += grep {$_ eq 'H'} @$c;
    	}
    	
    	return $number;
    }
    
    sub present_data {
    	my ($s1) = @_;
    	
    	my @order = qw/P1 P2/;
    	foreach my $n (@order){
    		print "$n:\t", join("\t", @{$s1->{$n}}), "\n";
    	}
    	print "ST:\t", join("\t", @{$s1->{'status'}}), "\n";
    	
    	my $n = 1;
    	foreach my $c (@{$s1->{'compare'}}) {
    		print $n++.":\t", join("\t", @$c), "\n";
    	}
    	
    	my $sim = calculate_sim($s1);
    	print "Sim: $sim\n";
    	
    	print '=' x 44, "\n";
    }
    
    sub valid_pair {
    	my ($p) = @_;
    	
    	return $enc{$p} if exists $enc{$p};
    	
    	my @pair = split '/', $p, 2;
        my @minus = array_minus(@pair, @valid);
    	return 0 if @minus;
    	return 'H';
    }
    
    __DATA__
    pop	A	B	C	D	E
    P1	T/T	C/C	C/C	T/T	C/C
    P2	A/A	G/G	C/C	T/T	C/C
    1	A/A	G/G	C/C	T/T	C/C
    2	A/A	G/G	C/C	T/T	C/C
    3	A/T	A/C	A/G	A/T	A/C
    4	T/A	T/G	T/C	T/A	T/G
    5	G/A	G/T	G/C	G/A	G/T
    6	C/A	C/T	C/G	C/A	C/T
    pop	A	B	C	D	E
    P1	T/T	C/C	C/C	T/T	C/C
    P2	A/A	G/G	C/C	T/T	C/C
    1	A/A	G/G	C/C	T/T	C/C
    2	A/A	G/G	C/C	T/T	C/C
    3	A/T	A/C	A/G	A/T	A/C
    4	T/A	T/G	T/C	T/A	T/G
    5	G/A	G/T	G/C	G/A	G/T
    6	C/A	C/T	C/G	C/A	C/T
    If I were to continue working on this, I would definitely change the data set into an object. All the calculations and methods are intrinsic to the data set, so there is no need to pass the information around as I am doing here.

    The procedural version here is easier to install, but has no other advantage. With an object, all calculations could be performed at initialization, making the code much easier to read.
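
    As a rough illustration of the object idea, here is a minimal sketch. The package name GenotypeSet and its methods are hypothetical and not part of the script above; it assumes the rows have already been encoded to single letters, and it only covers the status calculation (step 2).

    ```perl
    use strict;
    use warnings;

    # Hypothetical package name and methods; not from the original script.
    package GenotypeSet;

    sub new {
        my ($class, %rows) = @_;    # rows: P1, P2 (and children), already encoded
        my $self = bless { rows => {%rows} }, $class;
        $self->_set_status;         # derived data is computed once, at construction
        return $self;
    }

    # Same rule as set_status: poly only if the parents differ and neither is '-'.
    sub _set_status {
        my ($self) = @_;
        my ($p1, $p2) = @{ $self->{rows} }{qw/P1 P2/};
        my @status;
        for my $i (0 .. $#$p1) {
            my $poly = $p1->[$i] ne $p2->[$i] && $p1->[$i] ne '-' && $p2->[$i] ne '-';
            push @status, $poly ? 'poly' : 'mono';
        }
        $self->{status} = \@status;
    }

    # Accessor: returns the per-column status as a list.
    sub status { @{ $_[0]{status} } }

    package main;

    my $set = GenotypeSet->new(
        P1 => [qw/T C C T C/],
        P2 => [qw/A G C T C/],
    );
    print join("\t", $set->status), "\n";   # poly poly mono mono mono
    ```

    The comparison and %sim steps could be added as further methods in the same way, so the caller never has to pass the hash around.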
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Dear Keath,
    Thank you very much for your help and for spending your valuable time on my problem. I had a look at the code you provided. I think I am still getting a problem in the 3rd step. I know this is purely my fault, because I was unable to explain it better; I am 100% sure you can help me if I explain my problem well.
    this is my 3rd step
    3. I want to match 1 in pop column with P2 in pop column for A to E, if 1 in pop column matches to p2 in pop column and its status is poly only then I would like to give 1 otherwise as such, if it is mono I do not want to do anything.

    my data is like this
    pop A B C D E
    P1 POLY POLY MONO MONO MONO
    P2 A G C T C
    1 A G C T C
    Now I want to compare the allele (A) of 1 (bolded) in the pop column
    with the allele (A) of P2 (bolded) for column A (bolded). According to my requirement, the allele (A) of 1 in the pop column matches the P2 allele (A) in column A and the status of column A is POLY at P1, so I would like to give 1. If status at P1 for the column is poly and the allele of 1 in the pop column is not matching, I would like to give the same letter (A or G or T or C what ever it is present there), like the letter present for 1 for columns C to E. If status is mono at P1, we should not do anything. I will continue like this for 1 in the pop column across the other columns also (B to E).
    Expected outcome is like this
    A B C D E
    POLY POLY MONO MONO MONO

    1 1 C T C
    1 1 C T C
    H H H H H
    H H H H H
    H H H H H
    H H H H H
    But your subroutines for calculating the number of 1's and H's are working perfectly. I found everything is OK except the 3rd step.
    I am really very thankful to you for solving this issue.
    Thanks in advance,
    Regards,
    Genetist
  16. #9
  17. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,258
    Rep Power
    1810
    Sorry, I don't understand your requirement still. I'll tell you where the language is not clear:

    In a previous post, you said this about step 3 (in answer to a question I asked):
    i want to compare these progeny from 1 to 6 with the P2 (parent 2)
    across A to E.
    So I thought you wanted to check each row: 1 through 6.

    Now look at your comment today. There is no mention of any other row. Only comparison of row P2 to 1. So to be clear, do you want the routine repeated for each 'progeny' row, or are you telling me to only compare row 1?

    ---

    I'm trying to read your comments very carefully and see where I may be misunderstanding. The only obvious difference in the output from what you wanted is that rows 2 through 6 have a different output in columns A and B.

    If you look about two posts ago, I asked you very specifically about that (post #6) . The guidance you gave was:

    if result in my second step is Poly at P1 and my 1st progeny letter matches to the letter of P2 then i would like to give 1 otherwise P2 parent letter only
    So the result should either be a 1, or the value that was in P2. If you give the value of P2, then you get the result I provided.

    Today you said this:
    if status at P1 for column is poly and allele of 1 in pop column is not matching i would like to give the same letter (A or G or T or C what ever it is present there) like letter present for 1 for columns C to E.
    "what ever it is present there" is not clear. Whatever is present in P2, or the row being tested?

    The only change that needs to be made would be to change one line in the progeny_to_p2_comparison subroutine.
    Code:
    $s1->{'compare'}[$c-1][$i] = ($p2->[$i] eq $child->[$i]) ? 1 : $p2->[$i];
    change to

    Code:
    $s1->{'compare'}[$c-1][$i] = ($p2->[$i] eq $child->[$i]) ? 1 : $child->[$i];
    Last edited by keath; December 18th, 2013 at 09:44 AM.
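
    For reference, the revised step-3 rule can be tried in isolation. The following is only a sketch: compare_row is a hypothetical helper, not a subroutine from the script above, and it assumes the status and P2 rows have already been computed and encoded.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: applies the step-3 rule to one progeny row.
    sub compare_row {
        my ($status, $p2, $child) = @_;
        my @out;
        for my $i (0 .. $#$child) {
            if ($status->[$i] eq 'mono') {
                push @out, $child->[$i];    # mono: leave the value as it is
            } elsif ($p2->[$i] eq $child->[$i]) {
                push @out, 1;               # poly and matches P2: give 1
            } else {
                push @out, $child->[$i];    # poly, no match: keep the child's letter
            }
        }
        return @out;
    }

    my @status = qw/poly poly mono mono mono/;
    my @p2     = qw/A G C T C/;
    print join("\t", compare_row(\@status, \@p2, [qw/A G C T C/])), "\n";   # 1 1 C T C
    print join("\t", compare_row(\@status, \@p2, [qw/H H H H H/])), "\n";   # H H H H H
    ```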
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2013
    Posts
    15
    Rep Power
    0
    Dear Keath,
    Thank you very much for your help and patience in helping me. Your change in this
    Code:
    $s1->{'compare'}[$c-1][$i] = ($p2->[$i] eq $child->[$i]) ? 1 : $child->[$i];
    is working perfectly, in the way I want; many thanks for this effort.
    For example, now I want to calculate the number of 1's and H's for each of the children from 1 to 6 in the pop column, from A to E:
    P1: T C C T C
    P2: A G C T C
    ST: poly poly mono mono mono
    1: 1 1 C T C
    2: 1 1 C T C
    3: H H H H H
    4: H H H H H
    5: H H H H H
    6: H H H H H
    For example, for 1 (bolded) in the pop column, the number of 1's is 2 and the number of H's is 0 across A to E (bolded). Like this, I want to calculate the number of 1's and H's for 1 to 6 in the pop column across A to E.
    Now my data will look like this after counting the number of 1's and H's:
    P1: T C C T C
    P2: A G C T C
    ST: poly poly mono mono mono #1's #H's
    1 1 1 C T C 2 0
    2 1 1 C T C 2 0
    3 H H H H H 0 5
    4 H H H H H 0 5
    5 H H H H H 0 5
    6 H H H H H 0 5

    After counting 1's and H's for all the children from 1 to 6 in the pop column across A to E, I will calculate % sim using this formula:
    =((No.Of.1's*2+No.Of.H's)/((No.Of.1's+No.Of.H's)*2))*100
    After this step I will get data like this:
    POP #1's #H's %sim
    1 2 0 100
    2 2 0 100
    3 0 5 50
    4 0 5 50
    5 0 5 50
    6 0 5 50

    Then I will have everything as per my requirement.
    Thank you very much for your help and patience with my problem.
    Thanking you very much,
    with kind regards,
    Genetist
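
    The per-child table requested above could be produced along these lines. This is only a sketch: the compared rows are hard-coded from the post rather than read from a file, row_sim is a hypothetical helper (not one of the subroutines in keath's script), and printing NA for a row with no 1's and no H's is an assumption, since the thread does not say what %sim should be in that case.

    ```perl
    use strict;
    use warnings;

    # Compared rows as shown in the post above (children 1 to 6, columns A to E).
    my @compare = (
        [qw/1 1 C T C/],
        [qw/1 1 C T C/],
        [qw/H H H H H/],
        [qw/H H H H H/],
        [qw/H H H H H/],
        [qw/H H H H H/],
    );

    # Hypothetical helper: counts 1s and H's in one row and applies the %sim formula.
    sub row_sim {
        my ($row) = @_;
        my $ones = grep { $_ eq '1' } @$row;
        my $h    = grep { $_ eq 'H' } @$row;
        # Assumption: with no 1s and no H's the formula would divide by zero,
        # so %sim is left undefined for that row.
        my $sim = ($ones + $h)
            ? (($ones * 2 + $h) / (($ones + $h) * 2)) * 100
            : undef;
        return ($ones, $h, $sim);
    }

    print join("\t", 'POP', "#1's", "#H's", '%sim'), "\n";
    my $n = 1;
    for my $row (@compare) {
        my ($ones, $h, $sim) = row_sim($row);
        print join("\t", $n++, $ones, $h, defined $sim ? $sim : 'NA'), "\n";
    }
    ```

    With the sample rows this gives 2, 0, 100 for children 1 and 2, and 0, 5, 50 for children 3 to 6, matching the table in the post. To use it inside keath's script, the same loop could run over @{$s1->{'compare'}} in present_data instead of the hard-coded array.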
