1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Rep Power

    Filtering Out Duplicates then Unique

    Hi All,

    We are trying to filter an excel sheet with our data using PERL. The problem is that we are novices to programming.

    Here's what our data looks like:

    comp100126_c0_seq1 ENSTGUT00000011556
    comp10013_c0_seq1 ENSTGUT00000000063
    comp100132_c0_seq1 ENSTGUT00000003824
    comp100132_c0_seq1 ENSTGUT00000003824
    comp100145_c0_seq1 ENSTGUT00000010208
    comp10016_c0_seq1 ENSTGUT00000008432
    comp10017_c0_seq1 ENSTGUT00000012346
    comp10017_c0_seq1 ENSTGUT00000002652
    comp100173_c0_seq1 ENSTGUT00000000274
    comp10021_c0_seq1 ENSTGUT00000016972
    comp10021_c0_seq1 ENSTGUT00000011149

    Here's what we are trying to do. In column A, we want to pull out all the the rows that are duplicates - for example comp100132_c0_seq1. With that, we need all the associated columns pulled along with it (there's an additional 9 columns not pictured here). From there, once the duplicates are pulled out, we want the duplicates that hit a unique character in column B. For example comp10021_c0_seq1 hits 2 different characters and we want to pull that out.

    On a small scale we've been able to do this in Excel, but our file has over 200,000 rows and Excel cannot handle that large of a file on our computers.

    We thought about merging the two columns listed above before using them for programming, but we are not sure where to proceed.

    Any help would be greatly appreciated.

    Thank You!
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Paris area, France
    Rep Power
    Since you file appears to be pre-sorted, you could use the following algorithm (adapted from a module that I have written and plan to load some day on the CPAN, I hope that I haven't added a bug while adapting it):

    Perl Code:
    my $previous_key = "";
    my ($line, $previous_line);
    my $dup_flag = 0;
    while ($line = <$IN>) {
    	chomp $line;
    	my $key $ (split / /, $line, 2)[0]
    	if ($key eq $previous_key) {
    		# this line is a duplicate
    		print $DUPL $previous_line, "\n" unless $dup_flag;
    		print $DUPL $line, "\n";	
    		$dup_flag = 1;
    		$dupl_number ++;
    	} else {
    		print $OUT $previous_line;
    		$dup_flag = 0;
    	$previous_line = $line;
    	$previous_key = $key;
    print $OUT $line unless $dup_flag;

    However, the reason I have done this module it that I am using very large files for which the best method can't be used.

    The best and more general method is to read the file and load the comparison key into a hash for each line. If the key has already been seen, it is a duplicate. This method is described in the FAQ. With this method, the file needs not be sorted, but it cannot work if the file is so big that the keys will not fit in memory, which is why I had to use the other method described above. With 200,000 records, you don't have this problem.

IMN logo majestic logo threadwatch logo seochat tools logo