#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0

Reading a file with tab-separated columns, and pulling specific columns to a new file


I have a text file with columns separated by tabs. I need to go through, find the columns with headers matching what I need (I know the names of the columns I need), then take those columns and put them in a new file.

    So for example

    ColumnA ColumnB ColumnC ColumnD
    Data. Data. Data. Data



I need to pull ColumnA and ColumnD, along with the data in those columns, and put them in a new file. How would I go about doing this? I have managed to open the file in Perl, and I can create a file. But other than that, I can't seem to figure out how to go through the columns and get the ones I need.

I'm using ActivePerl on Windows right now, but am trying to switch to Linux.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    Hi,

    you should split the lines on tabs:

    Perl Code:
    chomp $line; # remove trailing end of line character
    my @fields = split /\t/, $line;


Then, when reading the header line, check whether each field is one of the columns that you need, and record in an array the list of subscripts corresponding to the data you need.

Then split the data lines the same way as above and use these subscripts to fetch the parts that you need.
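For example, building that list of subscripts from the header could look like this (just a sketch; the header string and wanted names below are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical header line and wanted column names (for illustration only)
my $header = "ColumnA\tColumnB\tColumnC\tColumnD";
my @wanted = qw/ColumnA ColumnD/;

my @names = split /\t/, $header;

# Record the subscript of every column whose name is in @wanted
my @col_to_keep;
for my $i (0 .. $#names) {
    push @col_to_keep, $i if grep { $_ eq $names[$i] } @wanted;
}

print "@col_to_keep\n";    # prints "0 3"
```

You can then use @col_to_keep as an array slice on each split data line.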
  4. #3
  5. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,254
    Rep Power
    1810
Assuming the fields are tab-delimited:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Data::Dumper;
    
    #make a list of the fields you want
    my @wanted_fields = qw/ColumnA ColumnC/;
    
    # First line in the data file is the header
    # retrieve it and turn it into an array
    my @fields = split /\t/, <DATA>;
    chomp @fields;
    #print Dumper \@fields;
    
    while (<DATA>) {
    	chomp;
    	
    	# Read each remaining line from the file and turn it into a hash
    	# assign each field to its matching header name
    	my %row;
    	@row{@fields} = split /\t/;
    	#print Dumper \%row;
    	
    	# use map to find each wanted field in the hash
    	# for the matching data on this row
    	my @wanted_data = map {$row{$_}} @wanted_fields;
    	print join("\t", @wanted_data), "\n";
    }
    
    __DATA__
    ColumnA	ColumnB	ColumnC	ColumnD
    DataA	DataB	DataC	DataD
    DataA	DataB	DataC	DataD
    DataA	DataB	DataC	DataD
    DataA	DataB	DataC	DataD
    DataA	DataB	DataC	DataD
    Last edited by keath; May 5th, 2013 at 08:42 AM.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
Wow, that seems to do the trick. Now I have to get it to do that with an opened text file, then print the output to a new file. That shouldn't be hard at all.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    I think I may have broken it.

    My script is


    Code:
    #File Iterator Version 2.5
    
    #DEVSHED FORUMS CODE USED
    
    
    #!/usr/bin/perl
    
    
    use strict;
    use warnings;
    use Data::Dumper;
    
    
    my $file = $ARGV[0];
    
    open(FH, "< $file") or die "Cannot open $file for reading: $!";
    my @lines;
    while(<FH>)
    {
    	push (@lines, $_);
    }
    
    close FH or die "Cannot close $file: $!";
    
    
    #Create new file for writing;
    open(my $OFILE, '>Output.txt') or die "Cannot create file for output: $!";
    
    #List of Wanted Columns, and respective outputs for these columns
    my @wanted_fields = qw/ACC# NAME "Gene Symbol" MOD_TYPE RSD/;
    my @output_fields = qw/Acc Name Symbol Type Residue/;
    
    #Retrieve wanted fields from first line, as it is the header
    my @fields = split /\t/, @lines;
    chomp @fields;
    #print Dumper \@fields;
    
    
    #Print the headers to the output file
    print $OFILE join("\t",@output_fields), "\n";
    
    
    #Iterate Through the remainder of the @lines array and add the wanted columns to the text file
    
    while(@lines)
    {
    	chomp;
    	
    	#Read each line from the array and turn it into a hash.
    	#assign each column to matching header.
    	my %row;
    	@row{@fields} = split /\t/;
    	#print Dumper \%row
    	
    	#Use map to find wanted column in the hash for matching data on row
    	my @wanted_data = map{$row{$_}} @wanted_fields;
    	print $OFILE join("\t", @wanted_data), "\n";
    }
    
    close $OFILE or die "Error closing $OFILE: S!";

But instead of writing the proper data to the file, it hangs, writing endless streams of data into the text file. I got it to work correctly by copying the data from the file into the __DATA__ section of the original script, but for obvious reasons this is not the solution. What am I doing wrong?

I think it might be because the file has two extra lines before the header. How would I remove the first two lines, and then read the file?
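One way I can think of (just a sketch, with made-up data and an in-memory handle so it runs on its own) is to read and throw away the leading lines before touching the header:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up file contents: two junk lines, then the header, then data
my $data = "junk line 1\njunk line 2\nColumnA\tColumnB\nDataA\tDataB\n";
open my $fh, '<', \$data or die "Cannot open: $!";

<$fh> for 1 .. 2;    # read and throw away the first two lines

my @fields = split /\t/, scalar <$fh>;    # now this really is the header
chomp @fields;
print "@fields\n";    # prints "ColumnA ColumnB"
```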

    EDIT: When I try and read the file, I get "Uninitialized value $_ used in scalar chomp"
  10. #6
  11. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,254
    Rep Power
    1810
    You read all the lines into memory, which I don't recommend, but fine.

    Then this:

    Code:
    while(@lines)
    {
       ...
    }
That loops while @lines is true.

What's going to make @lines not be true? Only the array being empty.

Since you never pull lines out of the array, it never empties, so the loop runs forever.

    If you are going to do it with an array, use a for loop.
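A minimal sketch of that (stand-in data, not your actual script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for lines already read into memory
my @lines = ("a1\tb1\n", "a2\tb2\n");

# foreach visits every element without removing anything,
# so it terminates after the last line
foreach my $line (@lines) {
    chomp(my $copy = $line);    # chomp a copy; $line is an alias into @lines
    my @fields = split /\t/, $copy;
    print join("\t", @fields), "\n";
}
# @lines still holds both lines here, unlike a shift-based while loop
```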

    Comments on this post

    • Laurent_R agrees
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
I managed to fix it. It works (at least for one file).

    Code:
    #!/usr/bin/perl
    #use Data::Dumper
    
    $file = @ARGV[0];
    open(FH, "< $file") or die "Cannot open $file for reading: $!";
    my @array = <FH>;
    close FH or die "Could not open file: $!";
    
    open(OUT, ">$file") or die "Cannot open $file for write access: $!";
    print OUT splice(@array,3);
    close OUT or die "Could not close file: $!";
    
    open(MYFILE,"< $file") or die "Cannot open $file for read access: $!";
    
    #Create new file for writing;
    open(my $OFILE, '>Output.txt') or die "Cannot create file for output: $!";
    
    #List of Wanted Columns, and respective outputs for these columns
    my @wanted_fields = ("ACC#", "NAME", "MOD_TYPE", "Gene Symbol");
    my @output_fields = qw/Acc Name Type Symbol/;
    
    #Retrieve wanted fields from first line, as it is the header
    my @fields = split /\t/, <MYFILE>;
    chomp @fields;
    #print Dumper \@fields;
    
    
    #Print the headers to the output file
    print $OFILE join("\t",@output_fields), "\n";
    
    
    
    while(<MYFILE>)
    {
    	chomp;
    	
    	#Read each line from the array and turn it into a hash.
    	#assign each column to matching header.
    	my %row;
    	@row{@fields} = split /\t/;
    	
    	#Use map to find wanted column in the hash for matching data on row
    	my @wanted_data = map{$row{$_}} @wanted_fields;
    	print $OFILE join("\t", @wanted_data), "\n";
    	
    }
    
    close $OFILE or die "Error closing $OFILE: $!";
The only issue is that some of the other files I have to read are just a little bit different (Acc as opposed to ACC#, or even ACCESSION). Can I have the code that looks for the column look for a partial match, or do I have to write separate code for each one? It wouldn't be too hard to write separate scripts and then have a bootstrapper that calls the right one depending on the file name (if I can determine the file name).

Thanks for all the help you have given me. I'm new to Perl, coming from C#, so it's a bit hard.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    830
    Rep Power
    496
    I totally agree with Keath. Just a few additional or complementary remarks on your code.

    Perl Code:
    while(<FH>)
    {
    	push (@lines, $_);
    }


Avoid doing that unless you have a very good reason: there is no point in your case in storing your whole file in an array when you can simply iterate over the lines one by one. You are needlessly using a lot of memory, and you also lose performance by copying every piece of data twice. If your file gets really big, you'll simply run out of memory.

But if you really have to load your file into an array (sometimes you do, because you need to go back and forth in the data, for example, or update it several times before you are ready to print it out, or you need to sort it before proceeding), then try to do it in a more Perlish way:

    Perl Code:
    my @lines = <FH>;


In your case, only the first line (the header line) deserves special processing, because you need to parse it to figure out which fields you will keep in the rest of the data; after that, you just go through the lines one by one.

    So it could be:

    Perl Code:
    my $header = <FH>; #get the first line
my @fields_names = split /\t/, $header;
    #...


    Or even:

    Perl Code:
my @fields_names = split /\t/, <FH>;


Then you need a bit of work, some simple or nested foreach or map commands, to list in an array the fields that you need to keep or print out. There are numerous ways to do that; I'll leave it to you to find out (but please don't hesitate to ask if you don't succeed).

Once you have figured out the array of subscripts of the fields you need, say @col_to_keep, which could be something like (1, 4, 3), just read the rest of the file and do something like:

    Perl Code:
    while (my $line = <FH>) {
         my @splitted_line = split /\t/, $line;
         my @output = @splitted_line[@col_to_keep];
         print OUT "@output";
    }


    or, simpler:

    Perl Code:
    print OUT join " ", (split /\t/, $_)[@col_to_keep] while <FH>;


Of course, I won't come back to the error of using a while on the array pointed out by Keath. Or maybe I will: you could do it if your loop body removed the element just read from the array (for example with a shift). For example,

    Perl Code:
    while (my $line = shift @lines) { #...


should work fine, because shift will progressively deplete @lines until it becomes empty, at which point the while condition will fail and the loop will stop.

But the really good way of scanning through the elements of the array is:

    Perl Code:
    foreach (@lines) { # do something with $_


    or

    Perl Code:
    for (@lines) { # do something with $_


(for and foreach are exactly equivalent in this context, but I personally tend to prefer foreach because it conveys quite well in English the idea that you will visit each element once. But I also use for, especially when, for some reason, I wish to have a slightly more compact syntax.)

    Comments on this post

    • keath agrees : I prefer foreach also
  16. #9
  17. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,254
    Rep Power
    1810
    Alright. Sorry for leading you astray with that hash and header stuff. I like to use any data given to me, and I like to name my fields when possible.

    If the position of the wanted columns is consistent across files, but the names are different, you can use numbers instead and an array slice to get the data you want.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Data::Dumper;
    
    # make a list of the fields you want
    # first field index 0
    my @wanted_fields = qw/0 2/;
    
    # no need to open file listed as argument
    # will be read from magic <> operator
    
    #Create new file for writing;
    open my $outfh, ">", 'Output.txt' or die "Cannot create output file: $!";
    
    # pull first line (header) out of file. Not using it.
    my $header = <>;
    
    while (<>) {
    	chomp;
    	my @row = split /\t/;
    	#print Dumper \@row;
    	print  $outfh join("\t", @row[@wanted_fields]), "\n";
    }
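If the column positions are not stable either, another sketch (the name variants and patterns below are assumptions, not from your files) is to match the header names by pattern, so "Acc", "ACC#" and "ACCESSION" all select the same column:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical patterns and header for illustration
my @patterns = (qr/^acc/i, qr/^name$/i);
my @names    = split /\t/, "ACCESSION\tNAME\tMOD_TYPE";

# Keep the index of every header that matches one of the patterns
my @wanted_fields;
for my $i (0 .. $#names) {
    push @wanted_fields, $i if grep { $names[$i] =~ $_ } @patterns;
}

print "@wanted_fields\n";    # prints "0 1"
```

You could then use @wanted_fields exactly like the hard-coded index list above.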
  18. #10
  19. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2013
    Posts
    18
    Rep Power
    0
    Originally Posted by keath
    Alright. Sorry for leading you astray with that hash and header stuff. I like to use any data given to me, and I like to name my fields when possible.

    If the position of the wanted columns is consistent across files, but the names are different, you can use numbers instead and an array slice to get the data you want.

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Data::Dumper;
    
    # make a list of the fields you want
    # first field index 0
    my @wanted_fields = qw/0 2/;
    
    # no need to open file listed as argument
    # will be read from magic <> operator
    
    #Create new file for writing;
    open my $outfh, ">", 'Output.txt' or die "Cannot create output file: $!";
    
    # pull first line (header) out of file. Not using it.
    my $header = <>;
    
    while (<>) {
    	chomp;
    	my @row = split /\t/;
    	#print Dumper \@row;
    	print  $outfh join("\t", @row[@wanted_fields]), "\n";
    }
Haha, you're fine. I appreciate the help. Like I said, I managed to get it working with some tweaks. I think the files have the same header order, but I'm not sure, and I wouldn't know without looking at the files first. I think this script is going to run on a computing cluster and should be automated, so hopefully the header order is the same. Again, thanks for the help and tips.
