#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    5
    Rep Power
    0

    Parse tab delimited text file


    Hi,

    I am trying to parse a tab delimited text file using Perl in order to detect any missing values.

    Below is just an example file. The real file may have any number of columns and rows.

    Also, in the case of ROW2, it has a fourth tab but no value. ROW3 has no tabs after the 'w' value for COLUMN1. I.e. some columns may have undefined values or blank values.

    Code:
    #IGNORE COLUMN1 COLUMN2 COLUMN3 COLUMN4
    ROW1    x   y   z   a
    ROW2    b   c   d   
    ROW3    w

    I want to be able to detect whether a particular ROWn has a value missing for any COLUMNn.

    I.e. I want to be able to detect that ROW3 has missing values for COLUMN2, COLUMN3 and COLUMN4. And that ROW2 has a missing value for COLUMN4.


    When I say a value is missing, I mean the field has no non-blank content, i.e. it is absent, empty, or contains only whitespace.

    I've been trying to parse this file without any luck so far.

    Any help is appreciated!
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    You need to post your script and any errors/warnings that it produces and explain how the output differs from what you expect.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    5
    Rep Power
    0
    I've made a start using Text::CSV, but I can't seem to access the individual row/column values to check whether they have a value or not. At the moment, I just get the entire file in a single array element -
    Code:
    say @$row[0];
    I am struggling to check whether a given row has a tab-separated text value for every column.


    Code:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use v5.12;
    use Text::CSV;
    
    my $csv = Text::CSV->new ({
         escape_char         => '"',
         sep_char            => '\t',
         eol                 => $\,
         binary              => 1,
         blank_is_undef      => 1,
         empty_is_undef      => 1,
         });
    
    open (my $file, "<", "tabfile.txt") or die "cannot open: $!";
    while (my $row = $csv->getline ($file)) {
        say @$row[0];
    }
    close($file);
    I want to be able to determine that ROW3 has values missing for COLUMN2, COLUMN3 and COLUMN4 and that ROW2 has a value missing for COLUMN4.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    $\ is the OUTPUT_RECORD_SEPARATOR and its default value is undef, which is why your entire file was loaded into the first element.

    $/ is the INPUT_RECORD_SEPARATOR and is what you probably meant to use. In cases like this I prefer to be more explicit and use "\n" instead of $/.
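
    A quick way to check the defaults of the two variables, as a minimal sketch (the describe helper is made up for illustration and is not part of any module):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: render a separator value in a printable form.
    sub describe {
        my ($value) = @_;
        return 'undef' unless defined $value;
        return $value eq "\n" ? '"\n"' : 'other';
    }

    # $\ is appended after every print; $/ marks where a line read ends.
    print 'OUTPUT_RECORD_SEPARATOR: ', describe($\), "\n";
    print 'INPUT_RECORD_SEPARATOR:  ', describe($/), "\n";
    ```

    Run in a fresh script, this should report undef for the output separator and a newline for the input separator, which is why passing "\n" explicitly is the less surprising choice.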

  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    5
    Rep Power
    0
    Originally Posted by FishMonger
    $\ is the OUTPUT_RECORD_SEPARATOR and its default value is undef, which is why your entire file was loaded into the first element.

    $/ is the INPUT_RECORD_SEPARATOR and is what you probably meant to use. In cases like this I prefer to be more explicit and use "\n" instead of $/.
    Hey fishmonger,

    Thanks for your reply, but after changing the option to
    Code:
    eol=>$/
    I still get the entire file output and therefore can't seem to perform any level of parsing that I would like to as mentioned in my first post. Is Text::CSV even the way to go?
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    Yes, Text::CSV is the proper module to use, but you could also parse it manually.

    Are you saying that after changing $\ to $/, you're still getting the entire file in @$row[0]?

    If that's true, then it sounds like your file doesn't have any \n chars.

    You may want to use a hex editor to view the file to confirm the line endings.
  12. #7
  13. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    5
    Rep Power
    0
    Originally Posted by FishMonger
    Yes, Text::CSV is the proper module to use, but you could also parse it manually.

    Are you saying that after changing $\ to $/, you're still getting the entire file in @$row[0]?

    If that's true, then it sounds like your file doesn't have any \n chars.

    You may want to use a hex editor to view the file to confirm the line endings.
    I've just checked the file and it does show newline characters. In fact, I manually created the file in a text editor and manually pressed carriage return in order to populate the next row/line of the file.

    After the change you suggested, I attempted to iterate
    Code:
    say @$row[n];
    where n was 0, 1, 2, 3 and 4 in turn, and got the results below...



    Code:
    mypc:Documents root# perl check.pl
    #IGNORE COLUMN1 COLUMN2 COLUMN3 COLUMN4
    ROW1    x   y   z   a
    ROW2    b   c   d   
    ROW3    w
    mypc:Documents root# perl check.pl 
    Use of uninitialized value in say at check.pl line 15, <$file> line 1.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 2.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 3.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 4.
    
    mypc:Documents root# perl check.pl
    Use of uninitialized value in say at check.pl line 15, <$file> line 1.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 2.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 3.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 4.
    
    mypc:Documents root# perl check.pl
    Use of uninitialized value in say at check.pl line 15, <$file> line 1.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 2.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 3.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 4.
    
    mypc:Documents root# perl check.pl
    Use of uninitialized value in say at check.pl line 15, <$file> line 1.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 2.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 3.
    
    Use of uninitialized value in say at check.pl line 15, <$file> line 4.
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    The output shows that it sees/processes 4 lines.

    Add the Data::Dumper module and change:
    Code:
    say @$row[n];
    to:
    Code:
    say Dumper $row;
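
    For reference, a small sketch of what dumping a row looks like. The hard-coded array reference below is only a stand-in for what getline would return, assuming the line was split correctly:

    ```perl
    use strict;
    use warnings;
    use Data::Dumper;

    # Stand-in for one parsed row: an array reference with one element
    # per field. The trailing undef mimics an empty last field when
    # empty_is_undef is enabled.
    my $row = [ 'ROW2', 'b', 'c', 'd', undef ];

    # Dumper prints each element separately, so it is easy to see whether
    # the line was split into fields or came back as one long string.
    print Dumper($row);
    ```

    If the dump shows a single element containing the whole line, the separator is not matching anything in the file.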
    Last edited by FishMonger; October 12th, 2013 at 07:10 PM.
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2013
    Posts
    5
    Rep Power
    0
    Originally Posted by FishMonger
    The output shows that you have 4 lines.

    Add the Data::Dumper module and change:
    Code:
    say @$row[n];
    to:
    Code:
    say Dumper $row;
    haha oops! my file somehow did not have tabs in it!! grrrr..so sorry about the mix up. your suggestion actually worked - $/.

    Sorry!!!
  18. #10
  19. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    I just noticed this:
    Code:
    sep_char            => '\t',
    The use of single quotes is causing it to not interpolate the tab char. You need to use double quotes.
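
    To see the difference, and to tie it back to the original question, here is a minimal sketch. It uses plain split instead of Text::CSV so that it runs without extra modules, which is only reasonable when fields never contain quoted tabs. The missing_columns helper and the inline sample data are made up for illustration:

    ```perl
    use strict;
    use warnings;

    # Single quotes keep the backslash, so '\t' is two characters;
    # double quotes turn "\t" into one real tab character.
    print 'single-quoted length: ', length('\t'), "\n";
    print 'double-quoted length: ', length("\t"), "\n";

    # Hypothetical helper: return the names of the columns for which the
    # given line has no non-blank value. Field 0 is the row label.
    sub missing_columns {
        my ($header, $line) = @_;
        s/\r?\n\z// for $header, $line;    # strip line endings from the copies
        # A limit of -1 keeps trailing empty fields instead of dropping them.
        my @names  = split /\t/, $header, -1;
        my @fields = split /\t/, $line,   -1;
        my @missing;
        for my $i (1 .. $#names) {
            my $value = $fields[$i];
            push @missing, $names[$i] if !defined $value || $value !~ /\S/;
        }
        return @missing;
    }

    # Sample data mirroring the file from the first post.
    my @lines = (
        "#IGNORE\tCOLUMN1\tCOLUMN2\tCOLUMN3\tCOLUMN4",
        "ROW1\tx\ty\tz\ta",
        "ROW2\tb\tc\td\t",
        "ROW3\tw",
    );
    my $header = shift @lines;
    for my $line (@lines) {
        my ($label) = split /\t/, $line;
        my @missing = missing_columns($header, $line);
        print "$label: ", (@missing ? "missing @missing" : 'complete'), "\n";
    }
    ```

    With Text::CSV the same check would apply to the array reference returned by getline, comparing its elements against the header row.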
