#1

    Ignore lines from a file matching a pattern


    Hi,

    I have a CSV file in which I would like to count the number of times a matching pattern shows up, and to ignore the lines whose pattern matches more than 2 times.

    For example, the file looks like this:

    apple,fruit,price is 1.25,12-13-2013
    potato,vegetable,this is good,12-13-2013
    kiwi,circile,price is 1.25,12-13-2013
    pumpkin,orange,this is good,12-13-2013
    berry,red,price is 1.25,12-13-2013


    In this file I would like to look at the 3rd value in each line, count how many times each value occurs, and ignore the lines whose value occurs more than 2 times.

    In the sample above, the 1st, 3rd and 5th lines should be ignored, as they all have the same 3rd value (price is 1.25), which occurs 3 times.

    The output should be:
    potato,vegetable,this is good,12-13-2013
    pumpkin,orange,this is good,12-13-2013

    Any help, please.

    Thanks
#2
    Hi,
    you can use a hash to record the number of times each value of the field comes up. For example, something like this:

    Perl Code:
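    my %counter;    # third-field value => number of lines containing it
    # $file is assumed to hold the path to your CSV file
    open my $INFILE, "<", $file or die "cannot open $file: $!";
    while (<$INFILE>) {
         my $field3 = (split /,/, $_)[2];
         $counter{$field3}++;
    }
    close $INFILE;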


    Once you are done, the %counter hash has the information you need. You can then read the file a second time and output only the lines whose third-field counter satisfies your condition.
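
    The two passes described above could be sketched roughly as follows. This is only an illustration: the helper name lines_under_limit and the file name input.csv are made up for the example, and it assumes plain comma-separated lines with no quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of $file whose third field
# occurs at most $max times in the whole file, in their original order.
sub lines_under_limit {
    my ($file, $max) = @_;

    # First pass: count each value of the third field.
    my %counter;
    open my $in, '<', $file or die "cannot open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my $field3 = (split /,/, $line)[2];
        next unless defined $field3;    # skip blank or short lines
        $counter{$field3}++;
    }

    # Second pass: rewind and keep the lines that pass the test.
    seek($in, 0, 0) or die "cannot rewind $file: $!";
    my @kept;
    while (my $line = <$in>) {
        chomp $line;
        my $field3 = (split /,/, $line)[2];
        next unless defined $field3;
        push @kept, $line if $counter{$field3} <= $max;
    }
    close $in;
    return @kept;
}

# Example call ('input.csv' is a placeholder name):
# print "$_\n" for lines_under_limit('input.csv', 2);
```

    Reading the file twice keeps only the counters in memory, and the kept lines come out in the same order as in the input.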

    A possibly more efficient alternative (depending on the size of the input file) would be to store both the lines and the counter in the hash, using a hash of hashes (HoH) or a hash of arrays (HoA). This is an example using an HoH (or actually an HoHoA):

    Perl Code:
    use strict;
    use warnings;
    use Data::Dumper;
     
    my %HoH;
    while (<DATA>) {
         chomp;
         my $field3 = (split /,/, $_)[2];
         $HoH{$field3}{'counter'} ++;
         push @{$HoH{$field3}{lines}}, [$_];
    }
     
    print Dumper \%HoH;
     
    __DATA__
    apple,fruit,price is 1.25,12-13-2013
    potato,vegetable,this is good,12-13-2013
    kiwi,circile,price is 1.25,12-13-2013
    pumpkin,orange,this is good,12-13-2013
    berry,red,price is 1.25,12-13-2013


    Now the data structure in the hash is as follows:

    Code:
    $VAR1 = {
              'this is good' => {
                                  'lines' => [
                                               [
                                                 'potato,vegetable,this is good,12-13-2013'
                                               ],
                                               [
                                                 'pumpkin,orange,this is good,12-13-2013'
                                               ]
                                             ],
                                  'counter' => 2
                                },
              'price is 1.25' => {
                                   'lines' => [
                                                [
                                                  'apple,fruit,price is 1.25,12-13-2013'
                                                ],
                                                [
                                                  'kiwi,circile,price is 1.25,12-13-2013'
                                                ],
                                                [
                                                  'berry,red,price is 1.25,12-13-2013'
                                                ]
                                              ],
                                   'counter' => 3
                                 }
            };
    You just need to print the lines where the counter matches your condition. I leave it to you as a sort of homework, but if you don't succeed, show what you've tried and we can help you further.

    (Note that chomping the input is not really necessary in your actual program; I did it only to get a nicer Dumper output.)
    Last edited by Laurent_R; December 14th, 2013 at 05:22 AM.
#3
    Laurent_R has given a good solution, but since you said you have a CSV file, you could also think of using one of the modules Text::CSV or Text::CSV_XS instead of splitting on commas.

    Just saying..
#4
    Hi,

    actually, thinking again about it, the second solution I presented is too complicated. A simple hash of arrays is sufficient, and the counter is not really useful, since you only need to count the lines in the inner arrays.

    You could change the main loop as follows:

    Perl Code:
    my %HoA;    # third-field value => array of the lines containing it
    while (<DATA>) {
         chomp;
         my $field3 = (split /,/, $_)[2];
         push @{ $HoA{$field3} }, $_;
    }


    Now the data structure is the following:

    Code:
    $VAR1 = {
              'this is good' => [
                                  'potato,vegetable,this is good,12-13-2013',
                                  'pumpkin,orange,this is good,12-13-2013'
                                ],
              'price is 1.25' => [
                                   'apple,fruit,price is 1.25,12-13-2013',
                                   'kiwi,circile,price is 1.25,12-13-2013',
                                   'berry,red,price is 1.25,12-13-2013'
                                 ]
            };
    which is much simpler.

    Now, to figure out how many lines have the 'price is 1.25' field, just do the following:

    Perl Code:
    my $line_count = scalar @{ $HoA{'price is 1.25'} };    # $line_count is now 3
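
    Building on that, here is one possible way to group the lines and print only the groups holding at most 2 lines. It is a sketch rather than the definitive answer: the helper name filter_groups is invented for the example, and the groups are walked in sorted key order only to make the output repeatable, so lines from different groups do not keep their original relative order.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines by their third field, then return
# the lines of every group that holds at most $max lines.
sub filter_groups {
    my ($max, @lines) = @_;

    my %HoA;    # third-field value => array of the lines containing it
    for my $line (@lines) {
        my $field3 = (split /,/, $line)[2];
        next unless defined $field3;    # skip blank or short lines
        push @{ $HoA{$field3} }, $line;
    }

    my @kept;
    for my $field3 (sort keys %HoA) {
        my @group = @{ $HoA{$field3} };
        # An array in numeric context yields its number of elements.
        push @kept, @group if @group <= $max;
    }
    return @kept;
}

my @sample = (
    'apple,fruit,price is 1.25,12-13-2013',
    'potato,vegetable,this is good,12-13-2013',
    'kiwi,circile,price is 1.25,12-13-2013',
    'pumpkin,orange,this is good,12-13-2013',
    'berry,red,price is 1.25,12-13-2013',
);
print "$_\n" for filter_groups(2, @sample);    # prints the potato and pumpkin lines
```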


    Using Text::CSV or Text::CSV_XS, as suggested by 2teez, is usually a good idea, and you really should do so as soon as the CSV gets any more complicated than the simplest case (for example, quoted fields containing the field separator). For the very simplest cases, such as this one, I am usually happy with simply splitting on the field separator.
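
    To illustrate the quoted-field case, here is a small sketch using Text::CSV (a CPAN module that has to be installed separately). The helper name third_field and the sample line with an embedded comma are made up for the example.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Hypothetical helper: return the third field of one CSV line,
# honouring quoted fields; returns nothing if the line cannot be parsed.
sub third_field {
    my ($csv, $line) = @_;
    return unless $csv->parse($line);
    return ($csv->fields)[2];
}

my $csv = Text::CSV->new({ binary => 1 });

# A plain split /,/ would cut this quoted field at the embedded comma.
my $line = 'apple,fruit,"price is 1.25, approx",12-13-2013';
print third_field($csv, $line), "\n";    # prints: price is 1.25, approx
```

    The value returned here can be used as the hash key in exactly the same way as the result of the split in the earlier snippets.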
#5
    Thanks for your post.

    The issue is that the file I posted is only a sample. The CSV files I am going to get will be different, and they will have many lines.

    So I will need to look at the third field in each CSV file and keep looping over the matches I find.
#6
    Well, it depends on exactly how many "many lines" is. The solution outlined above should work for millions of lines, but probably not for hundreds of millions.

    There are some possible optimizations (like stopping recording the lines as soon as you have passed the limit), but whether they will bring a real advantage will depend on the redundancy of the data.
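
    As a rough idea of what the "stop recording" optimization might look like, here is a sketch. The helper name filter_capped is invented, the lines are passed in as a list only to keep the example short, and whether it saves anything in practice depends on your data, as said above.

```perl
use strict;
use warnings;

# Hypothetical helper: like the hash-of-arrays approach, but once a
# group exceeds $max lines it is dropped and never stored again.
sub filter_capped {
    my ($max, @lines) = @_;

    my (%count, %kept);
    for my $line (@lines) {
        my $field3 = (split /,/, $line)[2];
        next unless defined $field3;    # skip blank or short lines
        if (++$count{$field3} > $max) {
            delete $kept{$field3};      # free what was stored so far
            next;
        }
        push @{ $kept{$field3} }, $line;
    }

    # Sorted key order is used only to make the result repeatable.
    return map { @{ $kept{$_} } } sort keys %kept;
}
```

    With this version, no more than $max lines are ever held for any one value of the third field, so a value repeated millions of times costs no more memory than one repeated three times.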
