#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    14
    Rep Power
    0

    Detecting dupes in column 1


    Hello,
    I have a dictionary database built by hand. The structure is as follows:
    Code:
    a=x
    b=d
    When I merge several databases, it often happens that the left-hand side, i.e. the keyword, gets repeated:
    Code:
    a=x
    a=y
    I have written a program in Perl which takes an input file with such duplicates and produces an output file with two headers:
    Code:
    SINGLETON
    DUPES
    However, when I run the program on a very large file, it very often misses some of the dupes, and I have to run it several times to catch them all. I have racked my brains to find the bug in the program but cannot identify it. I am still learning Perl, so I may well have made an error. Is there any way in which the program could be modified so that it works in a single pass? I am giving the program below:
    Code:
    #!/usr/bin/perl
    
    $dupes = $singletons = "";		# This goes at the head of the file
    
    do {
        $dupefound = 0;			# These go at the head of the loop
        $text = $line = $prevline = $name = $prevname = "";
        do {
    	$line = <>;
    	$line =~ /^(.+)\=.+$/ and $name = $1;
    	$prevline =~ /^(.+)\=.+$/ and $prevname = $1;
    	if ($name eq $prevname) { $dupefound += 1 }
    	$text .= $line;
    	$prevline = $line;
        } until ($dupefound > 0 and $text !~ /^(.+?)\=.*?\n(?:\1=.*?\n)+\z/m) or eof;
        if ($text =~ s/(^(.+?)\=.*?\n(?:\2=.*?\n)+)//m) { $dupes .= $1 }
        $singletons .= $text;
    } until eof;
    print "SINGLETONS\n$singletons\n\DUPES\n$dupes";
    I feel that I have goofed up in the loop logic.
    Many thanks for your help. If the corrected program could be commented, that would be of great help and would stop me from making a similar mistake in the future.
    Ideally, once the dupes are found, all values for the same key would be concatenated on a single line.
    Thus the dupes could be
    Code:
    a=x
    a=y
    a=z
    b=c
    b=d
    and they could be concatenated into single lines:
    Code:
    a=x,y,z
    b=c,d
    Many thanks
  2. #2
  3. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,264
    Rep Power
    1810
    I'll have some comments after breakfast. Consider this as an alternative for a few minutes:

    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    
    my %data;
    
    while (<DATA>) {
    	chomp;
    	# split on the first '=' only, so a value may itself contain '='
    	my ($k,$v) = split /=/, $_, 2;
    	
    	if (exists $data{$k}) {
    		push @{$data{$k}}, $v;
    	} else {
    		$data{$k} = [$v];
    	}
    }
    
    foreach my $k (sort keys %data) {
    	my $vstr = join ',', sort @{$data{$k}};
    	print "$k=$vstr\n";
    }
    
    __DATA__
    a=x
    a=y
    a=z
    b=c
    b=d
  4. #3
  5. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,264
    Rep Power
    1810
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    
    # create a hash
    my %data;
    
    #collect information
    
    while (<DATA>) {
    	# remove newline character
    	chomp;
    	
    	# split each line into key and value components
    	# (limit of 2: only the first '=' separates, so a value may contain '=')
    	my ($k,$v) = split /=/, $_, 2;
    	
    	if (exists $data{$k}) {
    		# if the key is already stored in the hash
    		# push the value into the array it points to
    		push @{$data{$k}}, $v;
    	} else {
    		# it's a new key
    		# store the key and set it to point to an array containing its one value
    		$data{$k} = [$v];
    	}
    }
    
    # look at the data structure with Data::Dumper
    # can comment out this next line, this is only for inspection and troubleshooting
    print Dumper \%data;
    
    #display of information in a report of your format
    print "SINGLETON\n";
    
    my @dupes;
    foreach my $k (sort keys %data) {
    	my $vstr = join ',', sort @{$data{$k}};
    	
    	# see how many elements are in the array
    	if (@{$data{$k}} > 1) {
    		# if there is more than one, store it for later display
    		push @dupes, "$k=$vstr\n";
    	} else {
    		# else print it now in the singleton area
    		print "$k=$vstr\n";
    	}
    }
    
    print "\nDUPES\n";
    # printing the dupes is easy 
    # because the lines we constructed already contain newline characters
    print foreach @dupes;
    
    __DATA__
    a=x
    a=y
    a=z
    b=c
    b=d
    c=a
    You used the term 'database'.

    If you mean a database system such as MySQL, SQLite, or PostgreSQL, then you should be using the features of the database, such as a 'unique' constraint on a field, so that duplicates cannot be inserted.

    If your 'database' is just a text file, then you will have to use external means such as this script to maintain it. An actual database isn't that hard to work with, though, and you can drive it from Perl, so you should consider that as a better alternative.

    Also, this script only merges duplicate keys; it does not eliminate duplicate values. It would be very simple to eliminate duplicate values as well by running them through a hash, and you could even count the duplicates and store their line positions if you want.

    Those things will be left as an exercise for the reader.
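    For anyone who wants a starting point for that exercise, here is a minimal sketch of the value-deduplication and counting part only (line positions are not tracked). The sample list @lines, the %seen hash, and the report_line helper are names made up for this illustration, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; some pairs are repeated on purpose.
my @lines = ("a=x", "a=y", "a=x", "b=c", "c=a", "c=a");

# $seen{key}{value} holds how many times that exact pair occurred.
my %seen;
for my $line (@lines) {
    my ($k, $v) = split /=/, $line, 2;
    $seen{$k}{$v}++;
}

# Build one report line per key from its distinct values.
sub report_line {
    my ($k) = @_;
    return "$k=" . join(',', sort keys %{ $seen{$k} });
}

for my $k (sort keys %seen) {
    print report_line($k), "\n";
    # Mention any pair that occurred more than once.
    for my $v (sort keys %{ $seen{$k} }) {
        my $n = $seen{$k}{$v};
        print "  $k=$v appeared $n times\n" if $n > 1;
    }
}
```

    Because the values are stored as hash keys, a repeated value collapses into a single entry, and the stored number is its count.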
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,947
    Rep Power
    1225
    There's no need to use an if/else block in the while loop. The push statement is all that's needed.

    Code:
    while (<DATA>) {
    	# remove newline character
    	chomp;
    
    	# split each line into key and value components
    	# (first '=' only) and push them onto the hash of arrays (HoA)
    	my ($k,$v) = split /=/, $_, 2;
    	push @{$data{$k}}, $v;
    }

    Comments on this post

    • keath agrees
  8. #5
  9. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,264
    Rep Power
    1810
    That's true. It's a feature of Perl called autovivification, and it spoils you.

    I've been working in a language that isn't so accommodating lately.
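    A small sketch of what autovivification does, for anyone following along. The keys 'a', 'b' and 'inner' are arbitrary names chosen for the demonstration.

```perl
use strict;
use warnings;

my %data;

# $data{a} does not exist yet. Dereferencing it as an array inside push
# makes Perl create the anonymous array automatically.
push @{ $data{a} }, 'x';
push @{ $data{a} }, 'y';

print join(',', @{ $data{a} }), "\n";

# Caveat: merely looking *through* a missing key also creates it.
# Testing for $data{b}{inner} brings $data{b} into existence as an empty hash.
my $probe = exists $data{b}{inner};
print exists $data{b} ? "b was created\n" : "b is absent\n";
```

    The second half is the flip side of the convenience: a nested lookup can add keys you never stored, so test the outer key first if that matters.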
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    14
    Rep Power
    0
    Many thanks for your kind help and also for taking the trouble to comment the steps. I named the program clndupes.pl.
    I ran the program and got the following messages for line 11:

    Code:
    Name "main::DATA" used only once: possible typo at clndupes.pl line 11.
    readline() on unopened filehandle DATA at clndupes.pl line 11.
    $VAR1 = {};
    SINGLETON
    
    DUPES
    What seems to be the issue?
    I should have mentioned that I work on Windows. Is that the problem?
    Sorry to hassle you like this and thanks once again.
  12. #7
  13. !~ /m$/
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    May 2004
    Location
    Reno, NV
    Posts
    4,264
    Rep Power
    1810
    No, it shouldn't have anything to do with the OS. The messages suggest that your copy of the script has no __DATA__ section, so the DATA filehandle was never opened.

    I put your data into the script at the bottom. It's a convenient way of working on a sample in Perl, especially handy when you are working on the logic of a script and want some easy sample data to work with.

    Everything down at the bottom of the script:
    Code:
    __DATA__
    a=x
    a=y
    a=z
    b=c
    b=d
    c=a
    is treated like a file, available through the special DATA filehandle, which I read in this loop:

    Code:
    while (<DATA>) {
    }
    In order to work with some other data, you need to replace that with your own file-reading routine. You can simply have:

    Code:
    while (<>) {
    }
    if you want to pass in the filename from the command line.

    Or you can open a specific file by name, e.g.:
    Code:
    # parentheses matter: without them, 'or' would bind to 'data.txt' and die would never run
    open(my $fh, '<', 'data.txt') or die "cannot open file: $!";
    while (<$fh>) {
    }
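    Putting the pieces together, a rough sketch of the whole thing reading from a real file might look like this. The subroutine names read_pairs and build_report and the in-memory $sample fallback are my own choices for illustration; the fallback is only there so the sketch runs without an input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read key=value lines from a filehandle into a hash of arrays.
sub read_pairs {
    my ($fh) = @_;
    my %data;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;        # strip Unix or Windows line endings
        next unless $line =~ /=/;    # skip blank or malformed lines
        my ($k, $v) = split /=/, $line, 2;
        push @{ $data{$k} }, $v;
    }
    return \%data;
}

# Build the SINGLETON / DUPES report as one string.
sub build_report {
    my ($data) = @_;
    my (@single, @dupes);
    for my $k (sort keys %$data) {
        my $entry = "$k=" . join(',', sort @{ $data->{$k} }) . "\n";
        if (@{ $data->{$k} } > 1) { push @dupes, $entry }
        else                      { push @single, $entry }
    }
    return join('', "SINGLETON\n", @single, "\nDUPES\n", @dupes);
}

# Use the file named on the command line if there is one; otherwise fall
# back to an in-memory sample so the sketch runs on its own.
my $sample = "a=x\na=y\nb=d\n";
my $fh;
if (@ARGV) {
    open($fh, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";
} else {
    open($fh, '<', \$sample) or die "cannot open in-memory sample: $!";
}
print build_report(read_pairs($fh));
close $fh;
```

    Run it as, for example, perl clndupes.pl data.txt > out.txt. Since every line is read once into the hash, all dupes are caught in a single pass regardless of where they sit in the file.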
    Last edited by keath; February 9th, 2014 at 09:19 PM.
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2011
    Posts
    14
    Rep Power
    0
    I tested it with your suggestion and it worked perfectly.
    Many thanks. I am still learning, and this is a convenient way of handling the data.
    Sorry for bothering you and thanks for your patience.
    Thanks also for providing an alternate way of reading in the data.
