#1

    Eliminating Duplicates which are in a transitive Relationship


    I have a large database of homographs with the following structure:
    Code:
    name=name variant
    i.e. each line gives a name and one variant of it, separated by an = sign.
    Example:
    Code:
    Mary=Mariah
    Mary=Marie
    Since the database was prepared manually, duplicates often occur in which the left-hand and right-hand sides are swapped, as in the example below:
    Code:
    Mary=Marie
    Marie=Mary
    These duplicates bloat the data, cause problems downstream, and slow down processing.
    Can a Perl script remove these dupes and keep only one line of each pair?
    Example of Input and Output
    Input:
    Code:
    Mary=Marie
    Marie=Mary
    Mary=Mariam
    Mariam=Mary
    Expected output after removal of dupes:
    Code:
    Mary=Marie
    Mary=Mariam
    Many thanks in anticipation for your help and Happy Holidays.
#2
    The following code will produce the results you want.
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    my @names = split /\n/, <<EOF;
    Mary=Marie
    Marie=Mary
    Mary=Mariam
    Mariam=Mary
    EOF
    
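    my %seen;
    for my $name (@names) {
        # Sort the two sides so "Mary=Marie" and "Marie=Mary" yield the same key.
        my $key = join ':', sort split /=/, $name;
        # Print only the first line seen for each unordered pair.
        print "$name\n" unless $seen{$key}++;
    }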
    Output:
    Code:
    Mary=Marie
    Mary=Mariam
#3
    Many thanks. It worked, and very fast too: over a hundred thousand records were processed in less than 2 seconds.
    I modified the code slightly to read from a file and write the result to a file.
    Happy Holidays
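    The file-based modification mentioned above was not posted in the thread. A minimal sketch of what it might look like follows, using the same sorted-key idea as the reply in #2. The helper name dedupe_file, the file names in the usage comment, and the choice to skip lines without an = sign are all assumptions made for illustration, not the poster's actual code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copies $in_path to $out_path,
# dropping any line whose name pair was already seen in either order.
# Returns the number of lines written.
sub dedupe_file {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";

    my %seen;
    my $kept = 0;
    while (my $line = <$in>) {
        chomp $line;
        next unless $line =~ /=/;    # assumption: skip lines with no separator
        # Same order-independent key as in the posted solution.
        my $key = join ':', sort split /=/, $line;
        next if $seen{$key}++;       # pair already written, in either order
        print {$out} "$line\n";
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $kept;
}

# Assumed usage: perl dedupe.pl names.txt names_unique.txt
if (@ARGV == 2) {
    my $kept = dedupe_file(@ARGV);
    print "Kept $kept lines\n";
}
```

    Reading line by line keeps only the hash of seen pairs in memory rather than the whole file, which should suit inputs of the size mentioned above. As in the original, the first occurrence of each pair is the one that is kept.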
#4
    Glad to help - in just 2 seconds, very good!
