#1
  1. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Posts
    180
    Rep Power
    11

    Merge arrays with almost similar elements


    I have a file looks like this:

    John,Song,William
    Mark,John,Mike,Tim,Micheal
    Issac,John,Sam Song,William
    John,Song,William,Harry
    :

    I would like to merge rows with 60% similar items to other rows so the output will be:

    Mark,John,Mike,Tim,Micheal
    Issac,John,Sam Song,William,Harry

    In this case row 1, 3 and 4 share similarity which is more than 60% and therefore they are merged.

    Thanks in advance,
    Regards,
    NMZ
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    You could create an array of hashes: one array element for each line in your file and in this array element a hash whose keys are the names in the line.

    Then, when you move down in the file, check if 60% of the names are found in any of the hashes.

    If the file is not too long, it should be OK.

    You may want to keep also a single hash of arrays, with the keys being the names and the values a list of the line numbers where such names appear. This not necessary, but it may help reduce the number of searches in the first data structure (depending on how redundant your data is, this could speed up, or not, the whole process).

    The difficulty I can see in any case is to reorganize your data when you decide to merge two lines. Actually, the same lines, but a different order, might give you different output.
  4. #3
  5. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Posts
    180
    Rep Power
    11
    Thank you so much Laurent. I appreciate your advise.
    Would it be possible to show me some codes? I am not really good in perl.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    OK, I gave a quick try, although you don't really give enough information on what to do with the matching lines. So I'm really giving just the algorithmic skeleton doing the most complicated part, you'll have to add some more things. For the time being, the script only prints the matching lines to the screen, but does not merge the names, because I don't know if you want to merge the lines as you go (as soon as you find a match) or only at the end. Depending on how you do it, it will change the behavior of the script, because if you had names to existing lines, this changes the 60% matching pattern (for example, some lines which would match the unchanged first line will no longer match the changed first line, et, reciprocally, some other lines which would not match the untouched first line would match the modified first line ).

    Perl Code:
    use strict;
    use warnings;
     
    my @stored_lines;
    my $stored_line_count = 0;
     
    while (<DATA>) {
    	chomp;
    	s/\s//g; # removing spaces, just in case
    	my @names = split /,/, $_ ;
    	my $limit = 0.6 * @names; # the 60% limit
    	my $found = 0;
    	foreach my $line_ref (@stored_lines) {
    		my $count = 0;
    		foreach my $current_name (@names) {
    			$count++ if defined $$line_ref{$current_name};
    		}
    		if ($count >= $limit) {
    			$found = 1;
    			# do something to merge the lines (not enough information provided). So far, we only print
    			print "We've found a 60% match: @names on line $.", "\n";
    			my $output = join " ", keys %$line_ref;
    			print "and : $output",  "\n";
    			last;
    		}
    	}
    	unless ($found) {
    		# we haven't found a matching line, we need to store the names in the AoH
    		# may be we also want to store the names if we found the line, this has to be decided.
    		#In this case, simply remove the condition on $found
    		foreach my $current_name (@names) {
    			$stored_lines[$stored_line_count]{$current_name} = 1;
    		}
    		$stored_line_count ++;
    	}
    }
     
    __DATA__
    John,Song,William
    Mark,John,Mike,Tim,Micheal
    Issac,John,Sam, Song,William
    John,Song,William,Harry


    The program uses the data at the end of the script (after the __DATA__ tag. You'll have to change it to read from a file.

    This outputs the following results:

    Code:
    $ perl names.pl
    We've found a 60% match: Issac John Sam Song William on line 3
    and : John William Song
    We've found a 60% match: John Song William Harry on line 4
    and : John William Song
    Last edited by Laurent_R; August 18th, 2012 at 05:34 AM.
  8. #5
  9. Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2003
    Posts
    180
    Rep Power
    11
    Thank you Laurent. I appreciate taking the time to assist me.

    Merging:
    Suppose the following is the input data:

    L1: John,Song,William
    L2: Mark,John,Mike,Tim,Micheal
    L3: Issac,John,Sam,Song,William
    L4: John,Song,William,Harry

    In this case, L1 is compared to L2, L3 and L4. If L1 share >60 similarity in names (not necessarily in the same sequence) the two lists are merged and the duplicates are removed.

    For example:
    L1 and L2 have no similarity
    L1 share >60 similarity with L3 and L4 therefore L1, L3 and L4 should be merged as:
    John,Song,William,Issac,John,Sam,Song,William,John,Song,William,Harry

    And then the duplication should be removed:
    John,Song,William,Issac,Sam,Harry

    The data is then updated to be:
    L1: John,Song,William,Issac,Sam,Harry
    L2: Mark,John,Mike,Tim,Micheal

    No more similarity exists so this is the final output otherwise the process are repeated again.

    thank you once again,
    NMZ
    Last edited by nmz; August 19th, 2012 at 03:08 PM.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    776
    Rep Power
    495
    Hi nmz,

    the point is that you don't seem to realize that you might be trying to aim at a moving target.

    Taking your 3 first input lines:

    L1: John,Song,William
    L2: Mark,John,Mike,Tim,Micheal
    L3: Issac,John,Sam,Song,William

    OK, I record L1.

    L2 is quite different, no merging occurs, I record L2. Forget it.

    L3 has John, Song and William in common with line 1. How do you evaluate the 60% margin? L1 compared to L3 or L3 compared to L1? In one case, you get a 100% match, in the other just a bare 60% match. OK match in both cases. But if L3 had been "Issac,John,Sam,Song", it would possibly give a different result. You really have to specify more precisely what kind of match you expect.

    The second point is even more complicated.

    You have:

    L1: John,Song,William
    L2: Mark,John,Mike,Tim,Micheal
    L3: Issac,John,Sam,Song,William

    Again, we forget about L2.

    Assume you are reading L3. You find John, Song and William and decide that this is a match. So, if I update my string immediately L1 becomes "John,Song,William, Isaac,Sam". If I wait for the end of the process, it remains "John,Song,William".

    Now, whether L1 is updated immediately and becomes "John,Song,William, Isaac,Sam" or whether it remains constant until the end of the process has a huge impact on the match we will get on L4 and below.

    If L4 is ""John,William, Isaac,Tom", we will have a match, and Tom will be added to L1. In the other case, the match is not 60%, and Tom will not be added. The rest of the procedure becomes even less well defined, depending on the approach you are taking.

    In brief, you have to define very precisely what occurs in the various cases.

    Your description is not sufficient.

IMN logo majestic logo threadwatch logo seochat tools logo