The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.
|
 |
|
Dev Shed Forums
> Programming Languages - More
> Software Design
|
Jaro-Winkler Matching Algorithm
Discuss Jaro-Winkler Matching Algorithm in the Software Design forum on Dev Shed. Jaro-Winkler Matching Algorithm Software design forum discussing design principles and non-language specific algorithms. Get help with logic, algebraic, or relational concepts.
|
|
 |
|
|
|
|

Dev Shed Forums Sponsor:
|
|
|

May 19th, 2003, 10:32 AM
|
|
Junior Member
|
|
Join Date: May 2003
Posts: 3
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
|
Jaro-Winkler Matching Algorithm
Does anyone know where to get the source for this algorithm?
I've searched on the web, but there's nothing but science papers on the algorithm. I'm an average joe and some of the stuff in them science papers goes well over me head
So if any one has it could ya lend us a hand and pass it on, please.
Thanks,
Thomas
|

May 19th, 2003, 01:11 PM
|
|
Junior Member
|
|
Join Date: May 2003
Location: Oxford, England
Posts: 14
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
The source for which language would be useful to know for starters...!!! 
|

May 20th, 2003, 03:32 AM
|
|
Junior Member
|
|
Join Date: May 2003
Posts: 3
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
|
the language doesn't matter. I just write it into the language I want. but if you have it in C, C++, C#, Java or to be honost any language i'll take it cause its impossible to find.
|

May 20th, 2003, 10:42 AM
|
|
Junior Member
|
|
Join Date: Apr 2003
Location: UK
Posts: 13
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
|
What is it?
Do you have links for the papers?
|

May 20th, 2003, 10:55 AM
|
|
Junior Member
|
|
Join Date: May 2003
Posts: 3
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
|
Jaro Winkler algorithm is an improved version of the Edit Distance algorithm http://www.nist.gov/dads/HTML/editdistance.html
pretty much the same thing but with an example
URL
for an understanding of Jaro n' his mate Winklers Algorithm
URL
as for the paper I haven't seen any that explain the algorithm
|

May 21st, 2003, 06:41 PM
|
|
Junior Member
|
|
Join Date: May 2003
Posts: 1
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
|
VB Code
Hi Thomas,
A couple of month ago I've studied different kinds of approximate string matching algorithms, among them Jaro-Winkler. I've developed an VBAccess code to test them prior to developing the ABAP final application.
Basically, it computes what i've understood of the papers you mentioned...
Here is the code (beware because it may be buggy):
****************************
Function jaro(ByVal str1 As String, ByVal str2 As String) As Double
Dim l1, l2, lmin, lmax, m, i, j As Integer
Dim common As Integer
Dim tr As Double
Dim a1, a2 As String
str1 = clean_string(str1)
str2 = clean_string(str2)
l1 = Len(str1)
l2 = Len(str2)
If l1 > l2 Then
aux = l2
l2 = l1
l1 = aux
auxstr = str1
str1 = str2
str2 = auxstr
End If
lmin = l1
lmax = l2
Dim f1(), f2() As Boolean
ReDim f1(l1), f2(l2)
For i = 1 To l1
f1(i) = False
Next i
For j = 1 To l2
f2(j) = False
Next j
m = Int((lmax / 2) - 1)
common = 0
tr = 0
For i = 1 To l1
a1 = Mid(str1, i, 1)
If m >= i Then
f = 1
l = i + m
Else
f = i - m
l = i + m
End If
If l > lmax Then
l = lmax
End If
For j = f To l
a2 = Mid(str2, j, 1)
If (a2 = a1) And (f2(j) = False) Then
common = common + 1
f1(i) = True
f2(j) = True
GoTo linea_exit
End If
Next j
linea_exit:
Next i
Dim wcd, wrd, wtr As Double
l = 1
For i = 1 To l1
If f1(i) Then
For j = l To l2
If f2(j) Then
l = j + 1
a1 = Mid(str1, i, 1)
a2 = Mid(str2, j, 1)
If a1 <> a2 Then
tr = tr + 0.5
End If
Exit For
End If
Next j
End If
Next i
wcd = 1 / 3
wrd = 1 / 3
wtr = 1 / 3
If common <> 0 Then
jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common
Else
jaro = 0
End If
End Function
**************************
Function clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
Function clean_string(str1 As String) As String
str1 = Replace(str1, ".", "")
str1 = Replace(str1, ",", "")
str1 = Replace(str1, "-", "")
str1 = Replace(str1, ";", "")
str1 = Replace(str1, ":", "")
str1 = Replace(str1, "Á", "A")
str1 = Replace(str1, "É", "E")
str1 = Replace(str1, "Í", "I")
str1 = Replace(str1, "Ó", "O")
str1 = Replace(str1, "Ú", "U")
str1 = Replace(str1, " COMPANY ", " CO ")
str1 = Replace(str1, " CORPORATION ", " CO ")
....
EndFunction
If you want I can give you my .mdb file so you can play with it.
|

May 22nd, 2003, 06:14 PM
|
|
Junior Member
|
|
Join Date: Apr 2003
Location: UK
Posts: 13
Time spent in forums: < 1 sec
Reputation Power: 0
|
|
I can't make out from the papers or your source how the algorithm should take into account the fact that a mistake like s instead of z is much more likely than i instead of z - surely this would require some sort of table?
Also a really easy mistake to make is when you make the wrong letter double - eg foll instead of fool. I think this needs taking into account if you are making your own version.
How about putting your source on Sourceforge once it's sorted - it sounds like a really useful algorithm but it's not on the internet anywhere! 'Spira-C 
|

May 1st, 2007, 08:00 AM
|
|
Registered User
|
|
Join Date: Apr 2007
Location: Vermont
Posts: 1
Time spent in forums: 10 m 52 sec
Reputation Power: 0
|
|
|
Oracle implementation
I converted icases' VB code to run on Oracle 10g R1. 10g R2 includes a Jaro-Winkler distance function (as part of the new UTL_MATCH package), but our production database won't be upgrading to that version for at least 6 months. I needed something a bit sooner, so I did the fairly simple conversion from VB to Oracle PL/SQL.
The problem is, it produces slightly different values than Oracle's UTL_MATCH.JARO_WINKLER() function.
Unfortunately I don't have the time to analyze the original function or icases' code to find the problem. If you solve it, please let me know at the email below.
Thanks,
Stew
Code:
CREATE OR REPLACE FUNCTION jaro(str1_in IN VARCHAR2, str2_in IN VARCHAR2) RETURN NUMBER AS
/* Converted from VB code from:
forums.devshed.com/software-design-43/jaro-winkler-matching-algorithm-62100 DOT html
Oracle implementation by Stew Stryker (stewDOTstrykerATdartmouthDOTedu) 4/30/07
*/
str1 VARCHAR2(4000);
str2 VARCHAR2(4000);
len_str1 PLS_INTEGER;
len_str2 PLS_INTEGER;
swap_len PLS_INTEGER;
max_len PLS_INTEGER;
m PLS_INTEGER;
i PLS_INTEGER;
j PLS_INTEGER;
f PLS_INTEGER;
l PLS_INTEGER;
tr NUMBER := 0;
a1 VARCHAR2(4000);
a2 VARCHAR2(4000);
swap_str VARCHAR2(4000);
TYPE bool_list_t IS TABLE OF BOOLEAN INDEX BY PLS_INTEGER;
f1 bool_list_t;
f2 bool_list_t;
wcd NUMBER;
wrd NUMBER;
wtr NUMBER;
common NUMBER := 0;
jaro_value NUMBER := 0;
-- Local function
-- clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
FUNCTION clean_string(p_str1 IN VARCHAR2) RETURN VARCHAR2 IS
ret_value VARCHAR2(4000) := p_str1;
BEGIN
ret_value := REPLACE(ret_value, '.', '');
ret_value := REPLACE(ret_value, ',', '');
ret_value := REPLACE(ret_value, '-', '');
ret_value := REPLACE(ret_value, ';', '');
ret_value := REPLACE(ret_value, ':', '');
ret_value := REPLACE(ret_value, 'Á', 'A');
ret_value := REPLACE(ret_value, 'É', 'E');
ret_value := REPLACE(ret_value, 'Í', 'I');
ret_value := REPLACE(ret_value, 'Ó', 'O');
ret_value := REPLACE(ret_value, 'Ú', 'U');
ret_value := REPLACE(ret_value, ' COMPANY ', ' CO ');
ret_value := REPLACE(ret_value, ' CORPORATION ', ' CO ');
RETURN ret_value;
END clean_string;
-- MAIN
BEGIN
str1 := clean_string(str1_in);
str2 := clean_string(str2_in);
len_str1 := LENGTH(str1);
len_str2 := LENGTH(str2);
IF len_str1 > len_str2
THEN
swap_len := len_str2;
len_str2 := len_str1;
len_str1 := swap_len;
swap_str := str1;
str1 := str2;
str2 := swap_str;
END IF;
max_len := len_str2;
FOR i IN 1 .. len_str1
LOOP
f1(i) := FALSE;
END LOOP;
FOR j IN 1 .. len_str2
LOOP
f2(j) := FALSE;
END LOOP;
m := ROUND((max_len / 2) - 1);
FOR i IN 1 .. len_str1
LOOP
a1 := SUBSTR(str1, i, 1);
IF m >= i
THEN
f := 1;
l := i + m;
ELSE
f := i - m;
l := i + m;
END IF;
IF l > max_len
THEN
l := max_len;
END IF;
FOR j IN f .. l
LOOP
a2 := SUBSTR(str2, j, 1);
IF (a2 = a1)
AND (f2(j) = FALSE)
THEN
common := common + 1;
f1(i) := TRUE;
f2(j) := TRUE;
GOTO linea_exit;
END IF;
END LOOP; -- j
<<linea_exit>>
NULL; -- Add NULL statement to avoid error
END LOOP; -- i
l := 1;
FOR i IN 1 .. len_str1
LOOP
IF f1(i)
THEN
FOR j IN l .. len_str2
LOOP
IF f2(j)
THEN
l := j + 1;
a1 := SUBSTR(str1, i, 1);
a2 := SUBSTR(str2, j, 1);
IF a1 <> a2
THEN
tr := tr + 0.5;
END IF;
GOTO j_loop;
END IF;
END LOOP; -- j
<<j_loop>>
NULL; -- Add NULL statement to avoid error
END IF;
END LOOP; -- i
wcd := 1 / 3;
wrd := 1 / 3;
wtr := 1 / 3;
IF common <> 0
THEN
jaro_value := wcd * common / len_str1 + wrd * common / len_str2 +
wtr * (common - tr) / common;
END IF;
RETURN jaro_value;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('EXCEPTION IN jaro - ' || SQLCODE || ': ' || SQLERRM);
DBMS_OUTPUT.PUT_LINE('Error stack at top level:');
DBMS_OUTPUT.PUT_LINE(dbms_utility.format_error_backtrace);
RETURN jaro_value;
END jaro;
Last edited by stewstryker : May 1st, 2007 at 08:02 AM.
Reason: Fix URL
|

July 18th, 2010, 09:14 PM
|
|
Registered User
|
|
Join Date: Jul 2010
Location: So Cal
Posts: 1
Time spent in forums: 16 m 51 sec
Reputation Power: 0
|
|
|
Jaro Winkler implementation in C#
I'm unsure where icases found the code that was to reproduce the string matching found in Mr. Winkler's original C function but, as HiFidelity said, it really cannot duplicate that algorithm without some kind of lookup table. I used the original code from the Census Bureau and implemented it in a CLR udf for SQL Server 2005/2008. It is a very rough translation of the original C and so it is not optimized but, it does reproduce the results of the original code almost exactly. I've posted a link to this at codeguru.
|
Developer Shed Advertisers and Affiliates
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Rate This Thread |
Linear Mode
|
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|
|