|
|
|||||||||
|
|||||||||
| |||||||||
|
|
|
| ||||||||||||||||||||
![]() |
|
|
«
Previous Thread
|
Next Thread
»
|
Thread Tools | Search this Thread | Rate Thread | Display Modes |
|
#1
|
|||
|
|||
|
Jaro-Winkler Matching Algorithm
Does anyone know where to get the source for this algorithm?
I've searched on the web, but there's nothing but science papers on the algorithm. I'm an average joe and some of the stuff in them science papers goes well over me head So if any one has it could ya lend us a hand and pass it on, please. Thanks, Thomas |
|
#2
|
|||
|
|||
|
The source for which language would be useful to know for starters...!!!
![]() |
|
#3
|
|||
|
|||
|
the language doesn't matter. I just write it into the language I want. but if you have it in C, C++, C#, Java or to be honost any language i'll take it cause its impossible to find.
|
|
#4
|
|||
|
|||
|
What is it?
Do you have links for the papers? |
|
#5
|
|||
|
|||
|
Jaro Winkler algorithm is an improved version of the Edit Distance algorithm http://www.nist.gov/dads/HTML/editdistance.html
pretty much the same thing but with an example URL for an understanding of Jaro n' his mate Winklers Algorithm URL as for the paper I haven't seen any that explain the algorithm |
|
#6
|
|||
|
|||
|
VB Code
Hi Thomas,
A couple of month ago I've studied different kinds of approximate string matching algorithms, among them Jaro-Winkler. I've developed an VBAccess code to test them prior to developing the ABAP final application. Basically, it computes what i've understood of the papers you mentioned... Here is the code (beware because it may be buggy): **************************** Function jaro(ByVal str1 As String, ByVal str2 As String) As Double Dim l1, l2, lmin, lmax, m, i, j As Integer Dim common As Integer Dim tr As Double Dim a1, a2 As String str1 = clean_string(str1) str2 = clean_string(str2) l1 = Len(str1) l2 = Len(str2) If l1 > l2 Then aux = l2 l2 = l1 l1 = aux auxstr = str1 str1 = str2 str2 = auxstr End If lmin = l1 lmax = l2 Dim f1(), f2() As Boolean ReDim f1(l1), f2(l2) For i = 1 To l1 f1(i) = False Next i For j = 1 To l2 f2(j) = False Next j m = Int((lmax / 2) - 1) common = 0 tr = 0 For i = 1 To l1 a1 = Mid(str1, i, 1) If m >= i Then f = 1 l = i + m Else f = i - m l = i + m End If If l > lmax Then l = lmax End If For j = f To l a2 = Mid(str2, j, 1) If (a2 = a1) And (f2(j) = False) Then common = common + 1 f1(i) = True f2(j) = True GoTo linea_exit End If Next j linea_exit: Next i Dim wcd, wrd, wtr As Double l = 1 For i = 1 To l1 If f1(i) Then For j = l To l2 If f2(j) Then l = j + 1 a1 = Mid(str1, i, 1) a2 = Mid(str2, j, 1) If a1 <> a2 Then tr = tr + 0.5 End If Exit For End If Next j End If Next i wcd = 1 / 3 wrd = 1 / 3 wtr = 1 / 3 If common <> 0 Then jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common Else jaro = 0 End If End Function ************************** Function clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example: Function clean_string(str1 As String) As String str1 = Replace(str1, ".", "") str1 = Replace(str1, ",", "") str1 = Replace(str1, "-", "") str1 = Replace(str1, ";", "") str1 = Replace(str1, ":", "") str1 = Replace(str1, "Á", "A") str1 = Replace(str1, "É", "E") str1 = Replace(str1, "Í", "I") str1 = Replace(str1, "Ó", "O") str1 = Replace(str1, "Ú", "U") str1 = Replace(str1, " COMPANY ", " CO ") str1 = Replace(str1, " CORPORATION ", " CO ") .... EndFunction If you want I can give you my .mdb file so you can play with it. |
|
#7
|
|||
|
|||
|
I can't make out from the papers or your source how the algorithm should take into account the fact that a mistake like s instead of z is much more likely than i instead of z - surely this would require some sort of table?
Also a really easy mistake to make is when you make the wrong letter double - eg foll instead of fool. I think this needs taking into account if you are making your own version. How about putting your source on Sourceforge once it's sorted - it sounds like a really useful algorithm but it's not on the internet anywhere! 'Spira-C ![]() |
|
#8
|
|||
|
|||
|
Oracle implementation
I converted icases' VB code to run on Oracle 10g R1. 10g R2 includes a Jaro-Winkler distance function (as part of the new UTL_MATCH package), but our production database won't be upgrading to that version for at least 6 months. I needed something a bit sooner, so I did the fairly simple conversion from VB to Oracle PL/SQL.
The problem is, it produces slightly different values than Oracle's UTL_MATCH.JARO_WINKLER() function. Unfortunately I don't have the time to analyze the original function or icases' code to find the problem. If you solve it, please let me know at the email below. Thanks, Stew Code:
CREATE OR REPLACE FUNCTION jaro(str1_in IN VARCHAR2, str2_in IN VARCHAR2) RETURN NUMBER AS
/* Converted from VB code from:
forums.devshed.com/software-design-43/jaro-winkler-matching-algorithm-62100 DOT html
Oracle implementation by Stew Stryker (stewDOTstrykerATdartmouthDOTedu) 4/30/07
*/
str1 VARCHAR2(4000);
str2 VARCHAR2(4000);
len_str1 PLS_INTEGER;
len_str2 PLS_INTEGER;
swap_len PLS_INTEGER;
max_len PLS_INTEGER;
m PLS_INTEGER;
i PLS_INTEGER;
j PLS_INTEGER;
f PLS_INTEGER;
l PLS_INTEGER;
tr NUMBER := 0;
a1 VARCHAR2(4000);
a2 VARCHAR2(4000);
swap_str VARCHAR2(4000);
TYPE bool_list_t IS TABLE OF BOOLEAN INDEX BY PLS_INTEGER;
f1 bool_list_t;
f2 bool_list_t;
wcd NUMBER;
wrd NUMBER;
wtr NUMBER;
common NUMBER := 0;
jaro_value NUMBER := 0;
-- Local function
-- clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
FUNCTION clean_string(p_str1 IN VARCHAR2) RETURN VARCHAR2 IS
ret_value VARCHAR2(4000) := p_str1;
BEGIN
ret_value := REPLACE(ret_value, '.', '');
ret_value := REPLACE(ret_value, ',', '');
ret_value := REPLACE(ret_value, '-', '');
ret_value := REPLACE(ret_value, ';', '');
ret_value := REPLACE(ret_value, ':', '');
ret_value := REPLACE(ret_value, 'Á', 'A');
ret_value := REPLACE(ret_value, 'É', 'E');
ret_value := REPLACE(ret_value, 'Í', 'I');
ret_value := REPLACE(ret_value, 'Ó', 'O');
ret_value := REPLACE(ret_value, 'Ú', 'U');
ret_value := REPLACE(ret_value, ' COMPANY ', ' CO ');
ret_value := REPLACE(ret_value, ' CORPORATION ', ' CO ');
RETURN ret_value;
END clean_string;
-- MAIN
BEGIN
str1 := clean_string(str1_in);
str2 := clean_string(str2_in);
len_str1 := LENGTH(str1);
len_str2 := LENGTH(str2);
IF len_str1 > len_str2
THEN
swap_len := len_str2;
len_str2 := len_str1;
len_str1 := swap_len;
swap_str := str1;
str1 := str2;
str2 := swap_str;
END IF;
max_len := len_str2;
FOR i IN 1 .. len_str1
LOOP
f1(i) := FALSE;
END LOOP;
FOR j IN 1 .. len_str2
LOOP
f2(j) := FALSE;
END LOOP;
m := ROUND((max_len / 2) - 1);
FOR i IN 1 .. len_str1
LOOP
a1 := SUBSTR(str1, i, 1);
IF m >= i
THEN
f := 1;
l := i + m;
ELSE
f := i - m;
l := i + m;
END IF;
IF l > max_len
THEN
l := max_len;
END IF;
FOR j IN f .. l
LOOP
a2 := SUBSTR(str2, j, 1);
IF (a2 = a1)
AND (f2(j) = FALSE)
THEN
common := common + 1;
f1(i) := TRUE;
f2(j) := TRUE;
GOTO linea_exit;
END IF;
END LOOP; -- j
<<linea_exit>>
NULL; -- Add NULL statement to avoid error
END LOOP; -- i
l := 1;
FOR i IN 1 .. len_str1
LOOP
IF f1(i)
THEN
FOR j IN l .. len_str2
LOOP
IF f2(j)
THEN
l := j + 1;
a1 := SUBSTR(str1, i, 1);
a2 := SUBSTR(str2, j, 1);
IF a1 <> a2
THEN
tr := tr + 0.5;
END IF;
GOTO j_loop;
END IF;
END LOOP; -- j
<<j_loop>>
NULL; -- Add NULL statement to avoid error
END IF;
END LOOP; -- i
wcd := 1 / 3;
wrd := 1 / 3;
wtr := 1 / 3;
IF common <> 0
THEN
jaro_value := wcd * common / len_str1 + wrd * common / len_str2 +
wtr * (common - tr) / common;
END IF;
RETURN jaro_value;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('EXCEPTION IN jaro - ' || SQLCODE || ': ' || SQLERRM);
DBMS_OUTPUT.PUT_LINE('Error stack at top level:');
DBMS_OUTPUT.PUT_LINE(dbms_utility.format_error_backtrace);
RETURN jaro_value;
END jaro;
Last edited by stewstryker : May 1st, 2007 at 09:02 AM. Reason: Fix URL |
![]() |
| Viewing: Dev Shed Forums > Programming Languages - More > Software Design > Jaro-Winkler Matching Algorithm |
| Thread Tools | Search this Thread |
| Display Modes | Rate This Thread |
|
|
|
|
|