#1
  1. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    3
    Rep Power
    0

    Jaro-Winkler Matching Algorithm


    Does anyone know where to get the source for this algorithm?

    I've searched on the web, but there's nothing but science papers on the algorithm. I'm an average joe and some of the stuff in them science papers goes well over me head

    So if any one has it could ya lend us a hand and pass it on, please.

    Thanks,

    Thomas
  2. #2
  3. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Location
    Oxford, England
    Posts
    14
    Rep Power
    0
    The source for which language would be useful to know for starters...!!!
  4. #3
  5. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    3
    Rep Power
    0
    the language doesn't matter. I just write it into the language I want. but if you have it in C, C++, C#, Java or to be honost any language i'll take it cause its impossible to find.
  6. #4
  7. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2003
    Location
    UK
    Posts
    13
    Rep Power
    0
    What is it?

    Do you have links for the papers?
  8. #5
  9. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    3
    Rep Power
    0
    Jaro Winkler algorithm is an improved version of the Edit Distance algorithm http://www.nist.gov/dads/HTML/editdistance.html

    pretty much the same thing but with an example
    http://www.merriampark.com/ld.htm


    for an understanding of Jaro n' his mate Winklers Algorithm
    http://www.choicemaker.com/MEDD_2001...Conference.pdf

    as for the paper I haven't seen any that explain the algorithm
  10. #6
  11. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2003
    Posts
    1
    Rep Power
    0

    VB Code


    Hi Thomas,

    A couple of month ago I've studied different kinds of approximate string matching algorithms, among them Jaro-Winkler. I've developed an VBAccess code to test them prior to developing the ABAP final application.

    Basically, it computes what i've understood of the papers you mentioned...

    Here is the code (beware because it may be buggy):
    ****************************
    Function jaro(ByVal str1 As String, ByVal str2 As String) As Double
    Dim l1, l2, lmin, lmax, m, i, j As Integer
    Dim common As Integer
    Dim tr As Double
    Dim a1, a2 As String

    str1 = clean_string(str1)
    str2 = clean_string(str2)

    l1 = Len(str1)
    l2 = Len(str2)
    If l1 > l2 Then
    aux = l2
    l2 = l1
    l1 = aux
    auxstr = str1
    str1 = str2
    str2 = auxstr
    End If

    lmin = l1
    lmax = l2

    Dim f1(), f2() As Boolean
    ReDim f1(l1), f2(l2)

    For i = 1 To l1
    f1(i) = False
    Next i
    For j = 1 To l2
    f2(j) = False
    Next j

    m = Int((lmax / 2) - 1)

    common = 0
    tr = 0

    For i = 1 To l1
    a1 = Mid(str1, i, 1)
    If m >= i Then
    f = 1
    l = i + m
    Else
    f = i - m
    l = i + m
    End If
    If l > lmax Then
    l = lmax
    End If

    For j = f To l
    a2 = Mid(str2, j, 1)
    If (a2 = a1) And (f2(j) = False) Then
    common = common + 1
    f1(i) = True
    f2(j) = True
    GoTo linea_exit
    End If
    Next j
    linea_exit:
    Next i
    Dim wcd, wrd, wtr As Double

    l = 1
    For i = 1 To l1
    If f1(i) Then
    For j = l To l2
    If f2(j) Then
    l = j + 1
    a1 = Mid(str1, i, 1)
    a2 = Mid(str2, j, 1)
    If a1 <> a2 Then
    tr = tr + 0.5
    End If
    Exit For
    End If
    Next j
    End If
    Next i

    wcd = 1 / 3
    wrd = 1 / 3
    wtr = 1 / 3
    If common <> 0 Then
    jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common
    Else
    jaro = 0
    End If
    End Function
    **************************
    Function clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
    Function clean_string(str1 As String) As String
    str1 = Replace(str1, ".", "")
    str1 = Replace(str1, ",", "")
    str1 = Replace(str1, "-", "")
    str1 = Replace(str1, ";", "")
    str1 = Replace(str1, ":", "")
    str1 = Replace(str1, "", "A")
    str1 = Replace(str1, "", "E")
    str1 = Replace(str1, "", "I")
    str1 = Replace(str1, "", "O")
    str1 = Replace(str1, "", "U")
    str1 = Replace(str1, " COMPANY ", " CO ")
    str1 = Replace(str1, " CORPORATION ", " CO ")
    ....
    EndFunction

    If you want I can give you my .mdb file so you can play with it.

    Comments on this post

    • stewstryker agrees
  12. #7
  13. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2003
    Location
    UK
    Posts
    13
    Rep Power
    0
    I can't make out from the papers or your source how the algorithm should take into account the fact that a mistake like s instead of z is much more likely than i instead of z - surely this would require some sort of table?

    Also a really easy mistake to make is when you make the wrong letter double - eg foll instead of fool. I think this needs taking into account if you are making your own version.

    How about putting your source on Sourceforge once it's sorted - it sounds like a really useful algorithm but it's not on the internet anywhere! 'Spira-C
  14. #8
  15. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2007
    Location
    Vermont
    Posts
    1
    Rep Power
    0

    Oracle implementation


    I converted icases' VB code to run on Oracle 10g R1. 10g R2 includes a Jaro-Winkler distance function (as part of the new UTL_MATCH package), but our production database won't be upgrading to that version for at least 6 months. I needed something a bit sooner, so I did the fairly simple conversion from VB to Oracle PL/SQL.

    The problem is, it produces slightly different values than Oracle's UTL_MATCH.JARO_WINKLER() function.

    Unfortunately I don't have the time to analyze the original function or icases' code to find the problem. If you solve it, please let me know at the email below.

    Thanks,

    Stew

    Code:
    CREATE OR REPLACE FUNCTION jaro(str1_in IN VARCHAR2, str2_in IN VARCHAR2) RETURN NUMBER AS
        /* Converted from VB code from:
            forums.devshed.com/software-design-43/jaro-winkler-matching-algorithm-62100 DOT html
            
            Oracle implementation by Stew Stryker (stewDOTstrykerATdartmouthDOTedu) 4/30/07
        */
    
        str1 VARCHAR2(4000);
        str2 VARCHAR2(4000);
        len_str1 PLS_INTEGER;
        len_str2 PLS_INTEGER;
        swap_len PLS_INTEGER;
        max_len PLS_INTEGER;
        m PLS_INTEGER;
        i PLS_INTEGER;
        j PLS_INTEGER;
        f PLS_INTEGER;
        l PLS_INTEGER;
        tr NUMBER := 0;
        a1 VARCHAR2(4000);
        a2 VARCHAR2(4000);
        swap_str VARCHAR2(4000);
    
        TYPE bool_list_t IS TABLE OF BOOLEAN INDEX BY PLS_INTEGER;
        f1 bool_list_t;
        f2 bool_list_t;
    
        wcd NUMBER;
        wrd NUMBER;
        wtr NUMBER;
        common NUMBER := 0;
        jaro_value NUMBER := 0;
    
        -- Local function
        -- clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
        FUNCTION clean_string(p_str1 IN VARCHAR2) RETURN VARCHAR2 IS
            ret_value VARCHAR2(4000) := p_str1;
        BEGIN
            ret_value := REPLACE(ret_value, '.', '');
            ret_value := REPLACE(ret_value, ',', '');
            ret_value := REPLACE(ret_value, '-', '');
            ret_value := REPLACE(ret_value, ';', '');
            ret_value := REPLACE(ret_value, ':', '');
            ret_value := REPLACE(ret_value, '', 'A');
            ret_value := REPLACE(ret_value, '', 'E');
            ret_value := REPLACE(ret_value, '', 'I');
            ret_value := REPLACE(ret_value, '', 'O');
            ret_value := REPLACE(ret_value, '', 'U');
            ret_value := REPLACE(ret_value, ' COMPANY ', ' CO ');
            ret_value := REPLACE(ret_value, ' CORPORATION ', ' CO ');
            RETURN ret_value;
        END clean_string;
    
        -- MAIN
    BEGIN
        str1 := clean_string(str1_in);
        str2 := clean_string(str2_in);
    
        len_str1 := LENGTH(str1);
        len_str2 := LENGTH(str2);
    
        IF len_str1 > len_str2
        THEN
            swap_len := len_str2;
            len_str2 := len_str1;
            len_str1 := swap_len;
            swap_str := str1;
            str1 := str2;
            str2 := swap_str;
        END IF;
    
        max_len := len_str2;
    
        FOR i IN 1 .. len_str1
        LOOP
            f1(i) := FALSE;
        END LOOP;
    
        FOR j IN 1 .. len_str2
        LOOP
            f2(j) := FALSE;
        END LOOP;
    
        m := ROUND((max_len / 2) - 1);
    
        FOR i IN 1 .. len_str1
        LOOP
            a1 := SUBSTR(str1, i, 1);
        
            IF m >= i
            THEN
                f := 1;
                l := i + m;
            ELSE
                f := i - m;
                l := i + m;
            END IF;
        
            IF l > max_len
            THEN
                l := max_len;
            END IF;
        
            FOR j IN f .. l
            LOOP
                a2 := SUBSTR(str2, j, 1);
                IF (a2 = a1)
                   AND (f2(j) = FALSE)
                THEN
                    common := common + 1;
                    f1(i) := TRUE;
                    f2(j) := TRUE;
                    GOTO linea_exit;
                END IF;
            END LOOP; -- j
        
            <<linea_exit>>
            NULL; -- Add NULL statement to avoid error
        END LOOP; -- i
    
        l := 1;
        FOR i IN 1 .. len_str1
        LOOP
        
            IF f1(i)
            THEN
                FOR j IN l .. len_str2
                LOOP
                    IF f2(j)
                    THEN
                        l := j + 1;
                        a1 := SUBSTR(str1, i, 1);
                        a2 := SUBSTR(str2, j, 1);
                        IF a1 <> a2
                        THEN
                            tr := tr + 0.5;
                        END IF;
                        GOTO j_loop;
                    END IF;
                END LOOP; -- j
                <<j_loop>>
                NULL; -- Add NULL statement to avoid error        
            END IF;
        END LOOP; -- i
    
        wcd := 1 / 3;
        wrd := 1 / 3;
        wtr := 1 / 3;
    
        IF common <> 0
        THEN
            jaro_value := wcd * common / len_str1 + wrd * common / len_str2 +
                          wtr * (common - tr) / common;
        END IF;
    
        RETURN jaro_value;
    EXCEPTION
        WHEN OTHERS THEN
            DBMS_OUTPUT.PUT_LINE('EXCEPTION IN jaro - ' || SQLCODE || ': ' || SQLERRM);
            DBMS_OUTPUT.PUT_LINE('Error stack at top level:');
            DBMS_OUTPUT.PUT_LINE(dbms_utility.format_error_backtrace);
            RETURN jaro_value;
    END jaro;
    Last edited by stewstryker; May 1st, 2007 at 09:02 AM. Reason: Fix URL
  16. #9
  17. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2010
    Location
    So Cal
    Posts
    1
    Rep Power
    0

    Jaro Winkler implementation in C#


    I'm unsure where icases found the code that was to reproduce the string matching found in Mr. Winkler's original C function but, as HiFidelity said, it really cannot duplicate that algorithm without some kind of lookup table. I used the original code from the Census Bureau and implemented it in a CLR udf for SQL Server 2005/2008. It is a very rough translation of the original C and so it is not optimized but, it does reproduce the results of the original code almost exactly. I've posted a link to this at codeguru.

IMN logo majestic logo threadwatch logo seochat tools logo