Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old May 19th, 2003, 10:32 AM
thomasjkeegan thomasjkeegan is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2003
Posts: 3 thomasjkeegan User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Jaro-Winkler Matching Algorithm

Does anyone know where to get the source for this algorithm?

I've searched on the web, but there's nothing but science papers on the algorithm. I'm an average joe and some of the stuff in them science papers goes well over me head

So if any one has it could ya lend us a hand and pass it on, please.

Thanks,

Thomas

Reply With Quote
  #2  
Old May 19th, 2003, 01:11 PM
theFish theFish is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2003
Location: Oxford, England
Posts: 14 theFish User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
The source for which language would be useful to know for starters...!!!

Reply With Quote
  #3  
Old May 20th, 2003, 03:32 AM
thomasjkeegan thomasjkeegan is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2003
Posts: 3 thomasjkeegan User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
the language doesn't matter. I just write it into the language I want. but if you have it in C, C++, C#, Java or to be honost any language i'll take it cause its impossible to find.

Reply With Quote
  #4  
Old May 20th, 2003, 10:42 AM
HiFidelity HiFidelity is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2003
Location: UK
Posts: 13 HiFidelity User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
What is it?

Do you have links for the papers?

Reply With Quote
  #5  
Old May 20th, 2003, 10:55 AM
thomasjkeegan thomasjkeegan is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2003
Posts: 3 thomasjkeegan User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
Jaro Winkler algorithm is an improved version of the Edit Distance algorithm http://www.nist.gov/dads/HTML/editdistance.html

pretty much the same thing but with an example
URL


for an understanding of Jaro n' his mate Winklers Algorithm
URL

as for the paper I haven't seen any that explain the algorithm

Reply With Quote
  #6  
Old May 21st, 2003, 06:41 PM
icases icases is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: May 2003
Posts: 1 icases User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
VB Code

Hi Thomas,

A couple of month ago I've studied different kinds of approximate string matching algorithms, among them Jaro-Winkler. I've developed an VBAccess code to test them prior to developing the ABAP final application.

Basically, it computes what i've understood of the papers you mentioned...

Here is the code (beware because it may be buggy):
****************************
Function jaro(ByVal str1 As String, ByVal str2 As String) As Double
Dim l1, l2, lmin, lmax, m, i, j As Integer
Dim common As Integer
Dim tr As Double
Dim a1, a2 As String

str1 = clean_string(str1)
str2 = clean_string(str2)

l1 = Len(str1)
l2 = Len(str2)
If l1 > l2 Then
aux = l2
l2 = l1
l1 = aux
auxstr = str1
str1 = str2
str2 = auxstr
End If

lmin = l1
lmax = l2

Dim f1(), f2() As Boolean
ReDim f1(l1), f2(l2)

For i = 1 To l1
f1(i) = False
Next i
For j = 1 To l2
f2(j) = False
Next j

m = Int((lmax / 2) - 1)

common = 0
tr = 0

For i = 1 To l1
a1 = Mid(str1, i, 1)
If m >= i Then
f = 1
l = i + m
Else
f = i - m
l = i + m
End If
If l > lmax Then
l = lmax
End If

For j = f To l
a2 = Mid(str2, j, 1)
If (a2 = a1) And (f2(j) = False) Then
common = common + 1
f1(i) = True
f2(j) = True
GoTo linea_exit
End If
Next j
linea_exit:
Next i
Dim wcd, wrd, wtr As Double

l = 1
For i = 1 To l1
If f1(i) Then
For j = l To l2
If f2(j) Then
l = j + 1
a1 = Mid(str1, i, 1)
a2 = Mid(str2, j, 1)
If a1 <> a2 Then
tr = tr + 0.5
End If
Exit For
End If
Next j
End If
Next i

wcd = 1 / 3
wrd = 1 / 3
wtr = 1 / 3
If common <> 0 Then
jaro = wcd * common / l1 + wrd * common / l2 + wtr * (common - tr) / common
Else
jaro = 0
End If
End Function
**************************
Function clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
Function clean_string(str1 As String) As String
str1 = Replace(str1, ".", "")
str1 = Replace(str1, ",", "")
str1 = Replace(str1, "-", "")
str1 = Replace(str1, ";", "")
str1 = Replace(str1, ":", "")
str1 = Replace(str1, "Á", "A")
str1 = Replace(str1, "É", "E")
str1 = Replace(str1, "Í", "I")
str1 = Replace(str1, "Ó", "O")
str1 = Replace(str1, "Ú", "U")
str1 = Replace(str1, " COMPANY ", " CO ")
str1 = Replace(str1, " CORPORATION ", " CO ")
....
EndFunction

If you want I can give you my .mdb file so you can play with it.
Comments on this post
stewstryker agrees!

Reply With Quote
  #7  
Old May 22nd, 2003, 06:14 PM
HiFidelity HiFidelity is offline
Junior Member
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2003
Location: UK
Posts: 13 HiFidelity User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: < 1 sec
Reputation Power: 0
I can't make out from the papers or your source how the algorithm should take into account the fact that a mistake like s instead of z is much more likely than i instead of z - surely this would require some sort of table?

Also a really easy mistake to make is when you make the wrong letter double - eg foll instead of fool. I think this needs taking into account if you are making your own version.

How about putting your source on Sourceforge once it's sorted - it sounds like a really useful algorithm but it's not on the internet anywhere! 'Spira-C

Reply With Quote
  #8  
Old May 1st, 2007, 08:00 AM
stewstryker stewstryker is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Apr 2007
Location: Vermont
Posts: 1 stewstryker User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 10 m 52 sec
Reputation Power: 0
Oracle implementation

I converted icases' VB code to run on Oracle 10g R1. 10g R2 includes a Jaro-Winkler distance function (as part of the new UTL_MATCH package), but our production database won't be upgrading to that version for at least 6 months. I needed something a bit sooner, so I did the fairly simple conversion from VB to Oracle PL/SQL.

The problem is, it produces slightly different values than Oracle's UTL_MATCH.JARO_WINKLER() function.

Unfortunately I don't have the time to analyze the original function or icases' code to find the problem. If you solve it, please let me know at the email below.

Thanks,

Stew

Code:
CREATE OR REPLACE FUNCTION jaro(str1_in IN VARCHAR2, str2_in IN VARCHAR2) RETURN NUMBER AS
    /* Converted from VB code from:
        forums.devshed.com/software-design-43/jaro-winkler-matching-algorithm-62100 DOT html
        
        Oracle implementation by Stew Stryker (stewDOTstrykerATdartmouthDOTedu) 4/30/07
    */

    str1 VARCHAR2(4000);
    str2 VARCHAR2(4000);
    len_str1 PLS_INTEGER;
    len_str2 PLS_INTEGER;
    swap_len PLS_INTEGER;
    max_len PLS_INTEGER;
    m PLS_INTEGER;
    i PLS_INTEGER;
    j PLS_INTEGER;
    f PLS_INTEGER;
    l PLS_INTEGER;
    tr NUMBER := 0;
    a1 VARCHAR2(4000);
    a2 VARCHAR2(4000);
    swap_str VARCHAR2(4000);

    TYPE bool_list_t IS TABLE OF BOOLEAN INDEX BY PLS_INTEGER;
    f1 bool_list_t;
    f2 bool_list_t;

    wcd NUMBER;
    wrd NUMBER;
    wtr NUMBER;
    common NUMBER := 0;
    jaro_value NUMBER := 0;

    -- Local function
    -- clean_string normalizes the data as of COMPANY -> CO or CORPORATION -> CORP. Example:
    FUNCTION clean_string(p_str1 IN VARCHAR2) RETURN VARCHAR2 IS
        ret_value VARCHAR2(4000) := p_str1;
    BEGIN
        ret_value := REPLACE(ret_value, '.', '');
        ret_value := REPLACE(ret_value, ',', '');
        ret_value := REPLACE(ret_value, '-', '');
        ret_value := REPLACE(ret_value, ';', '');
        ret_value := REPLACE(ret_value, ':', '');
        ret_value := REPLACE(ret_value, 'Á', 'A');
        ret_value := REPLACE(ret_value, 'É', 'E');
        ret_value := REPLACE(ret_value, 'Í', 'I');
        ret_value := REPLACE(ret_value, 'Ó', 'O');
        ret_value := REPLACE(ret_value, 'Ú', 'U');
        ret_value := REPLACE(ret_value, ' COMPANY ', ' CO ');
        ret_value := REPLACE(ret_value, ' CORPORATION ', ' CO ');
        RETURN ret_value;
    END clean_string;

    -- MAIN
BEGIN
    str1 := clean_string(str1_in);
    str2 := clean_string(str2_in);

    len_str1 := LENGTH(str1);
    len_str2 := LENGTH(str2);

    IF len_str1 > len_str2
    THEN
        swap_len := len_str2;
        len_str2 := len_str1;
        len_str1 := swap_len;
        swap_str := str1;
        str1 := str2;
        str2 := swap_str;
    END IF;

    max_len := len_str2;

    FOR i IN 1 .. len_str1
    LOOP
        f1(i) := FALSE;
    END LOOP;

    FOR j IN 1 .. len_str2
    LOOP
        f2(j) := FALSE;
    END LOOP;

    m := ROUND((max_len / 2) - 1);

    FOR i IN 1 .. len_str1
    LOOP
        a1 := SUBSTR(str1, i, 1);
    
        IF m >= i
        THEN
            f := 1;
            l := i + m;
        ELSE
            f := i - m;
            l := i + m;
        END IF;
    
        IF l > max_len
        THEN
            l := max_len;
        END IF;
    
        FOR j IN f .. l
        LOOP
            a2 := SUBSTR(str2, j, 1);
            IF (a2 = a1)
               AND (f2(j) = FALSE)
            THEN
                common := common + 1;
                f1(i) := TRUE;
                f2(j) := TRUE;
                GOTO linea_exit;
            END IF;
        END LOOP; -- j
    
        <<linea_exit>>
        NULL; -- Add NULL statement to avoid error
    END LOOP; -- i

    l := 1;
    FOR i IN 1 .. len_str1
    LOOP
    
        IF f1(i)
        THEN
            FOR j IN l .. len_str2
            LOOP
                IF f2(j)
                THEN
                    l := j + 1;
                    a1 := SUBSTR(str1, i, 1);
                    a2 := SUBSTR(str2, j, 1);
                    IF a1 <> a2
                    THEN
                        tr := tr + 0.5;
                    END IF;
                    GOTO j_loop;
                END IF;
            END LOOP; -- j
            <<j_loop>>
            NULL; -- Add NULL statement to avoid error        
        END IF;
    END LOOP; -- i

    wcd := 1 / 3;
    wrd := 1 / 3;
    wtr := 1 / 3;

    IF common <> 0
    THEN
        jaro_value := wcd * common / len_str1 + wrd * common / len_str2 +
                      wtr * (common - tr) / common;
    END IF;

    RETURN jaro_value;
EXCEPTION
    WHEN OTHERS THEN
        DBMS_OUTPUT.PUT_LINE('EXCEPTION IN jaro - ' || SQLCODE || ': ' || SQLERRM);
        DBMS_OUTPUT.PUT_LINE('Error stack at top level:');
        DBMS_OUTPUT.PUT_LINE(dbms_utility.format_error_backtrace);
        RETURN jaro_value;
END jaro;

Last edited by stewstryker : May 1st, 2007 at 08:02 AM. Reason: Fix URL

Reply With Quote
  #9  
Old July 18th, 2010, 09:14 PM
dtcfl dtcfl is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Jul 2010
Location: So Cal
Posts: 1 dtcfl User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 16 m 51 sec
Reputation Power: 0
Jaro Winkler implementation in C#

I'm unsure where icases found the code that was to reproduce the string matching found in Mr. Winkler's original C function but, as HiFidelity said, it really cannot duplicate that algorithm without some kind of lookup table. I used the original code from the Census Bureau and implemented it in a CLR udf for SQL Server 2005/2008. It is a very rough translation of the original C and so it is not optimized but, it does reproduce the results of the original code almost exactly. I've posted a link to this at codeguru.

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Jaro-Winkler Matching Algorithm

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap