Talk:Jaro–Winkler distance

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

???

This article has not yet received a rating on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

This seems to be backwards[edit]

The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 0 equates to no similarity and 1 is an exact match. The Jaro–Winkler similarity is the inversion, (1 − Jaro–Winkler distance).

Multiple matching characters[edit]

The definition as currently provided does not seem to specify what to do in case of multiple possibilities for character matching. For example the S's in the pair $({\mathit {JOHNS_{1}ONS_{2}}},{\mathit {JOHNS_{3}S_{4}ONS_{5}}})$ can be matched as $S_{1}=S_{3},S_{2}=S_{4}$ or as $S_{1}=S_{3},S_{2}=S_{5}$ . The first option yields $m=8,t=1$ while the second option yields $m=8,t=0$ . The definition given by William Winkler in the paper "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage" (1990) also does not seem to provide a choice between greedy or optimal matching. Should this be mentioned on the current page? 132.229.128.62 (talk) 16:11, 27 January 2014 (UTC)[reply]

adjwt[edit]

This makes no mention of the adjacent-weighting array (adjwt in Winkler's strcomp95.c). Is this additional function simply a fudge, or a fundamental part of the algorithm that should be mentioned? 82.2.93.1 (talk) 23:35, 9 December 2007 (UTC)[reply]

I've taken the bull by the other foot, and added a note to the article. 82.2.93.1 (talk) 23:36, 9 December 2007 (UTC)[reply]

As now clarified in the article, the adjwt array and string length compensation are not a part of the published versions of this algorithm, therefore are not reflected in the examples, even though they are in the reference code. 66.220.23.57 (talk) 16:38, 11 December 2007 (UTC)[reply]

Number of transpositions[edit]

I dont understand why t values are zero in (DWAYNE, DUANE) and (DIXON, DICKSONX) examples. I think that part is not clear. The character N and E in DWAYNE are found in DUANE, but in different positions. But it is written that t is zero not 2. If someone can clarify this, I will apreciate it. Bitkidoku (talk) 20:39, 13 May 2008 (UTC)[reply]

The definition of transposition given in the article is very unclear. In comparing CRATE with TRACE we have that all the letters match, but they are in a different order. By swapping the C and the T, we can turn CRATE into TRACE. Such a swap of two elements is called a transposition. In DwAyNE versus DuANE the matching letters are already in the same order D-A-N-E, so no transpositions are needed. --Lambiam 02:19, 19 May 2008 (UTC)[reply]

I agree. The definition of transposition confused me also. I have found that (to the best of my memory) when I wrote an algorithm to compute this, I took t to be the count of letters which where within the max distance allowed for 'matching characters' but characters that did not match exactly. E.g. in the words CAT and ACT, T is a perfect match while A and C are matching but each require a transportation, so t = 2 here. Then, when it comes to actually computing the Jaro distance (or its extensions), simply divide t by 2 before using it. E.g. Set t = t/2. This achieves the desired result but 1) makes any algorithm easier to compute 2) makes it easier to understand. Hope this helps. 131.181.251.10 (talk) 01:59, 25 August 2011 (UTC)[reply]

Question[edit]

One thing is not clear for me : Is Jaro distance (not Jaro-Winkler distance) a metric in the mathematical sense of that term, or not ? --Mrówkotermit (talk) 08:55, 8 June 2009 (UTC)[reply]

Recommend Geometric Mean When Normalizing[edit]

In my experience, the geometric mean gives a much more informative result than the arithmetic mean when normalizing (heterogeneous) numbers. I recommend a variation of the Jaro-Winkler distance, call it the Jaro-Winkler-Young method, if you will:

d_{y}=\left({\frac {m}{|s_{1}|}}*{\frac {m}{|s_{2}|}}*{\frac {m-t}{m}}\right)

Consider the strings DESSERTS and STRESSED, $s_{1}$ and $s_{2}$ respectively. Because the match distance is $3=\left\lfloor {\frac {\max(8,8)}{2}}\right\rfloor -1$ , $m=5=|ESSER|$ and $t=1=|T|$ in this case.

The Jaro-Winkler method gives the result

d_{j}={\frac {1}{3}}\left({\frac {5}{8}}+{\frac {5}{8}}+{\frac {5-1}{5}}\right)={\frac {41}{60}}=0.683

In my opinion, this is a rather high number (in the range of [0,1]) considering the dissimilarity of the strings.

The Jaro-Winkler-Young method gives the result

d_{y}=\left({\frac {5}{8}}*{\frac {5}{8}}*{\frac {5-1}{5}}\right)={\frac {5}{16}}=0.3125

MarkMYoung (talk) 23:40, 2 March 2010 (UTC)[reply]

Mark, I like your addition to this concept. It makes sense. However, I think m would be 6 in this example, not 5 because the first 'S' in 'STRESSED' would match with the first 'S' in 'DESSERTS', and likewise with the second and third S's. So all three S's match plus the two E's and the R. The code I've written which works for all the other examples says its 6 and I've followed it through by hand, but correct me if I'm wrong. 203.9.185.137 (talk) 01:37, 9 August 2011 (UTC)[reply]

I have implemented your formula Mark and tested it on many strings side-by-side to the Jaro-Winkler distance measure and found that your measure tends to penalise very heavily making it very hard to determine a good cut-off for deciding what is considered a close match and what is not. I have found that for the Jaro-Winkler score, a value of about 0.92 tends to be a very good margin - any pair of strings with 0.92 <= JW value < 1 are usually close matches while JW values <0.92 tend to not be similar. (I ran this test on about 600 first+last names which I knew had not been cleaned and contained 'duplicates' as a result of typos, etc.) 121.222.138.79 (talk) 00:15, 22 August 2011 (UTC)[reply]

Why is the floor function in the CRATE-TRACE example giving a non-integer?[edit]

In the CRATE-TRACE example, it says "...in comparing CRATE with TRACE, only 'R' 'A' 'E' are the matching characters, i.e, m=3. Although 'C', 'T' appear in both strings, they are farther than 1.5, i.e., (5/2)-1=1.5. Therefore, t=0." Am I misinterpretting the square brackets with tops missing (at least they are missing on my screen) as floor functions, because this should give a value of 1, not 1.5. That is, Floor(5/2)-1 = 2-1 = 1. Can anyone give me an explanation please? 203.9.185.137 (talk) 04:11, 2 August 2011 (UTC)[reply]

In the C source on census.gov the search range value is an integer, so 1.5 is impossible.

But I have a different question... [s]Maybe I implemented the code wrong, but it appears that the C source will give you different scores based on the order in which you pass the two strings. This seems to be because the step where you find the matching letters is only done from one string's perspective. Anyone agree? [/s] Nevermind I had an error in my code. 67.91.189.250 (talk) 03:02, 4 August 2011 (UTC)[reply]

Formula flawed for m = 0[edit]

I'm writing my own code so that my colleagues can run this in Excel. The problem I'm getting is when two strings have nothing in common given a value m = 0 which causes the code to fail. Looking at the formula given for the Jaro score, division by m when m is zero seems to be an oversight. I'm guessing this should equate to a Jaro score of zero. This should perhaps be mentioned somewhere in the article, because as it stands the formula is flawed. 203.9.185.137 (talk) 23:29, 8 August 2011 (UTC)[reply]

You're right. It is flawed the way it is written. I guess it should be written as a split-definition function where its equal to the given formula for m >= 1 and 0 otherwise. If I was more comfortable with LaTeX code I would fix this but alas... 121.222.138.79 (talk) 00:08, 22 August 2011 (UTC)[reply]

done. Syneil (talk) 12:45, 17 July 2012 (UTC)[reply]

It does not make any sense. If m=0, then stings are not similar at all distance should be 1. Alexei Kopylov (talk) 02:54, 24 March 2017 (UTC)[reply]

Notability[edit]

I added a journal article that is a survey of string distance metrics (Cohen, Ravikumar, & Fienberg, 2003). Jaro-Winkler is discussed in the second page of the article. Between that and Winkler's survey article from 2006, is this sufficient to remove the Notability warning template? Showeropera (talk) 21:13, 27 August 2014 (UTC)[reply]

I'm not sure why the Notability warning template is still there. Google it. It's notable already... I use it every day. I wish I felt empowered to remove a warning template. Who is bold enough to do it? Joe (talk) 17:27, 6 February 2015 (UTC)[reply]

I came here because this is used in Solr; seems quite notable and useful. I removed the tag. @Joedeshon: Feel free to remove such tags in the future. At worst someone else will put it back. 8) -- Beland (talk) 17:48, 17 July 2015 (UTC)[reply]

Named distance, described as similarity[edit]

The article clearly presents it as a string distance:

> The lower the Jaro–Winkler distance for two strings is, the more similar the strings are. The score is normalized such that 1 equates to no similarity and 0 is an exact match. The Jaro–Winkler similarity is given by 1 − Jaro–Winkler distance.

That looks plain untrue tome. The formulas clearly say the opposite: 1 equates to exact match, 0 equates to no similarity. What formulas describe here is a similarity measure, not a distance measure.

Cornuz (talk) 13:29, 5 October 2017 (UTC)[reply]

Jaro-Winkler not satisfying the identity axiom[edit]

It is claimed that "the Jaro–Winkler distance [also] does not satisfy the identity axiom $d(x,y)=0\leftrightarrow x=y$ .". Is this generally true, or true only if $p\geq 0.25$ (the article also limits $l$ from being greater than 4)? If that's the case, then I think it is worth qualifying that claim with this bit of information.

I see the Jaro-Wrinkler distance here the first, so I didn't want to edit it myself.

— manual signature: comment added by ThoAppelsin (talk • contribs) 14:45, 28 July 2020 (UTC)[reply]

I came here to mention this also. If $sim_{w}=1$ then $lp(1-sim_{j})=1-sim_{j}$ and hence either $sim_{j}=1$ or $lp=1$ . The restriction on $p$ should really be that it is strictly less than 0.25 (or $1/l$ ), as mentioned in the reference listed, and under this restriction the Jaro-Winkler distance does satisfy the identity axiom. 203.161.95.162 (talk) 20:22, 18 September 2023 (UTC)Sacha Roscoe[reply]

Calculation of common characters[edit]

The paper Overview of Record Linkage and Current Research Directions by the author of the Jaro-Winkler similarity William E. Winkler defines common characters on page 11 as:

   The definition of common is that the agreeing character must be within half the length of the shorter string

This definition differs from the one used in the article. Does someone know why there are different definitions? — Preceding unsigned comment added by Wynk (talk • contribs) 12:19, 10 December 2020 (UTC)[reply]

Website to play with Jaro Winkler[edit]

This Website is nice to play with different algorithms like Jaro Winkler: https://asecuritysite.com/forensics/simstring

46.126.16.87 (talk) 09:36, 3 September 2022 (UTC) Landev[reply]

Odd number for transpositions[edit]

I wish an example is provided where the number of transpositions before halving is odd. Some implementations do integer division, others do float division. Vitcra (talk) 14:29, 21 June 2023 (UTC)[reply]

Computational complexity[edit]

I came to this page hoping to find the computational complexity (i.e., "big O") but didn't see it. I think adding the computational complexity would be a very helpful addition. Some of the other string similarity algorithm pages list complexity (like Smith–Waterman algorithm), which I appreciated. Showeropera (talk) 14:23, 12 December 2023 (UTC)[reply]