In a world that relies more and more on quick access to information, two application design criteria have become key:
- Data cleansing, to ensure the data being delivered is accurate, and
- User experience (UX) that accounts for potential missteps taken by users
This kind of UX can be complicated to implement. For instance, it may be simple for a human to realize at a glance that someone typing “New Yolk City” likely meant to type “New York City”. But drawing the same conclusion programmatically is far more challenging.
One answer to the reality of imperfect data and mistyped user input is to implement a fuzzy matching solution that can detect typos and alternate spellings. Fuzzy matching is an approximate string matching technique, which enables applications to programmatically determine the probability that two different strings are actually referring to the same thing.
This blog post will:
- Introduce you to fuzzy matching
- Provide a practical example of how to implement fuzzy matching in Python using the FuzzyWuzzy library
Get Started: Install Fuzzy Matching Tools With This Ready-To-Use Python Environment
To follow along with the code in this Python fuzzy matching tutorial, you’ll need to have a recent version of Python installed, along with all the packages used in this post. The quickest way to get up and running is to install the Fuzzy Matching runtime for Windows, Mac or Linux, which contains a version of Python and all the packages you’ll need.
In order to download the ready-to-use phishing detection Python environment, you will need to create an ActiveState Platform account. Just use your GitHub credentials or your email address to register. Signing up is easy and it unlocks the ActiveState Platform’s many benefits for you!
For Windows users, run the following at a CMD prompt to automatically download and install our CLI, the State Tool along with the COVID Simulation runtime into a virtual environment:
powershell -Command "& $([scriptblock]::Create((New-Object Net.WebClient).DownloadString('https://platform.activestate.com/dl/cli/install.ps1'))) -activate-default Pizza-Team/Fuzzy-Matching"
For Mac or Linux users, run the following to automatically download and install our CLI, the State Tool along with the COVID Simulation runtime into a virtual environment:
sh <(curl -q https://platform.activestate.com/dl/cli/install.sh) --activate-default Pizza-Team/Fuzzy-Matching
As mentioned above, fuzzy matching is an approximate string-matching technique to programatically match similar data. Instead of simply looking at equivalency between two strings to determine if they are the same, fuzzy matching algorithms work to quantify exactly how close two strings are to one another. In doing so, they can help determine the likelihood that two different strings were actually meant to be equivalent. This is often done by incorporating edit distance.
Edit distance is a string metric. This metric provides a manner for detecting the closeness of two strings to one another by identifying the minimum number of alterations that must occur to transform one string into the other. Given the example at the beginning of this piece, “New York City” vs. “New Yolk City”, one can easily tell that simply switching a single letter in the second string (the “l” to an “r”) results in these two strings being the same. Meaning the edit distance is relatively low, and these two strings are very close to one another. In other words, there is a high likelihood that these strings were meant to be the same.
Fuzzy matching has multiple use cases, many of which we encounter on a regular basis. Type “New Yolk” into a GPS app, for example, and it’ll likely yield the suggestion “New York.” Slightly misspell a search term in your favorite search engine, and you’ll likely be provided with results for the term for which you actually meant to perform the search. In other words, implementations leveraging some form of fuzzy matching are all around us, and many times they mean the difference between a positive user experience and a negative one.
Now, let’s take a look at implementing fuzzy matching in Python, using the open source library FuzzyWuzzy.
Fuzzy Matching With Python FuzzyWuzzy
FuzzyWuzzy, an open source string matching library for Python developers, was first developed by SeatGeek to help decipher whether or not two similarly named ticket listings were for the same event. FuzzyWuzzy evaluates the Levenshtein distance (a version of edit distance that accounts for character insertions, deletions and substitutions) to make this possible. In addition, FuzzyWuzzy contains functionality for evaluating string similarity in other circumstances that we’ll touch on below.
Evaluating string similarity with the fuzz.ratio function
To evaluate two different strings using edit distance, we’ll use the fuzz.ratio function within FuzzyWuzzy’s fuzz module. This function returns a similarity score as a value between 0 and 100. The closer the value is to 100, the more similar the two strings are. For example, let’s compare two strings that are identical to one another:
from fuzzywuzzy import fuzz value = fuzz.ratio('New York', 'New York') print('value: ' + str(value))
Executing this script results in the following output:
Now, let’s take a look at ‘New Yolk’ vs. ‘New York’ and see what is returned by the ‘ratio’ function:
value = fuzz.ratio('New Yolk', 'New York') print('value: ' + str(value))
And here is the output:
With just one difference in the relatively short strings of “New York” and “New Yolk”, the returned value falls from 100 to 88. Still, this value indicates that the two strings are highly similar to one another. As a final test, let’s replace ‘New Yolk’ with ‘New Zealand’. And compare this string to ‘New York’.
value = fuzz.ratio('New Zealand', 'New York') print('value: ' + str(value))
This results in the following output:
This lower value indicates a more significant difference between the two strings.
With that said, while fuzz.ratio works well in many situations, it may not be the best option for evaluating similarity between strings with partial matches. For this, FuzzyWuzzy contains the function partial_ratio that may be more applicable.
Ignoring token order in your evaluation
Now, let’s consider the situation in which two strings are provided in differing order. With FuzzyWuzzy, these can be evaluated to return a useful similarity score using the token_sort_ratio function.
value = fuzz.token_sort_ratio('To be or not to be', 'To be not or to be')
The above code returns a value of 100. Essentially, the two strings are tokenized, re-ordered in the same fashion, and evaluated using the fuzz.ratio function. This can prove useful in a variety of cases, including:
- Searching for a famous quote that has been accidentally typed in an incorrect order
- The case of SeatGeek, who were trying to aggregate tickets offered by multiple vendors whose description of the sporting event varied widely.
Consider the following code snippet that also returns a value of 100:
value = fuzz.token_sort_ratio('Chiefs vs. Raiders', 'Raiders vs. Chiefs')
The above functionality represents just a small subset of what FuzzyWuzzy has to offer. To further evaluate its functionality, check out the README, and give it a chance to see how it can help bolster your fuzzy matching implementation.
Although fuzzy matching is far from a perfect science, it can provide application developers with a starting point for cleansing cluttered datasets and developing a more resilient, versatile and intuitive user experience. Fine-tuning a fuzzy matching implementation will almost always require some serious thought, as well as a mixture of different fuzzy matching techniques. But, for any application that must evaluate user text input, or for a dataset in which duplicate entries are an ever-present problem, the juice is certainly worth the squeeze.