Fuzzy String Matching in Python
In this tutorial, you will learn how to approximately match strings and determine how similar they are by going over various examples.
Have you ever wanted to compare strings that were referring to the same thing, but they were written slightly different, had typos or were misspelled? This is a problem that you can encounter in a variety of situations ranging from spelling check software to mapping databases that lack a common key (think of trying to join two tables by company name, and these appear differently in both tables).
For this tutorial, I will focus on a case study in which the database problem mentioned above was addressed. However, before we start, it would be beneficial to show how we can fuzzy match strings. Normally, when you compare strings in Python you can do the following:
%%capture
!pip install -r requirements.txt
Str1 = "Apple Inc."
Str2 = "Apple Inc."
Result = Str1 == Str2
print(Result)
In this case, the variable Result will print True since the strings are an exact match (100% similarity), but see what happens if the case of Str2 changes:
Str1 = "Apple Inc."
Str2 = "apple Inc."
Result = Str1 == Str2
print(Result)
This time we got a False even though the strings look quite the same to the human eye. However, this problem is very straightforward to solve as we can coerce both strings to lower case:
Str1 = "Apple Inc."
Str2 = "apple Inc."
Result = Str1.lower() == Str2.lower()
print(Result)
Nevertheless, this happiness ends as soon as a character is off. For example:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Result = Str1.lower() == Str2.lower()
print(Result)
Situations like the one above can, at times, appear on databases that have been created based on human data entry and in these cases we need more powerful tools to compare strings. One of these tools is called the Levenshtein distance.
The Levenshtein Distance
The Levenshtein distance is a metric to measure how much apart two sequences of words are. In other words, it measures the minimum number of edits that you need to do to change a one-word sequence into the other. These edits can be insertions, deletions or substitutions. This metric was named after Vladimir Levenshtein, who originally considered it in 1965.
The formal definition of the Levenshtein distance between two strings
Where
It is also possible to calculate the Levenshtein similarity ratio based on the Levenshtein distance. This can be done using the following formula:
where
Below you can see how such a formula can be implemented from scratch using a Python function:
import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
""" levenshtein_ratio_and_distance:
Calculates levenshtein distance between two strings.
If ratio_calc = True, the function computes the
levenshtein distance ratio of similarity between two strings
For all i and j, distance[i,j] will contain the Levenshtein
distance between the first i characters of s and the
first j characters of t
"""
# Initialize matrix of zeros
rows = len(s)+1
cols = len(t)+1
distance = np.zeros((rows,cols),dtype = int)
# Populate matrix of zeros with the indeces of each character of both strings
for i in range(1, rows):
for k in range(1,cols):
distance[i][0] = i
distance[0][k] = k
# Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions
for col in range(1, cols):
for row in range(1, rows):
if s[row-1] == t[col-1]:
cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
else:
# In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
# the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
if ratio_calc == True:
cost = 2
else:
cost = 1
distance[row][col] = min(distance[row-1][col] + 1, # Cost of deletions
distance[row][col-1] + 1, # Cost of insertions
distance[row-1][col-1] + cost) # Cost of substitutions
if ratio_calc == True:
# Computation of the Levenshtein Distance Ratio
Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
return Ratio
else:
# print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
# insertions and/or substitutions
# This is the minimum number of edits needed to convert string a to string b
return "The strings are {} edits away".format(distance[row][col])
If we apply this function to the earlier example where we where tryng to compare "Apple Inc." to "apple Inc" we will see that these two strinsg are very likley the same since the Levenshtein distance is very small.
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1,Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
print(Ratio)
As you can see, the function found the 2 differences between the two strings. These were the upper/lower case a and the full stop (period) at the end of the first string as well as a similarity ratio of 84%, which is pretty high. Before moving on further though, I would like to highlight an important notion. If you do very simple string preprocessing before calculating distance you will see the following:
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
Just like that, the distance has been reduced by 1 simply by turning the strings to lower case before comparing and the similarity ratio to almost 95%. This emphasizes the relevance of string preprocessing before performing calculations. If you were, say, choosing if a string is similar to another one based on a similarity threshold of 90%, then "Apple Inc." and "apple Inc" without preprocessing would be marked as not similar.
Even though the example above is a valid way of implementing a function to calculate Levenshtein distance, there is a simpler alternative in Python in the form of the Levenshtein package. The Levenshtein package contains two functions that do the same as the user-defined function above. An example is shown below.
import Levenshtein as lev
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = lev.distance(Str1.lower(),Str2.lower()),
print(Distance)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
print(Ratio)
Note: You may have noticed that in the user-defined function above there is a comment mentioning that, when the similarity ratio is calculated, the cost of a substitution is 2 instead of 1. That is because lev.ratio() assigns such a cost for a substitution (Think about a substitution as having to do a deletion and an insertion at the same time). However, it is important to note that a substitution has a cost of 1 in lev.distance(). Thus, it is important to keep this in mind if you try to verify the function's outputs manually.
The FuzzyWuzzy Package
This package may have a funny name, but it can be your best friend when the standard Levenshtein distance ratio of similarity between two strings falls short. So far the example that I have been using with "Apple Inc." and "apple Inc" has been relatively simple. After all, there is just one full stop/period of difference if you turn both strings to lower case. However, what happens when something is spelled out of order? What happens when something has considerable spelling variation, but yet it refers to the same thing? That's where the FuzzyWuzzy package comes in since it has functions that allow our fuzzy matching scripts to handle these sorts of cases.
Let's start simple. FuzzyWuzzy has, just like the Levenshtein package, a ratio function that computes the standard Levenshtein distance similarity ratio between two sequences. You can see an example below:
from fuzzywuzzy import fuzz
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)
That ratio of similarity is the same as we expected given the other examples above. However, fuzzywuzzy has more powerful functions that allow us to deal with more complex situations such as substring matching. Here is an example:
Str1 = "Los Angeles Lakers"
Str2 = "Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
fuzz.partial_ratio() is capable of detecting that both strings are referring to the Lakers. Thus, it yields 100% similarity. The way this works is by using an "optimal partial" logic. In other words, if the short string has length
Nevertheless, this approach is not foolproof. What happens when the strings comparison the same, but they are in a different order? Luckily for us, fuzzywuzzy has a solution. You can see the example below: