Fuzzy String Matching in Python
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Fuzzy String Matching in Python

    In this tutorial, you will learn how to approximately match strings and determine how similar they are by going over various examples.

    Have you ever wanted to compare strings that were referring to the same thing, but they were written slightly different, had typos or were misspelled? This is a problem that you can encounter in a variety of situations ranging from spelling check software to mapping databases that lack a common key (think of trying to join two tables by company name, and these appear differently in both tables).

    For this tutorial, I will focus on a case study in which the database problem mentioned above was addressed. However, before we start, it would be beneficial to show how we can fuzzy match strings. Normally, when you compare strings in Python you can do the following:

    %%capture
    !pip install -r requirements.txt
    Str1 = "Apple Inc."
    Str2 = "Apple Inc."
    Result = Str1 == Str2
    print(Result)

    In this case, the variable Result will print True since the strings are an exact match (100% similarity), but see what happens if the case of Str2 changes:

    Str1 = "Apple Inc."
    Str2 = "apple Inc."
    Result = Str1 == Str2
    print(Result)

    This time we got a False even though the strings look quite the same to the human eye. However, this problem is very straightforward to solve as we can coerce both strings to lower case:

    Str1 = "Apple Inc."
    Str2 = "apple Inc."
    Result = Str1.lower() == Str2.lower()
    print(Result)

    Nevertheless, this happiness ends as soon as a character is off. For example:

    Str1 = "Apple Inc."
    Str2 = "apple Inc"
    Result = Str1.lower() == Str2.lower()
    print(Result)

    Situations like the one above can, at times, appear on databases that have been created based on human data entry and in these cases we need more powerful tools to compare strings. One of these tools is called the Levenshtein distance.

    The Levenshtein Distance

    The Levenshtein distance is a metric to measure how much apart two sequences of words are. In other words, it measures the minimum number of edits that you need to do to change a one-word sequence into the other. These edits can be insertions, deletions or substitutions. This metric was named after Vladimir Levenshtein, who originally considered it in 1965.

    The formal definition of the Levenshtein distance between two strings and can be seen as follows:

    Where denotes 0 when and 1 otherwise. It is important to note that the rows on the minimum above correspond to a deletion, an insertion, and a substitution in that order.

    It is also possible to calculate the Levenshtein similarity ratio based on the Levenshtein distance. This can be done using the following formula:

    where and are the lengths of sequence and sequence respectively.

    Below you can see how such a formula can be implemented from scratch using a Python function:

    import numpy as np
    def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
        """ levenshtein_ratio_and_distance:
            Calculates levenshtein distance between two strings.
            If ratio_calc = True, the function computes the
            levenshtein distance ratio of similarity between two strings
            For all i and j, distance[i,j] will contain the Levenshtein
            distance between the first i characters of s and the
            first j characters of t
        """
        # Initialize matrix of zeros
        rows = len(s)+1
        cols = len(t)+1
        distance = np.zeros((rows,cols),dtype = int)
    
        # Populate matrix of zeros with the indeces of each character of both strings
        for i in range(1, rows):
            for k in range(1,cols):
                distance[i][0] = i
                distance[0][k] = k
    
        # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
        for col in range(1, cols):
            for row in range(1, rows):
                if s[row-1] == t[col-1]:
                    cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
                else:
                    # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                    # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                    if ratio_calc == True:
                        cost = 2
                    else:
                        cost = 1
                distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,          # Cost of insertions
                                     distance[row-1][col-1] + cost)     # Cost of substitutions
        if ratio_calc == True:
            # Computation of the Levenshtein Distance Ratio
            Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
            return Ratio
        else:
            # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
            # insertions and/or substitutions
            # This is the minimum number of edits needed to convert string a to string b
            return "The strings are {} edits away".format(distance[row][col])

    If we apply this function to the earlier example where we where tryng to compare "Apple Inc." to "apple Inc" we will see that these two strinsg are very likley the same since the Levenshtein distance is very small.

    Str1 = "Apple Inc."
    Str2 = "apple Inc"
    Distance = levenshtein_ratio_and_distance(Str1,Str2)
    print(Distance)
    Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
    print(Ratio)

    As you can see, the function found the 2 differences between the two strings. These were the upper/lower case a and the full stop (period) at the end of the first string as well as a similarity ratio of 84%, which is pretty high. Before moving on further though, I would like to highlight an important notion. If you do very simple string preprocessing before calculating distance you will see the following: