Implementing text similarity algorithms in Python

Introduction:

Implementing text similarity algorithms? A beginner or intermediate programmer might say, “that will be hard”. Well, don’t worry, I’ve got you covered. Have you ever wondered how you search for something on Google and results with the exact words, or similar words, appear in the search results? That’s simply the work of text similarity algorithms. These algorithms use different methods to determine the similarity between texts/documents.

Getting started:

In this tutorial we will implement some text similarity algorithms in Python. I’ve chosen 3 algorithms to use as examples in this tutorial.

We will take these algorithms one after the other.

Jaccard similarity index:

This is the simplest of the three to implement. Also known as intersection over union, this algorithm uses the set union and intersection principles to find the similarity between two sentences. Take for example:

A = “hello world I am alive”

B = “hello world I am coding”

Merely looking at the two sentences, we can see they are almost identical except for the difference in the last words, “alive” and “coding“. What the Jaccard similarity index algorithm does is take the two sentences into consideration: first it finds where the two sentences intersect (the words they have in common), and secondly where they unite (all the distinct words in both). From our example sentences above we can see the intersection and union of the sentences:

Intersection = “hello world I am”

Union = “hello world I am alive coding”
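
To make this concrete, here is a quick sketch of those set operations in Python (lowercasing both sentences first so “I” and “i” count as the same word):

a = "hello world I am alive".lower()
b = "hello world I am coding".lower()
set_a = set(a.split())
set_b = set(b.split())
print(set_a & set_b)  # intersection: {'hello', 'world', 'i', 'am'}
print(set_a | set_b)  # union: the intersection plus 'alive' and 'coding'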

Here is the algorithm’s formula:

J(A, B) = |A ∩ B| / |A ∪ B|

The similarity index is obtained by dividing the size of the intersection by the size of the union.

Implementing it in Python:

We can implement the above algorithm in Python. We do not require any module to do this; though there are modules available for it, it’s good to get your hands busy once in a while.

Let’s assign our test sentences.

a = "hello world I am a programmer"
b = "hello world I am not a programmer"

We need to split the sentences into lists of words, then convert them into sets using Python’s built-in set(iterable) function.

c = set(a.split())
d = set(b.split())

Once we have our sentences converted to sets, we can now start performing set operations.

X = float(len(c&d)) #intersection
Y = float(len(c|d)) #union

We are almost done; let’s calculate the similarity index of the two sentences.

try:
    similarity_ratio = round((X / Y) * 100, 2)  # the similarity index as a percentage
except ZeroDivisionError:
    similarity_ratio = 0

Putting it together:

def Jaccard_similarity_index(x, y):
    """Jaccard similarity of two sentences, as a percentage."""
    x, y = x.lower(), y.lower()
    a = set(x.split())
    b = set(y.split())
    intersection = float(len(a & b))
    union = float(len(a | b))
    try:
        similarity_ratio = round(intersection / union * 100, 2)
    except ZeroDivisionError:  # both sentences are empty
        similarity_ratio = 0
    return similarity_ratio
print(Jaccard_similarity_index(a,b))
Result: 85.71

The similarity of text A to text B according to the Jaccard similarity index is 85.71%.

Cosine similarity index:

From Wikipedia: “Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.” For example, given two texts:

A = “hello world I can code”
B = “hello world I can’t code”

The two texts are almost the same, with the ’t as the only difference. Now how can we use cosine similarity to find the difference/similarity between the two? We used a similar algorithm to build a movie recommender. To find out more about cosine similarity, visit Wikipedia.

First it’s good to note a few points before we move forward. From maths we know that the cosine of the angle between two vectors is given by:

cos(θ) = (A · B) / (‖A‖ × ‖B‖)

which is the dot product of the two vectors divided by the product of their magnitudes. “For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents” – Wikipedia. In our case, the documents are the sentences.
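
To see what those vectors look like, here is a small sketch using sentences A and B (binary presence vectors rather than raw counts, which is also what the code below builds):

words_a = "hello world i can code".split()
words_b = "hello world i can't code".split()
vocabulary = sorted(set(words_a) | set(words_b))
# vocabulary: ['can', "can't", 'code', 'hello', 'i', 'world']
vector_a = [1 if word in words_a else 0 for word in vocabulary]  # [1, 0, 1, 1, 1, 1]
vector_b = [1 if word in words_b else 0 for word in vocabulary]  # [0, 1, 1, 1, 1, 1]

The cosine of these two vectors is 4 / (√5 × √5) = 0.8, so the sentences are 80% similar by this measure.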

Implementing it in Python:

First let’s break the formula down:

  • We need the dot product of the two vectors.
  • We also need their magnitudes, which come from the dot product of each vector with itself.

Well, enough talk, let’s get to it. First we write a function for the dot product, then use it to write the cosine similarity function:

import math

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    """Dot product of v and w divided by the product of their magnitudes."""
    try:
        return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))
    except ZeroDivisionError:  # at least one vector is all zeros
        return 0
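
A quick sanity check with two toy vectors:

print(dot([1, 1, 0], [1, 1, 1]))                # 2
print(cosine_similarity([1, 1, 0], [1, 1, 1]))  # 2 / (√2 × √3) ≈ 0.8165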

Since we are dealing with text, we need to convert our texts into vectors filled with 1s and 0s. The code is commented to show the workings.

common_words = None

def string_vector(string_list):
    """Given a list of words, produce a vector whose i-th element is 1 if common_words[i] is in the list, 0 otherwise."""
    return [1 if word in string_list else 0 for word in common_words]

The main function brings it all together; it stores the shared vocabulary in the global common_words so that string_vector can see it:

def cosine_similarity_index(text1, text2):
    global common_words
    space = [text1.lower().split(), text2.lower().split()]
    # the sorted vocabulary of all distinct words in both sentences
    common_words = sorted({word for sentence in space for word in sentence})
    # convert each sentence into a binary vector over that vocabulary
    string_matrix = list(map(string_vector, space))
    # cosine_similarity already returns 0 for zero vectors, so no extra guard is needed
    return round(cosine_similarity(string_matrix[0], string_matrix[1]) * 100, 1)

When we run it on the example sentences A and B from above, the result is:

print(cosine_similarity_index("hello world I can code", "hello world I can't code"))
Result: 80.0

Levenshtein Distance:

The Levenshtein distance, also known as edit distance, is one of the most popular algorithms used to determine how different one word is from another. Take for example the words walk and walking: the Levenshtein distance tells us how different these words are from each other by counting the number of insertions, deletions or substitutions needed to transform one into the other. We humans already know that walking differs from walk only by the three characters -ing: delete them to get from walking to walk (deletions), or insert them at the end to get from walk to walking (insertions). With the help of an algorithm like Levenshtein distance, a computer can know the difference too. The mathematical formula is given by:

lev(i, j) = max(i, j)                       if min(i, j) == 0
lev(i, j) = min( lev(i - 1, j) + 1,
                 lev(i, j - 1) + 1,
                 lev(i - 1, j - 1) + cost ) otherwise

where cost is 0 if word1[i] == word2[j] and 1 otherwise; lev(len(word1), len(word2)) is the distance between the two words.
To read about this algorithm in more detail, please refer to Wikipedia.

Now refer to the worked matrix in the image below to better understand how it works (image source: https://github.com/trekhleb/javascript-algorithms/issues/311).

This is essentially how the smart auto-correct features in our text editors work.
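
Before writing our own implementation, we can peek at the edit operations with difflib from Python’s standard library; this is just an optional illustration of the walk/walking example, not part of the algorithm we are about to build:

import difflib

matcher = difflib.SequenceMatcher(None, "walk", "walking")
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, "walk"[i1:i2], "->", "walking"[j1:j2])
# prints: equal walk -> walk
#         insert  -> ing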

Implementing it in Python:

First we need to create a matrix of dimensions (length of word1 + 1) by (length of word2 + 1).

word2 = word2.lower()
word1 = word1.lower()
matrix = [[0 for x in range(len(word2) + 1)] for x in range(len(word1) + 1)]

Next we number the first column and the first row.

for x in range(len(word1) + 1):
    matrix[x][0] = x
for y in range(len(word2) + 1):
    matrix[0][y] = y

Then we traverse the matrix, at each cell taking the cheapest of a deletion, an insertion or a substitution (substitution costs nothing when the characters already match).

for x in range(1, len(word1) + 1):
    for y in range(1, len(word2) + 1):
        if word1[x - 1] == word2[y - 1]:
            matrix[x][y] = min(matrix[x - 1][y] + 1, matrix[x - 1][y - 1], matrix[x][y - 1] + 1)
        else:
            matrix[x][y] = min(matrix[x - 1][y] + 1, matrix[x - 1][y - 1] + 1, matrix[x][y - 1] + 1)

The Levenshtein distance ends up in the last column of the last row of the matrix.

levenshtein_distance = matrix[len(word1)][len(word2)]
print(levenshtein_distance)

Putting it together:

def levenshtein_distance(word1, word2):
    """Levenshtein distance algorithm in Python."""
    word2 = word2.lower()
    word1 = word1.lower()
    # (len(word1) + 1) x (len(word2) + 1) matrix of zeros
    matrix = [[0 for x in range(len(word2) + 1)] for x in range(len(word1) + 1)]
    for x in range(len(word1) + 1):
        matrix[x][0] = x
    for y in range(len(word2) + 1):
        matrix[0][y] = y
    for x in range(1, len(word1) + 1):
        for y in range(1, len(word2) + 1):
            if word1[x - 1] == word2[y - 1]:
                matrix[x][y] = min(matrix[x - 1][y] + 1, matrix[x - 1][y - 1], matrix[x][y - 1] + 1)
            else:
                matrix[x][y] = min(matrix[x - 1][y] + 1, matrix[x - 1][y - 1] + 1, matrix[x][y - 1] + 1)
    return matrix[len(word1)][len(word2)]

a = "hello world I am a programmer"
b = "hello world I am not a programmer"
print(levenshtein_distance(a, b))
Result: 4
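
As a quick sanity check, the walk/walking example from the introduction gives the three insertions we expected:

print(levenshtein_distance("walk", "walking"))
Result: 3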

That’s it, we are done.

Comparing the 3 algorithms:

a = "hello world I am a programmer"
b = "hello world I am not a programmer"

print("levenshtein_distance :",levenshtein_distance(a,b))
print("cosine_similarity_index :",cosine_similarity_index(a,b))
print("euclidean_similarity :",euclidean_similarity(a,b))

Result:
levenshtein_distance : 4
cosine_similarity_index : 92.6
euclidean_similarity : 85.7

From the comparison it can be seen that cosine similarity algorithm tend to be more accurate than the euclidean similarity index but that doesn’t hold true always.
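
Keep in mind that the Levenshtein result is a distance (lower means more similar) while the other two are percentages. If you want all three on one scale, one simple option among several is to normalize the distance by the length of the longer string; the helper below is my own addition, not part of the three algorithms above:

def levenshtein_similarity(x, y):
    """Turn the edit distance into a rough similarity percentage."""
    longest = max(len(x), len(y))
    if longest == 0:  # two empty strings are identical
        return 100.0
    return round((1 - levenshtein_distance(x, y) / longest) * 100, 2)

print(levenshtein_similarity(a, b))  # about 87.88 for our example sentences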

Implementing these text similarity algorithms isn’t that hard. Feel free to carry out your own research, and feel free to use the comment section; I will get back to you ASAP.

You can see the full code at my GitHub repo.
