Cosine similarity is a measure that calculates the cosine of the angle between two n-dimensional vectors in an n-dimensional space. Mathematically, it is the dot product of two non-zero vectors divided by the product of their magnitudes. Cosine similarity is widely used for computing how similar two things are. It can power a movie recommendation application that suggests movies to a user based on preferences and previous viewing history, or a company chatbot that responds to the most frequently asked questions about the company. In this article, we will discuss the dot product (the backbone of cosine similarity) and how to use cosine similarity to answer questions.
The dot product
The dot product of two n-dimensional vectors a and b is the sum of the products of their corresponding components: a · b = a₁b₁ + a₂b₂ + … + aₙbₙ. One thing you will notice is that the dot product of two vectors is a real number, not a vector. For some pairs of vectors, the dot product is zero. What does it mean to have a zero dot product? To answer this question, it is helpful to define the dot product geometrically: a · b = ‖a‖ ‖b‖ cos θ, where θ is the angle between the two vectors.
This answers the question above: since cos 90° = 0, the dot product is 0 when the first vector is orthogonal (perpendicular) to the second vector.
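As a quick illustration (this NumPy snippet is my own addition, not part of the original listing), here is the dot product of two example vectors and of an orthogonal pair:

import numpy as np

a = np.array([2, 3])
b = np.array([4, 1])
print(np.dot(a, b))   # 2*4 + 3*1 = 11, a real number rather than a vector

c = np.array([1, 0])  # these two vectors are perpendicular to each other
d = np.array([0, 5])
print(np.dot(c, d))   # 0, because cos 90 degrees = 0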
Cosine similarity
The cosine similarity, as explained already, is the dot product of the two non-zero vectors divided by the product of their magnitudes. We can find the cosine similarity equation by solving the dot product equation for cos θ: cos θ = (a · b) / (‖a‖ ‖b‖).
If two documents are entirely similar, they will have a cosine similarity of 1. On the other hand, when the cosine similarity is -1, the documents are perfectly dissimilar (their vectors point in opposite directions). With that said, let us now dive into practice.
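To build intuition, here is a small illustrative sketch (the vectors are chosen purely for demonstration) showing the three extremes of the measure:

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 2, 3])
print(cos_sim(a, a))                      # 1.0  -> same direction, entirely similar
print(cos_sim(a, -a))                     # -1.0 -> opposite direction, perfectly dissimilar
print(cos_sim(a, np.array([3, 0, -1])))   # 0.0  -> orthogonal, nothing in common

Note that TF-IDF vectors have no negative components, so the similarities we compute later in this article will always fall between 0 and 1.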
Practice
NB: I’m using Python 3.7 and scikit-learn 0.19.2.
We need to define our training question and answer documents, where each question has its corresponding answer in the answers document. Note that a question and its answer share the same index: if a question is at index 1 in the questions document, its answer is at index 1 in the answers document.
questions = [
    'How many regions are in Ghana?',
    'What is the favorite food for people in the Ashanti region of Ghana?',
    'What is the name of the king of the Asantes?',
    'What cash crop does Ghana export?',
    'What is the primary occupation in Ghana?',
    'Which country is the leading producer of cocoa in Africa?',
    'Who is the minister of Food and Agriculture in Ghana?',
    'What is crop rotation?',
    'What is a cash crop?',
    'What is arable farming?',
    'What is the dominant native language in Ghana?',
    'What is the current population of Ghana?',
    'What is the capital city of Ghana?'
]
answers = [
    'Ten',
    'Fufu',
    'Otumfuo Osei Tutu I',
    'Cocoa',
    'Farming',
    "Cote D'Ivoire",
    'Dr. Owusu Afriyie Akoto',
    'The practice of growing a series of different types of crops in the same area in sequenced seasons',
    'An agricultural crop grown for sale to return profit.',
    'A kind of farming in which the land is ploughed and used to grow crops.',
    'Twi',
    '28.8 million',
    'Accra'
]
The next thing is to use scikit-learn's TfidfVectorizer to transform all the questions into vectors. So, let's import and instantiate the vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit(questions)              # learn the vocabulary from the training questions
array = X.transform(questions).toarray()   # one TF-IDF vector per training question
print(array[0])

[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]

(Depending on the scikit-learn version and settings, the nonzero entries may print as fractional TF-IDF weights rather than 1s; what matters is that each question is now a fixed-length numeric vector.)
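To see which vocabulary term each position in the vector corresponds to, the fitted vectorizer exposes its learned vocabulary. In scikit-learn 0.19.x the method is get_feature_names(); newer releases rename it to get_feature_names_out(). A quick sketch:

terms = vectorizer.get_feature_names()   # alphabetically sorted vocabulary (scikit-learn 0.19.x API)
print(len(terms))                        # dimensionality of every question vector
print(terms[:10])                        # first few terms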
Since we have our documents modeled as vectors (with TF-IDF weights), we can now write a function to compute the cosine similarity between any two given vectors.
import numpy as np

def cosine_similarity(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product."""
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
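As an aside, scikit-learn already ships an equivalent, vectorized implementation in sklearn.metrics.pairwise.cosine_similarity, which works on whole matrices at once. A minimal sketch (imported under an alias so it does not shadow our own function):

from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

sim_matrix = sk_cosine_similarity(array)   # pairwise similarities of all training questions
print(sim_matrix.shape)                    # (13, 13)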
When a user asks a question, we will transform it into a vector of the same length as the training question vectors.
test_question = [
    'Briefly explain crop rotation'
]
test_vector = X.transform(test_question).toarray()
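One caveat worth noting (this guard is my own addition, not part of the original code): transform() silently ignores any word that was not seen during fitting, so a question with no vocabulary overlap produces an all-zero vector, and dividing by its norm in cosine_similarity would give an invalid value. A simple guard could look like this:

import numpy as np

# Fall back to a default reply when the question shares no vocabulary with the training questions
if not np.any(test_vector[0]):
    print('Sorry, I do not understand the question.')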
Now, we will find the cosine similarity between the test question (the test vector) and each training question (each training vector). We will then print the answer of the most similar training question as the response to the question asked.
response = ''
most_sim = 0
for i in range(len(questions)):
    sim = cosine_similarity(array[i], test_vector[0])
    if sim > most_sim:
        most_sim = sim
        answer_index = i                  # index of the current most similar question
        response = answers[answer_index]  # answer of the most similar question
print(response)
The outcome:
The practice of growing a series of different types of crops in the same area in sequenced seasons
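To reuse these steps, the whole lookup can be wrapped in a single helper function. The name answer_question and the use of np.argmax are my own additions; the logic is the same as the loop above:

def answer_question(question):
    """Return the stored answer whose training question is most similar to `question`."""
    vec = X.transform([question]).toarray()[0]
    sims = [cosine_similarity(array[i], vec) for i in range(len(questions))]
    best = int(np.argmax(sims))
    return answers[best]

print(answer_question('What is the capital city of Ghana?'))   # exact training question, so this prints 'Accra'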