I am writing this data series on a daily basis
In 1 year we will have 365 data stories to read ^^

Cosine !! What are you thinking about ?
Can you remember what you learnt in secondary school year 4 !?
So…. What is it !? Trigonometry !!
Sin(x), Cos(x), Tan(x)
Is that right !?
On day 4,
I talked about Minkowski Distance
Do you remember Euclidean distance
and how to calculate the height of a building
by using Trigonometry ?
Yes ! Today I am talking about a measure function
called Cosine Similarity
If you can remember
Euclidean distance and Trigonometry,
let’s go and see how Cosine Similarity works !>>>
If not, don’t worry about it
You can learn more from day 4
at the link below
Link: https://bigdatarpg.com/2021/01/09/day-04-minkowski-distance/
Cosine Similarity
Who has used it before ? Raise your hands ^^ !!
The easy way to explain it is:
it is a measure that compares the similarity of 2 vectors
by looking at the angle between them !
OK, that means it requires 2 vectors
Let’s look at an example
If there are 3 vectors
vector A, vector B, and vector C
(the result of cosine similarity is in the range [-1, 1]
when no vector has negative values it stays in [0, 1]
a negative result means the vectors point in opposite directions)
Check how similar they are
A and B = 0.5
A and C = 0.2
B and C = 0.7
Which pair is the most similar and
which pair is the least similar ?
OK, let’s see how to interpret it !!
If the result equals 1
that means the 2 vectors point in the same direction
the angle between them is 0 degrees
If the result equals 0
that means the 2 vectors are perpendicular
the angle between them is 90 degrees
If the result equals -1
that means the 2 vectors are on the same line
but the angle between them is 180 degrees
one vector points opposite to the other
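Here is a small sketch to check these three cases in Python. The function name cosine_similarity and the example vectors are my own picks for illustration, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors (not from the customer example)
same_direction = cosine_similarity([1, 2], [2, 4])    # 0 degrees apart
perpendicular = cosine_similarity([1, 0], [0, 3])     # 90 degrees apart
opposite = cosine_similarity([1, 2], [-1, -2])        # 180 degrees apart

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```

Note that the length of a vector does not matter here, only its direction: [1, 2] and [2, 4] count as identical.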
Back to the result and the question
Which pair is the most similar and
which pair is the least similar ?
So we get
B and C = 0.7 Very similar
A and B = 0.5 Similar but not much !
A and C = 0.2 Not very similar
Easy, right !?
Calculate Cosine Similarity
Let’s see an example
If we have 3 customers
customer A, B, C
and each customer has 3 features
Feature location
Feature education
Feature flag active customer
customer A = [Bangkok, Undergraduate, Y]
customer B = [Nonthaburi, Undergraduate, Y]
customer C = [Bangkok, Master, N]
We have to transform the data from text to numbers
If we set Bangkok = 0, Nonthaburi = 1
Undergraduate = 0, Master = 1
N = 0, Y = 1
Now we have
customer A = [0, 0, 1]
customer B = [1, 0, 1]
customer C = [0, 1, 0]
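The encoding step above can be sketched like this. The dictionary names and the encode helper are hypothetical, the mapping values are the ones chosen in the text.

```python
# Mappings chosen in the text (the 0/1 codes are arbitrary labels)
location = {"Bangkok": 0, "Nonthaburi": 1}
education = {"Undergraduate": 0, "Master": 1}
active = {"N": 0, "Y": 1}

customers = {
    "A": ["Bangkok", "Undergraduate", "Y"],
    "B": ["Nonthaburi", "Undergraduate", "Y"],
    "C": ["Bangkok", "Master", "N"],
}

def encode(row):
    """Turn [location, education, active flag] text into a numeric vector."""
    loc, edu, flag = row
    return [location[loc], education[edu], active[flag]]

vectors = {name: encode(row) for name, row in customers.items()}
print(vectors)
```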
Let’s look at the features
Think of each one as an axis: location as the x-axis
education as the y-axis
flag active customer as the z-axis
combined into 1 vector per customer
From Cosine Similarity
Sim(A,B) = Cos(angle) = (A dot B) / (||A|| * ||B||)
where
||A|| is the Euclidean norm
||A|| = sqrt(x1**2 + x2**2 + … + xn**2)
Let’s calculate the product of the Euclidean norms (the denominator)
||A|| * ||B|| = sqrt(0**2 + 0**2 + 1**2) * sqrt(1**2 + 0**2 + 1**2)
||A|| * ||C|| = sqrt(0**2 + 0**2 + 1**2) * sqrt(0**2 + 1**2 + 0**2)
||B|| * ||C|| = sqrt(1**2 + 0**2 + 1**2) * sqrt(0**2 + 1**2 + 0**2)
||A|| * ||B|| = 1.4142
||A|| * ||C|| = 1.0000
||B|| * ||C|| = 1.4142
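The three denominator values above are the product of the two Euclidean norms for each pair. A quick sketch to check them, where the helper name norm is my own:

```python
import math

def norm(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

A, B, C = [0, 0, 1], [1, 0, 1], [0, 1, 0]

norm_ab = norm(A) * norm(B)
norm_ac = norm(A) * norm(C)
norm_bc = norm(B) * norm(C)

print(round(norm_ab, 4), round(norm_ac, 4), round(norm_bc, 4))
```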
Let’s calculate the dot product
What is the dot product !???
The sum of the products along each axis
So we have
A dot B = (0*1) + (0*0) + (1*1)
A dot C = (0*0) + (0*1) + (1*0)
B dot C = (1*0) + (0*1) + (1*0)
A dot B = 1
A dot C = 0
B dot C = 0
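The same dot products as a short sketch. The helper name dot is my own, it multiplies the values axis by axis and sums them.

```python
def dot(u, v):
    """Sum of the products of matching components."""
    return sum(a * b for a, b in zip(u, v))

A, B, C = [0, 0, 1], [1, 0, 1], [0, 1, 0]

dot_ab = dot(A, B)
dot_ac = dot(A, C)
dot_bc = dot(B, C)

print(dot_ab, dot_ac, dot_bc)
```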
Combine them
Sim(A,B) = 1 / 1.4142
Sim(A,C) = 0 / 1.0000
Sim(B,C) = 0 / 1.4142
Sim(A,B) = 0.707
Sim(A,C) = 0.000
Sim(B,C) = 0.000
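Putting the pieces together, here is a minimal sketch that computes the similarity for each pair and also converts it back to an angle in degrees. The names are my own, and it assumes no vector is all zeros (that would divide by zero).

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

A, B, C = [0, 0, 1], [1, 0, 1], [0, 1, 0]
pairs = {"A,B": (A, B), "A,C": (A, C), "B,C": (B, C)}

results = {}
for name, (u, v) in pairs.items():
    sim = cosine_similarity(u, v)
    # Clamp to [-1, 1] so rounding noise can never break acos
    angle = math.degrees(math.acos(max(-1.0, min(1.0, sim))))
    results[name] = (round(sim, 3), round(angle, 1))
    print(f"Sim({name}) = {sim:.3f}, angle = {angle:.1f} degrees")
```

So 0.707 is a 45 degree angle, and 0.000 is a 90 degree angle.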
From the Cosine Similarity results
we found that
A and B = 0.707 Very similar
A and C = 0.000 It’s 90 degrees, completely different
B and C = 0.000 It’s 90 degrees, completely different
Um, the result is not very meaningful
because we transformed discrete data into numerical data
and represented each customer as a binary vector
In this case,
if our features are binary,
Cosine Similarity can be rewritten as
a simple variation of cosine similarity
named the Tanimoto coefficient or Tanimoto distance
that is frequently used in information retrieval and biological taxonomy
For the Tanimoto coefficient,
instead of using the product of the Euclidean norms
the denominator becomes (A dot A) + (B dot B) – (A dot B)
when we have binary vectors
So we have
Sim(A,B) = (A dot B) / ((A dot A) + (B dot B) – (A dot B))
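A small sketch of the Tanimoto formula on the same 3 customers. The function names dot and tanimoto are my own, and it assumes at least one of the two vectors has a 1 somewhere (otherwise the denominator is zero).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tanimoto(u, v):
    """(A dot B) / ((A dot A) + (B dot B) - (A dot B))."""
    ab = dot(u, v)
    return ab / (dot(u, u) + dot(v, v) - ab)

A, B, C = [0, 0, 1], [1, 0, 1], [0, 1, 0]

t_ab = tanimoto(A, B)
t_ac = tanimoto(A, C)
t_bc = tanimoto(B, C)

print(t_ab, t_ac, t_bc)
```

For binary vectors this is the number of features that are 1 in both vectors divided by the number of features that are 1 in at least one of them.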
Applications of Cosine Similarity
Example
– Clustering discrete data
– Check similarity of chemical molecules
– Clustering on continuous data
– Clustering customers
– Search documents by keywords
– Recommendation engine
– Check similarity of documents
– Search chatbot intent
– Check similarity of images
– Customer profiling
– Miscellaneous
Thank you my beloved fanpage
Please like share and comment
Made with Love by Boyd
This series is designed for everyone who is interested in data or works in the data field and is busy.
Content may swap between easy and hard.
Combined with Coding, Math, Data, Business, and Misc
– Do not hesitate to give me feedback
– If some content is wrong, I apologize in advance
– If you have experience with this content, please kindly share it with everyone ^^ ❤️
– Sorry for my poor grammar, I will practice more and more
– I am going to deliver more English content afterward
Follow me
Youtube: https://youtube.com/c/BigDataRPG
Fanpage: https://www.facebook.com/bigdatarpg/
Medium: https://www.medium.com/bigdataeng
Github: https://www.github.com/BigDataRPG
Kaggle: https://www.kaggle.com/boydbigdatarpg
Linkedin: https://www.linkedin.com/in/boyd-sorratat
Twitter: https://twitter.com/BoydSorratat
GoogleScholar: https://scholar.google.com/citations?user=9cIeYAgAAAAJ&hl=en