Data Science

Day 06 – Levenshtein Distance

I am writing the data series daily
1 year we will have 365 data stories for reading ^^

How to compare string distance ??

Imagine that you work in the NLP task
(NLP – Natural Language Processing)
And you want to compare how similar of 2 strings
What will you do !??

For example
If we want to compare 2 strings

  1. PrayutZa
  2. PrawitZa

Let’s see
To compare how similar!
It looks like you measure the distance between 2 strings.
The more the distance it means
2 string are far from together
The less the distance it means
2 string are close together

Is it easy !? Right ^^

So. .. the question is how to measure it !??

What we learned from the past !?
Hamming Distance!
They compare binary vector
Question: how to turn string to binary vector !?

PrayutZa
The first string >> binary vector !?

But the intuition of Hamming Distance is
How to compare each axis step by step
For example
X1 = [1, 0]
X2 = [1, 1]
We compare the first axis
1 and 1 are equal it means they are similar

The second axis !!
0 and 1 are different by 1 unit, so they are not similar !!!

Is it possible
That we can compare each string by Hamming distance !?
From example above
If we want to compare 2 strings

  1. PrayutZa
  2. PrawitZa

Let’s compare one by one
(If we do not care case sensitive)
P and P >> same character get 0 point
r and r >> same character get 0 point
a and a >> same character get 0 point
y and w >> different character get 1 point
u and i >> different character get 1 point
t and t >> same character get 0 point
Z and Z >> same character get 0 point
a and a >> same character get 0 point

With Hamming Distance
If we apply the rule from them
We will get how is likely equal
= 1 – (sum(point) / length of strings)

From calculation above
If we apply Hamming Distance
We will get
PrayutZa and PrawitZa
Similar score
= 1 – (2/8)
= 0.75 !!
Or 75 % Similar by Hamming Distance score !!

But what if 2 strings have different length

  1. PrayatINWZa007
  2. PrawitINWZaaIO

They are of different lengths!???
So…, How to measure how similar !?

** Note !!!
If you can remember Hamming Distance
Let’s go to see how Levenshtein distance work !>>>

If not, Don’t worry about it
You can learn more on day 1
I provide on the link below
Link: https://bigdatarpg.com/2021/01/07/day-01-hamming-distance/


Levenshtein Distance

Levenshtein Distance is invented by Russian and mathematician
named Vladimir Levenshtein, who considered this distance in 1965
Levenshtein Distance may also refer to Edit Distance
So…. What is Levenshtein Distance??

Levenshtein Distance consists of 3 components

  1. Insertion
  2. Substitution
  3. Deletion

Let’s see the simple example
If we have 2 strings to compare

  1. PrayutZa
  2. PrawitZa

Let’s compare each character one by one
(If we do not care case sensitive)
P and P >> same character get 0 point
r and r >> same character get 0 point
a and a >> same character get 0 point
y and w >> different character get 1 point (Substitution y and w)
u and i >> different character get 1 point (Substitution u and i)
t and t >> same character get 0 point
Z and Z >> same character get 0 point
a and a >> same character get 0 point

After that, we compute the score by
sum(point) / length of strings
= 2/8
= 0.25
Or 25 percents for a different score
If we need a similar score
We use 1 minus score
= 1 – 0.25
= 0.75
Or 75 % similar score calculated by Levenshtein Distance

Recap !!
We use the concept of Substitution 2 location
In order to correct each character
And that is a cost for the correctness ^^

The more cost for the correctness the far distance between 2 strings
Is it an easy concept of how to compare strings!?

Let’s see another example

  1. Prayut
  2. Payet

In this case, 2 strings are of different length
Prayut >> 6 unit
Payet >> 5 unit

Let’s compare one by one
But this case, we have 2 scenarios
Scenario 1
Step 1
P and P >> same character get 0 point
Step 2
r and a !! >> different character get 1 point (Substitution r and a)
BUT !! WAIT !!!
If we check on step 3 of Prayut >> a
So ! this step we use the property of Insertion
We have
Prayut
P_ayet

If we insert “” character between P and a Instep 3 we will compare a on Prayut And a on P_ayet With apply property Step 2 r and !! >> different character get 1 point (Insertion)

Now step 3
a and a >> same character get 0 point
step 4
y and y >>same character get 0 point
step 5
u and e >> different character get 1 point (Substitution y and w)

step 6
t and t >> same character get 0 point

After that, we compute the score by
sum(point) / length of strings
= 2/8
= 0.25
Or 25 percents for a different score
If we need a similar score
We use 1 minus score
= 1 – 0.25
= 0.75
Or 75 % similar score calculated by Levenshtein Distance

With Levenshtein Distance
It is very useful in NLP applications ^^

Exercise
Let’s calculate Levenshtein Distance if

  1. PrayatINWZa007
  2. PrawitINWZaaIO

** Remark **
Score of Levenshtein Distance
If there are many points of correctness
Levenshtein Distance care the only minimum of points of correctness
For example
If we compare strings and
There are 3 ways for correct strings
Way 1 cost 3 point
Way 2 cost 2 point
Way 3 cost 4 point
We pick 2 point for calculation ^^

Applications on Levenshtein Distance

Example

  • String matching
  • Spell checking
  • Search documents by keywords
  • Check similarity of documents
  • Search chatbot intent
  • Customer profiling
  • Miscellaneous


Thank you my beloved fanpage
Please like share and comment

Made with Love ❤️ by Boyd

This series is designed for everyone interested in data
or work in a data field but is busy.
Content may have a swap between easy and hard.
Combined with Coding, Math, Data, Business, and Misc

  • Do not hesitate to feedback to me 🌟
  • If some content wrong I have to say apologize in advance 🙇‍♂‍
  • If you have experiences in this content, please kindly share it with everyone ^^❤️
  • Sorry for my poor grammar, I will practice more and more 😅
  • I am going to deliver more English content afterward 😀

Follow me 🙂
Youtube: https://youtube.com/c/BigDataRPG
Fanpage: https://www.facebook.com/bigdatarpg/
Medium: https://www.medium.com/bigdataeng
Github: https://www.github.com/BigDataRPG
Kaggle: https://www.kaggle.com/boydbigdatarpg
Linkedin: https://www.linkedin.com/in/boyd-sorratat
Twitter: https://twitter.com/BoydSorratat
GoogleScholar: https://scholar.google.com/citations?user=9cIeYAgAAAAJ&hl=en

#BigDataRPG #DataScienceDaily #LevenshteinDistance

ใส่ความเห็น

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  เปลี่ยนแปลง )

Google photo

You are commenting using your Google account. Log Out /  เปลี่ยนแปลง )

Twitter picture

You are commenting using your Twitter account. Log Out /  เปลี่ยนแปลง )

Facebook photo

You are commenting using your Facebook account. Log Out /  เปลี่ยนแปลง )

Connecting to %s

%d bloggers like this: