
I am writing the data series daily
1 year we will have 365 data stories for reading ^^
How to compare string distance ??
Imagine that you work in the NLP task
(NLP – Natural Language Processing)
And you want to compare how similar of 2 strings
What will you do !??
For example
If we want to compare 2 strings
- PrayutZa
- PrawitZa
Let’s see
To compare how similar!
It looks like you measure the distance between 2 strings.
The more the distance it means
2 string are far from together
The less the distance it means
2 string are close together
Is it easy !? Right ^^
So. .. the question is how to measure it !??
What we learned from the past !?
Hamming Distance!
They compare binary vector
Question: how to turn string to binary vector !?
PrayutZa
The first string >> binary vector !?
But the intuition of Hamming Distance is
How to compare each axis step by step
For example
X1 = [1, 0]
X2 = [1, 1]
We compare the first axis
1 and 1 are equal it means they are similar
The second axis !!
0 and 1 are different by 1 unit, so they are not similar !!!
Is it possible
That we can compare each string by Hamming distance !?
From example above
If we want to compare 2 strings
- PrayutZa
- PrawitZa
Let’s compare one by one
(If we do not care case sensitive)
P and P >> same character get 0 point
r and r >> same character get 0 point
a and a >> same character get 0 point
y and w >> different character get 1 point
u and i >> different character get 1 point
t and t >> same character get 0 point
Z and Z >> same character get 0 point
a and a >> same character get 0 point
With Hamming Distance
If we apply the rule from them
We will get how is likely equal
= 1 – (sum(point) / length of strings)
From calculation above
If we apply Hamming Distance
We will get
PrayutZa and PrawitZa
Similar score
= 1 – (2/8)
= 0.75 !!
Or 75 % Similar by Hamming Distance score !!
But what if 2 strings have different length
- PrayatINWZa007
- PrawitINWZaaIO
They are of different lengths!???
So…, How to measure how similar !?
** Note !!!
If you can remember Hamming Distance
Let’s go to see how Levenshtein distance work !>>>
If not, Don’t worry about it
You can learn more on day 1
I provide on the link below
Link: https://bigdatarpg.com/2021/01/07/day-01-hamming-distance/
Levenshtein Distance
Levenshtein Distance is invented by Russian and mathematician
named Vladimir Levenshtein, who considered this distance in 1965
Levenshtein Distance may also refer to Edit Distance
So…. What is Levenshtein Distance??
Levenshtein Distance consists of 3 components
- Insertion
- Substitution
- Deletion
Let’s see the simple example
If we have 2 strings to compare
- PrayutZa
- PrawitZa
Let’s compare each character one by one
(If we do not care case sensitive)
P and P >> same character get 0 point
r and r >> same character get 0 point
a and a >> same character get 0 point
y and w >> different character get 1 point (Substitution y and w)
u and i >> different character get 1 point (Substitution u and i)
t and t >> same character get 0 point
Z and Z >> same character get 0 point
a and a >> same character get 0 point
After that, we compute the score by
sum(point) / length of strings
= 2/8
= 0.25
Or 25 percents for a different score
If we need a similar score
We use 1 minus score
= 1 – 0.25
= 0.75
Or 75 % similar score calculated by Levenshtein Distance
Recap !!
We use the concept of Substitution 2 location
In order to correct each character
And that is a cost for the correctness ^^
The more cost for the correctness the far distance between 2 strings
Is it an easy concept of how to compare strings!?
Let’s see another example
- Prayut
- Payet
In this case, 2 strings are of different length
Prayut >> 6 unit
Payet >> 5 unit
Let’s compare one by one
But this case, we have 2 scenarios
Scenario 1
Step 1
P and P >> same character get 0 point
Step 2
r and a !! >> different character get 1 point (Substitution r and a)
BUT !! WAIT !!!
If we check on step 3 of Prayut >> a
So ! this step we use the property of Insertion
We have
Prayut
P_ayet
If we insert “” character between P and a Instep 3 we will compare a on Prayut And a on P_ayet With apply property Step 2 r and !! >> different character get 1 point (Insertion)
Now step 3
a and a >> same character get 0 point
step 4
y and y >>same character get 0 point
step 5
u and e >> different character get 1 point (Substitution y and w)
step 6
t and t >> same character get 0 point
After that, we compute the score by
sum(point) / length of strings
= 2/8
= 0.25
Or 25 percents for a different score
If we need a similar score
We use 1 minus score
= 1 – 0.25
= 0.75
Or 75 % similar score calculated by Levenshtein Distance
With Levenshtein Distance
It is very useful in NLP applications ^^
Exercise
Let’s calculate Levenshtein Distance if
- PrayatINWZa007
- PrawitINWZaaIO
** Remark **
Score of Levenshtein Distance
If there are many points of correctness
Levenshtein Distance care the only minimum of points of correctness
For example
If we compare strings and
There are 3 ways for correct strings
Way 1 cost 3 point
Way 2 cost 2 point
Way 3 cost 4 point
We pick 2 point for calculation ^^
Applications on Levenshtein Distance
Example
- String matching
- Spell checking
- Search documents by keywords
- Check similarity of documents
- Search chatbot intent
- Customer profiling
- Miscellaneous
Thank you my beloved fanpage
Please like share and comment
Made with Love ❤️ by Boyd
This series is designed for everyone interested in data
or work in a data field but is busy.
Content may have a swap between easy and hard.
Combined with Coding, Math, Data, Business, and Misc
- Do not hesitate to feedback to me 🌟
- If some content wrong I have to say apologize in advance 🙇♂
- If you have experiences in this content, please kindly share it with everyone ^^❤️
- Sorry for my poor grammar, I will practice more and more 😅
- I am going to deliver more English content afterward 😀
Follow me 🙂
Youtube: https://youtube.com/c/BigDataRPG
Fanpage: https://www.facebook.com/bigdatarpg/
Medium: https://www.medium.com/bigdataeng
Github: https://www.github.com/BigDataRPG
Kaggle: https://www.kaggle.com/boydbigdatarpg
Linkedin: https://www.linkedin.com/in/boyd-sorratat
Twitter: https://twitter.com/BoydSorratat
GoogleScholar: https://scholar.google.com/citations?user=9cIeYAgAAAAJ&hl=en