**I am writing the data series daily1 year we will have 365 data stories for reading ^^**

**How to compare string distance ??**

Imagine that you work in the NLP task

(NLP – Natural Language Processing)

And you want to compare how similar of 2 strings

What will you do !??

For example

If we want to compare 2 strings

- PrayutZa
- PrawitZa

Let’s see

To compare how similar!

It looks like you measure the distance between 2 strings.

The more the distance it means

2 string are far from together

The less the distance it means

2 string are close together

Is it easy !? Right ^^

So. .. the question is how to measure it !??

What we learned from the past !?

Hamming Distance!

They compare binary vector

Question: how to turn string to binary vector !?

PrayutZa

The first string >> binary vector !?

But the intuition of Hamming Distance is

How to compare each axis step by step

For example

X1 = [1, 0]

X2 = [1, 1]

We compare the first axis

1 and 1 are equal it means they are similar

The second axis !!

0 and 1 are different by 1 unit, so they are not similar !!!

Is it possible

That we can compare each string by Hamming distance !?

From example above

If we want to compare 2 strings

- PrayutZa
- PrawitZa

Let’s compare one by one

(If we do not care case sensitive)

P and P >> same character get 0 point

r and r >> same character get 0 point

a and a >> same character get 0 point

y and w >> different character get 1 point

u and i >> different character get 1 point

t and t >> same character get 0 point

Z and Z >> same character get 0 point

a and a >> same character get 0 point

With Hamming Distance

If we apply the rule from them

We will get how is likely equal

= 1 – (sum(point) / length of strings)

From calculation above

If we apply Hamming Distance

We will get

PrayutZa and PrawitZa

Similar score

= 1 – (2/8)

= 0.75 !!

Or 75 % Similar by Hamming Distance score !!

But what if 2 strings have different length

- PrayatINWZa007
- PrawitINWZaaIO

They are of different lengths!???

So…, How to measure how similar !?

** Note !!!

If you can remember Hamming Distance

Let’s go to see how Levenshtein distance work !>>>

If not, Don’t worry about it

You can learn more on day 1

I provide on the link below

Link: https://bigdatarpg.com/2021/01/07/day-01-hamming-distance/

Levenshtein Distance

Levenshtein Distance is invented by Russian and mathematician

named Vladimir Levenshtein, who considered this distance in 1965

Levenshtein Distance may also refer to Edit Distance

So…. What is Levenshtein Distance??

Levenshtein Distance consists of 3 components

- Insertion
- Substitution
- Deletion

Let’s see the simple example

If we have 2 strings to compare

- PrayutZa
- PrawitZa

Let’s compare each character one by one

(If we do not care case sensitive)

P and P >> same character get 0 point

r and r >> same character get 0 point

a and a >> same character get 0 point

y and w >> different character get 1 point (Substitution y and w)

u and i >> different character get 1 point (Substitution u and i)

t and t >> same character get 0 point

Z and Z >> same character get 0 point

a and a >> same character get 0 point

After that, we compute the score by

sum(point) / length of strings

= 2/8

= 0.25

Or 25 percents for a different score

If we need a similar score

We use 1 minus score

= 1 – 0.25

= 0.75

Or 75 % similar score calculated by Levenshtein Distance

Recap !!

We use the concept of Substitution 2 location

In order to correct each character

And that is a cost for the correctness ^^

The more cost for the correctness the far distance between 2 strings

Is it an easy concept of how to compare strings!?

Let’s see another example

- Prayut
- Payet

In this case, 2 strings are of different length

Prayut >> 6 unit

Payet >> 5 unit

Let’s compare one by one

But this case, we have 2 scenarios

Scenario 1

Step 1

P and P >> same character get 0 point

Step 2

r and a !! >> different character get 1 point (Substitution r and a)

BUT !! WAIT !!!

If we check on step 3 of Prayut >> a

So ! this step we use the property of Insertion

We have

Prayut

P_ayet

If we insert “*” character between P and a Instep 3 we will compare a on Prayut And a on P_ayet With apply property Step 2 r and* !! >> different character get 1 point (Insertion)

Now step 3

a and a >> same character get 0 point

step 4

y and y >>same character get 0 point

step 5

u and e >> different character get 1 point (Substitution y and w)

step 6

t and t >> same character get 0 point

After that, we compute the score by

sum(point) / length of strings

= 2/8

= 0.25

Or 25 percents for a different score

If we need a similar score

We use 1 minus score

= 1 – 0.25

= 0.75

Or 75 % similar score calculated by Levenshtein Distance

With Levenshtein Distance

It is very useful in NLP applications ^^

Exercise

Let’s calculate Levenshtein Distance if

- PrayatINWZa007
- PrawitINWZaaIO

** Remark **

Score of Levenshtein Distance

If there are many points of correctness

Levenshtein Distance care the only minimum of points of correctness

For example

If we compare strings and

There are 3 ways for correct strings

Way 1 cost 3 point

Way 2 cost 2 point

Way 3 cost 4 point

We pick 2 point for calculation ^^

### Applications on Levenshtein Distance

Example

- String matching
- Spell checking
- Search documents by keywords
- Check similarity of documents
- Search chatbot intent
- Customer profiling
- Miscellaneous

Thank you my beloved fanpage

Please like share and comment

### Made with Love ❤️ by Boyd

This series is designed for everyone interested in data

or work in a data field but is busy.

Content may have a swap between easy and hard.

Combined with Coding, Math, Data, Business, and Misc

- Do not hesitate to feedback to me 🌟
- If some content wrong I have to say apologize in advance 🙇♂
- If you have experiences in this content, please kindly share it with everyone ^^❤️
- Sorry for my poor grammar, I will practice more and more 😅
- I am going to deliver more English content afterward 😀

**Follow me 🙂**

Youtube: https://youtube.com/c/BigDataRPG

Fanpage: https://www.facebook.com/bigdatarpg/

Medium: https://www.medium.com/bigdataeng

Github: https://www.github.com/BigDataRPG

Kaggle: https://www.kaggle.com/boydbigdatarpg

Linkedin: https://www.linkedin.com/in/boyd-sorratat

Twitter: https://twitter.com/BoydSorratat

GoogleScholar: https://scholar.google.com/citations?user=9cIeYAgAAAAJ&hl=en