python – Find the similarity metric between two strings

The Question :

317 people think this question is useful

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably with standard Python and library.

e.g.

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

The Question Comments :
  • I don’t think “probability” is quite the right term here. In any event, see stackoverflow.com/questions/682367/…
  • The word you are looking for is ratio, not probability.
  • Take a look at Hamming distance.
  • The phrase is ‘similarity metric’, but there are multiple similarity metrics (Jaccard, Cosine, Hamming, Levenshein etc.) said so you need to specify which. Specifically you want a similarity metric between strings; @hbprotoss listed several.
  • @NPE the link is broken

The Answer 1

608 people think this answer is useful

There is a built in.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

The Answer 2

72 people think this answer is useful

I think maybe you are looking for an algorithm describing the distance between strings. Here are some you may refer to:

  1. Hamming distance
  2. Levenshtein distance
  3. Damerau–Levenshtein distance
  4. Jaro–Winkler distance

The Answer 3

53 people think this answer is useful

Solution #1: Python builtin

use SequenceMatcher from difflib

pros: native python library, no need extra package.
cons: too limited, there are so many other good algorithms for string similarity out there.

example :
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

Solution #2: jellyfish library

its a very good library with good coverage and few issues. it supports:
– Levenshtein Distance
– Damerau-Levenshtein Distance
– Jaro Distance
– Jaro-Winkler Distance
– Match Rating Approach Comparison
– Hamming Distance

pros: easy to use, gamut of supported algorithms, tested.
cons: not native library.

example:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1

The Answer 4

29 people think this answer is useful

Fuzzy Wuzzy is a package that implements Levenshtein distance in python, with some helper functions to help in certain situations where you may want two distinct strings to be considered identical. For example:

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100

The Answer 5

13 people think this answer is useful

You can create a function like:

def similar(w1, w2):
    w1 = w1 + ' ' * (len(w2) - len(w1))
    w2 = w2 + ' ' * (len(w1) - len(w2))
    return sum(1 if i == j else 0 for i, j in zip(w1, w2)) / float(len(w1))

The Answer 6

9 people think this answer is useful

Package distance includes Levenshtein distance:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

The Answer 7

7 people think this answer is useful

Note, difflib.SequenceMatcher only finds the longest contiguous matching subsequence, this is often not what is desired, for example:

>>> a1 = "Apple"
>>> a2 = "Appel"
>>> a1 *= 50
>>> a2 *= 50
>>> SequenceMatcher(None, a1, a2).ratio()
0.012  # very low
>>> SequenceMatcher(None, a1, a2).get_matching_blocks()
[Match(a=0, b=0, size=3), Match(a=250, b=250, size=0)]  # only the first block is recorded

Finding the similarity between two strings is closely related to the concept of pairwise sequence alignment in bioinformatics. There are many dedicated libraries for this including biopython. This example implements the Needleman Wunsch algorithm:

>>> from Bio.Align import PairwiseAligner
>>> aligner = PairwiseAligner()
>>> aligner.score(a1, a2)
200.0
>>> aligner.algorithm
'Needleman-Wunsch'

Using biopython or another bioinformatics package is more flexible than any part of the python standard library since many different scoring schemes and algorithms are available. Also, you can actually get the matching sequences to visualise what is happening:

>>> alignment = next(aligner.align(a1, a2))
>>> alignment.score
200.0
>>> print(alignment)
Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-Apple-
|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-|||-|-
App-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-elApp-el

The Answer 8

6 people think this answer is useful

The builtin SequenceMatcher is very slow on large input, here’s how it can be done with diff-match-patch:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)

    # similarity
    common_text = sum([len(txt) for op, txt in diff if op == 0])
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length

    return sim, diff

The Answer 9

2 people think this answer is useful

You can find most of the text similarity methods and how they are calculated under this link: https://github.com/luozhouyang/python-string-similarity#python-string-similarity Here some examples;

  • Normalized, metric, similarity and distance

  • (Normalized) similarity and distance

  • Metric distances

  • Shingles (n-gram) based similarity and distance
  • Levenshtein
  • Normalized Levenshtein
  • Weighted Levenshtein
  • Damerau-Levenshtein
  • Optimal String Alignment
  • Jaro-Winkler
  • Longest Common Subsequence
  • Metric Longest Common Subsequence
  • N-Gram
  • Shingle(n-gram) based algorithms
  • Q-Gram
  • Cosine similarity
  • Jaccard index
  • Sorensen-Dice coefficient
  • Overlap coefficient (i.e.,Szymkiewicz-Simpson)

The Answer 10

1 people think this answer is useful

There are many metrics to define similarity and distance between strings as mentioned above. I will give my 5 cents by showing an example of Jaccard similarity with Q-Grams and an example with edit distance.

The libraries

from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance  import edit_distance

Jaccard Similarity

1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))

and we get:

0.33333333333333337

And for the Apple and Mango

1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))

and we get:

0.0

Edit Distance

edit_distance('Apple', 'Appel')

and we get:

2

And finally,

edit_distance('Apple', 'Mango')

and we get:

5

Cosine Similarity on Q-Grams (q=2)

Another solution is to work with the textdistance library. I will provide an example of Cosine Similarity

import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')

and we get:

0.5

The Answer 11

0 people think this answer is useful

Textdistance:

TextDistance – python library for comparing distance between two or more sequences by many algorithms. It has Textdistance

  • 30+ algorithms
  • Pure python implementation
  • Simple usage
  • More than two sequences comparing
  • Some algorithms have more than one implementation in one class.
  • Optional numpy usage for maximum speed.

Example1:

import textdistance
textdistance.hamming('test', 'text')

Output:

1

Example2:

import textdistance

textdistance.hamming.normalized_similarity('test', 'text')

Output:

0.75

Thanks and Cheers!!!

The Answer 12

0 people think this answer is useful

Here’s what i thought of:

import string

def match(a,b):
    a,b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
            error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total-error)/total

if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))

Add a Comment