VANAUDEL ANALYTIX
Follow Us
  • Home
  • About
  • Python Blog
  • SAS Blog
  • Open Data Science
  • Consulting
  • Contact

Fuzzy Matching using Fuzzywuzzy

1/26/2019

1 Comment

 
Have you ever wanted to determine just how similar two strings are? Using the equal sign returns whether two strings are the same or not. It does not give any information or measure regarding how similar or dissimilar two strings are.
Picture
Python’s Fuzzywuzzy library contains many methods that can be used to compute a similarity measure for two strings. The Fuzzywuzzy library contains a module called fuzz that contains several methods that can be used to compare two strings and return a value from 0 to 100 as a measure of similarity.
Picture
Strings that are completely different would have a similarity score of 0 whereas strings that a completely identical would have a similarity score of 100. Strings with some level of similarity would fall between 0 and 100.

Fuzzywuzzy uses Levenshtein Distance to calculate the differences between sequences. This library also utilises the difflib library under the hood of all of its calculations. 

The string comparison methods within the fuzz module of the Fuzzywuzzy library are:
  • QRatio
  •  UQRatio
  • UWRatio
  • WRatio
  • partial_ratio
  • partial_token_set_ratio
  • partial_token_sort_ratio
  • ratio
  • token_set_ratio
  • token_sort_ratio

​The following code demonstrates how some of these methods work on strings of varying similarity.
​
Picture
Fuzzywuzzy also contains a module called process which contains methods for determining how similar one string is to a list of other strings.

The methods in the process module are as follows:
  • extract
  • extractBests
  • extractOne
  • extractWithoutOrder
  • dedupe

The following code demonstrates how some of these methods work.
​
Picture
The Fuzzywuzzy library has many use cases. Imagine trying to determine how similar names, addresses, telephone numbers and dates are between two datasets. This library can be used to find matching records between datasets where a similarity score makes more sense over a definitive True or False regarding equality.
​

Happy Learning!
The SAS Mom
View my profile on LinkedIn
1 Comment
Data Cabling Olathe link
8/2/2022 09:55:45 pm

Lovely blog you have here

Reply



Leave a Reply.

    Author

    My name is Vanessa Afolabi also known as @TheSASMom. I am a Data Scientist fluent in SAS, R, Python and SQL with a passion for Machine Learning and Research.

    Archives

    September 2019
    August 2019
    January 2019

    Categories

    All

    RSS Feed

Powered by Create your own unique website with customizable templates.