I'm sorry if the title is misleading, but I didn't really know how to describe what I am looking for. I have a dataset with two columns holding the first names and surnames of a group of people. The same person may appear in multiple records. However, sometimes the name is entered in the surname field and vice versa, and there may also be typing mistakes. I was thinking about merging the two fields into a single string (NameSurname) in order to find similar records and fix the fields. I have looked at some string similarity metrics, but the most popular ones compare consecutive characters and would fail to recognize SurnameName and NameSurname as the same string. Is there any metric that is robust to this? Thanks a lot in advance.
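
To make the problem concrete, here is a minimal sketch in Python using only the standard library's `difflib`. The function names `plain_similarity` and `order_insensitive_similarity` are hypothetical, and sorting the two fields before comparing is just one possible workaround assumed for illustration, not an established answer to the question.

```python
from difflib import SequenceMatcher


def plain_similarity(a: str, b: str) -> float:
    """Sequence-based similarity of two merged strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def order_insensitive_similarity(name_a: str, surname_a: str,
                                 name_b: str, surname_b: str) -> float:
    """Compare two records after sorting each record's two fields.

    Sorting is an assumed workaround: it makes the comparison ignore
    which field the name and the surname were typed into.
    """
    key_a = " ".join(sorted([name_a.strip().lower(), surname_a.strip().lower()]))
    key_b = " ".join(sorted([name_b.strip().lower(), surname_b.strip().lower()]))
    return SequenceMatcher(None, key_a, key_b).ratio()


# Swapped fields: the merged strings look quite different to a
# sequence-based metric, so the score is well below 1.0.
print(plain_similarity("JohnSmith", "SmithJohn"))

# After sorting the fields, the swapped record matches exactly (1.0).
print(order_insensitive_similarity("John", "Smith", "Smith", "John"))

# A swap combined with a typo still scores high, but below 1.0.
print(order_insensitive_similarity("Jon", "Smith", "Smith", "John"))
```

A heavy typo in the first letters could change the sort order and defeat this trick, so it only illustrates the kind of robustness I am after.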

cilewu

0 Answers