How to deal with spelling errors in NLP

Question

I have some data in which the main column is the description of a product. The main task is to extract the product name from this column, where it is sometimes misspelled or amended with other words. I have more than a thousand possible product names.

Currently I'm just using regular expressions with a list of product names to find and extract the product names in each row of the dataset, but it's not working well in the cases of misspelled product names.

I already have more than 50,000 rows in which the product was extracted by hand, for the cases where a product from the list appears in the column. I am wondering whether, after breaking each description into tokens (tokenization), I could apply some classification/search method to detect whether a product is mentioned in the description and, if so, which one.

Example: Medical product used in cancer treatment "PRODUCT XXXXYYYY" where the XXXX and YYYY are the first and second name of the product.

I think most of the description would go unused because, as in the example above, simply being a medical product would not help at all since there are several possibilities, but the presence of the misspelled words that describe a specific product would be helpful. I am open to suggestions on how to deal with these spelling errors.

Peter · Answer 1 · 2021-11-27T21:52:12.527

Other options would be to...

Compare similar text sequences
Compare similar string sequences
Use fuzzy matching

Fuzzy Matching:

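library(dplyr)      # provides %>%, group_by() and top_n()
library(fuzzyjoin)
# https://stackoverflow.com/questions/26405895/how-can-i-match-fuzzy-match-strings-from-two-datasets
a <- data.frame(name = c('Ace Co', 'Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),
                price = c(10, 13, 2, 1, 15, 1))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.', 'asdf'),
                qty = c(9, 99, 10))
# Find matches: pair every name in a with every name in b, then keep the closest one
stringdist_join(a, b,
                by = "name",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",            # Jaro-Winkler family of string distances
                max_dist = 99,            # effectively no cutoff, so all pairs are kept
                distance_col = "dist"
) %>%
  group_by(name.x) %>%
  top_n(1, -dist)                         # smallest distance per name in a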

This gives a "distance" between each name and its closest match (0 means identical). So if you know the real product name, you might be able to find the erroneous names via the distance.

# A tibble: 6 x 5
# Groups:   name.x [6]
  name.x price name.y       qty   dist
  <fct>  <dbl> <fct>      <dbl>  <dbl>
1 Ace Co    10 Ace Co.        9 0.0476
2 Bayes     13 Bayes Inc.    99 0.167 
3 asd        2 asdf          10 0.0833
4 Bcy        1 Bayes Inc.    99 0.378 
5 Baes      15 Bayes Inc.    99 0.2   
6 Bays       1 Bayes Inc.    99 0.2
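
For the original question, the same idea can be applied to windows of tokens inside each description instead of whole cells. The following is only a minimal Python sketch under stated assumptions: the product list, the example description and the 0.8 threshold are made up for illustration, and the similarity comes from the standard library's difflib.SequenceMatcher rather than the Jaro-Winkler distance used above.

```python
from difflib import SequenceMatcher

# Hypothetical product list, for illustration only.
PRODUCTS = ["Bayes Inc", "Ace Co", "Gamma Labs"]


def similarity(a, b):
    """Similarity in [0, 1] from difflib (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_product(description, products=PRODUCTS, threshold=0.8):
    """Return (product, score) for the best-matching token window, or None."""
    tokens = description.split()
    best = None
    for product in products:
        width = len(product.split())
        # Slide a window with as many tokens as the product name has.
        for i in range(len(tokens) - width + 1):
            window = " ".join(tokens[i:i + width])
            score = similarity(window, product)
            if best is None or score > best[1]:
                best = (product, score)
    # Only report a match that clears the (assumed) threshold.
    if best is not None and best[1] >= threshold:
        return best
    return None


print(best_product("Medical product used in cancer treatment Bayse Inc"))
```

The threshold trades missed misspellings against false matches, so it would need tuning on the hand-labelled rows.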

Alternatively, you could look into similarity in string sequences:

If your product names are fairly distinctive as character sequences and the erroneous names get only part of the name wrong, you could try something like:

library(dplyr)
library(tidytext)
library(fuzzyjoin)
library(tokenizers)
##############################
# Compare text sequences
text1 = "Hi my name is Bixi and I like cycling a lot. It is just great!"
mytext1 = tibble(text1)
text2 = "Hi my name is Lissi and I'm good in swimming. It is just great!"
mytext2 = tibble(text2)
# Split each text into overlapping 4-word sequences
ngram1 = unnest_tokens(mytext1, ngram, text1, token = "ngrams", n = 4)
ngram2 = unnest_tokens(mytext2, ngram, text2, token = "ngrams", n = 4)
# Find matching sequence(s)
semi_join(ngram1, ngram2, by = "ngram")
##############################
# Compare sequences of single letters (10-character shingles)
ngram3 = tokenize_character_shingles(mytext1$text1, n = 10, n_min = 10, strip_non_alphanum = FALSE)
ngram4 = tokenize_character_shingles(mytext2$text2, n = 10, n_min = 10, strip_non_alphanum = FALSE)
# The tokenizer returns a list; take its first element and use the same column name in both
ngram3 = data.frame(shingle = ngram3[[1]])
ngram4 = data.frame(shingle = ngram4[[1]])
# Find matching sequences of single letters
semi_join(ngram3, ngram4, by = "shingle")
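
The shared-shingle idea can also be turned into a single similarity score. Below is a rough Python sketch, not part of the R code above: it builds character shingles with plain string slicing and scores two strings by the Jaccard overlap of their shingle sets. The function names and the default shingle length of 3 are assumptions chosen for illustration.

```python
def char_shingles(text, n=3):
    """Return the set of overlapping n-character shingles of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=3):
    """Jaccard overlap of the two shingle sets: 0.0 (disjoint) to 1.0 (same set)."""
    sa, sb = char_shingles(a, n), char_shingles(b, n)
    if not sa and not sb:
        return 0.0  # both strings are shorter than n, nothing to compare
    return len(sa & sb) / len(sa | sb)


text1 = "Hi my name is Bixi and I like cycling a lot. It is just great!"
text2 = "Hi my name is Lissi and I'm good in swimming. It is just great!"

# Shingles of length 10 that occur in both texts, as in the R example.
shared = char_shingles(text1, 10) & char_shingles(text2, 10)
print(sorted(shared))
print(jaccard("Bayes Inc", "Bayse Inc"))
```

Short shingles tolerate more typos but also produce more accidental overlap, so the length is worth tuning against known product names.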

You may have a look at my GitHub, which has some more related options.

score 5 · Answer 2 · answered Dec 22 '19 at 16:59

Two approaches to correcting misspellings:

Make your own dictionary of corrections, for example:

mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization', 'pokémon': 'pokemon'}

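def correct_spelling(x, dic):
    # Replace every occurrence of each misspelling with its correction.
    # Note: str.replace matches substrings, so a key can also match inside a longer word.
    for word, replacement in dic.items():
        x = x.replace(word, replacement)
    return x

# df is assumed to be an existing pandas DataFrame with a 'treated_question' column
df['treated_question'] = df['treated_question'].apply(lambda x: correct_spelling(x, mispell_dict))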
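
Because the function above replaces substrings, a key can also fire inside a longer word. One possible variant, sketched here with a much shorter, hypothetical correction table, compiles all keys into a single regular expression and only replaces whole words. It is case-sensitive and assumes every key starts and ends with a letter or digit.

```python
import re

# Hypothetical, shortened correction table for illustration.
corrections = {"colour": "color", "sallary": "salary", "howdo": "how do"}

# One alternation of all keys (longest first), matched only at word boundaries.
pattern = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(corrections, key=len, reverse=True))
    + r")\b"
)


def correct_spelling_whole_words(text):
    """Replace each whole-word misspelling with its correction in one pass."""
    return pattern.sub(lambda m: corrections[m.group(0)], text)


print(correct_spelling_whole_words("howdo I negotiate a sallary for colourful work"))
# -> how do I negotiate a salary for colourful work
```

A single compiled pattern also scans each text once instead of once per dictionary entry, which may matter with a thousand product names.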

OR

There is an automated library that deals with the most common ones; check here.

score 3 · Answer 3 · answered Dec 22 '19 at 21:43

You can try FlashText to easily register alternative spellings as you find them, and FuzzyWuzzy to get the similarity between word-token n-grams.

Syenix


score 1 · Answer 4 · answered Dec 23 '19 at 20:15

The problem with misspellings is usually less that they exist than that they make it unclear how specific entries should be classified. It depends somewhat on how bad the misspellings are: someone typing StackExchaneg probably meant StackExchange, and I would be comfortable classifying the former as an alias for the latter.

If the spelling mistakes are mostly like that, then you could hand-craft a list of aliases by reviewing similar spellings (a fuzzy matching algorithm, as suggested by @Peter, would be a good way to identify potential misspellings without too much upfront work). Even starting with fuzzy matching this could be tedious, but there is often no legitimate way to avoid tedium in data science (especially with what is currently an unsupervised classification problem).
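
As a rough illustration of using fuzzy matching to draft such an alias list for manual review, here is a Python sketch built on the standard library's difflib.get_close_matches. The observed strings, the product names and the 0.8 cutoff are all hypothetical, and every proposed alias would still need a human check.

```python
from difflib import get_close_matches


def propose_aliases(observed, products, cutoff=0.8):
    """Map each observed string that is not an exact product name
    to its closest product name, if one is similar enough."""
    known = set(products)
    proposals = {}
    for name in observed:
        if name in known:
            continue  # exact matches need no alias
        match = get_close_matches(name, products, n=1, cutoff=cutoff)
        if match:
            proposals[name] = match[0]
    return proposals


# Hypothetical strings seen in the data and hypothetical true names.
observed = ["StackExchaneg", "StackExchange", "Stack Overflw", "Wikipedia"]
products = ["StackExchange", "Stack Overflow"]

for alias, product in propose_aliases(observed, products).items():
    print(alias, "->", product)
```

Strings with no sufficiently close product name (here "Wikipedia") are left out, which keeps the review list short.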

If there are more extreme misspellings, or there is potential confusion about which "true" category a misspelled entry should belong to, the soundest approach is probably to have multiple people review those edge cases for classification. Then compute inter-rater reliability (e.g., Cohen's kappa for a pair of raters) on those cases to indicate how certain (or uncertain) you are that a misspelled entry belongs to a particular category.
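
For two raters, Cohen's kappa can be computed directly from the two label lists as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below uses made-up labels, and the function name is my own.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: which product two reviewers assigned to four misspelled entries.
rater_a = ["X", "X", "Y", "Y"]
rater_b = ["X", "X", "Y", "X"]
print(cohens_kappa(rater_a, rater_b))  # -> 0.5
```

With more than two reviewers, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.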
