9

Can anyone suggest algorithms and techniques for spell checking? After some googling, I found some interesting ones, such as this one from Peter Norvig, http://norvig.com/spell-correct.html, and a few others. However, most of them were written many years ago, so I am trying to find out whether there is any newer/improved approach to this problem.

edwin
  • 343
  • 1
  • 3
  • 10

4 Answers

10

Here is what I built...

  • Step 1: Store all the words in a trie data structure (see the Wikipedia article on tries).

  • Step 2: Train an RNN or RNTN to get a seq2seq mapping for words, and store the model.

  • Step 3: Retrieve the top n words by Levenshtein distance (see the Wikipedia article on edit distance) from the word that you are trying to correct.

    • Naive Approach: Calculating the edit distance between the query term and every dictionary term. Very expensive.
    • Peter Norvig's Approach: Deriving all possible terms with an edit distance <= 2 from the query term, and looking them up in the dictionary. Better, but still expensive (114,324 terms for word length = 9 and edit distance = 2). Check this.
    • Faroo's Approach: Deriving deletes only, with an edit distance <= 2, from both the query term and each dictionary term. Three orders of magnitude faster. http://blog.faroo.com/2012/06/07...
  • Step 4: Based on the previous 2, 3, or 4 words, rank the candidate words retrieved in Step 3. Select how many candidates to consider depending on the accuracy you want (of course we want 100% accuracy) and the processing power of the machine running the code. A rough sketch of Steps 1, 3, and 4 follows this list.
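To make the pipeline concrete, here is a minimal sketch of Steps 1 and 3 using Norvig-style candidate generation, with a simple frequency-based ranking standing in for the context model of Step 4. `WORD_COUNTS` is a hypothetical word-to-count dictionary you would build from your own corpus; at scale you would back the membership test with the trie from Step 1.

```python
# Norvig-style candidate generation plus frequency ranking (illustrative only).
import string

WORD_COUNTS = {"spelling": 120, "spelled": 80, "dwelling": 15}  # placeholder corpus counts

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Known words within edit distance 2, nearest edits first."""
    known = lambda words: {w for w in words if w in WORD_COUNTS}
    return (known([word])
            or known(edits1(word))
            or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
            or {word})

def correct(word):
    # Rank surviving candidates by corpus frequency; Step 4 would rank by context instead.
    return max(candidates(word), key=lambda w: WORD_COUNTS.get(w, 0))

print(correct("speling"))  # -> "spelling"
```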

Check out this approach too. It's an ACL paper from 2009; they call it language-independent auto-correction.

Hima Varsha
  • 2,366
  • 16
  • 34
4

A character-level recurrent neural network can come to the rescue here. Mainly, you will build a character-level language model using a recurrent neural network. Look into this GitHub project and also this Medium blog post to get a good understanding of the process.
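For intuition, here is a minimal sketch of such a character-level language model, assuming PyTorch; the corpus string, dimensions, and training loop are placeholders, and a real model would train on a large corpus and then score or complete candidate spellings character by character.

```python
# Tiny character-level LSTM language model (illustrative sketch).
import torch
import torch.nn as nn

corpus = "spelling correction with a character level language model"
chars = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(chars)}

class CharLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out)  # logits for the next character at each position

model = CharLM(len(chars))
ids = torch.tensor([[stoi[c] for c in corpus]])
inputs, targets = ids[:, :-1], ids[:, 1:]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):  # a few steps just to show the training loop
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, len(chars)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```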

Himanshu Rai
  • 1,858
  • 13
  • 10
1

I would assume you want to do this for the cleaning phase of your project. Microsoft Cognitive Services provides a spell check API https://www.microsoft.com/cognitive-services/en-us/bing-spell-check-api.
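A minimal sketch of calling that service over REST with `requests` might look like the following; I am assuming the v7 spellcheck endpoint and an `Ocp-Apim-Subscription-Key` header, so confirm the exact URL, parameters, and response shape against the current documentation for your subscription.

```python
# Hypothetical call to the Bing Spell Check REST API (check current docs before use).
import requests

API_KEY = "YOUR_SUBSCRIPTION_KEY"  # placeholder
endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/spellcheck"  # assumed v7 endpoint

response = requests.post(
    endpoint,
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
    params={"mkt": "en-US", "mode": "spell"},
    data={"text": "whats teh weather lyke"},
)
response.raise_for_status()
for token in response.json().get("flaggedTokens", []):
    print(token["token"], "->", [s["suggestion"] for s in token["suggestions"]])
```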

If you are trying to group words into embeddings, then I might suggest Word2Vec or GloVe. GloVe comes pre-trained, in several fixed vector dimensionalities. However, both can potentially account for misspellings as well as similarities.
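As a rough sketch, assuming gensim (4.x API names), you can train a small Word2Vec model on your own tokenized sentences or load pre-trained GloVe vectors through gensim's downloader; the toy corpus below is a placeholder.

```python
# Word2Vec on a toy corpus, plus pre-trained GloVe vectors via gensim's downloader.
from gensim.models import Word2Vec
import gensim.downloader as api

sentences = [["spell", "checking", "is", "useful"],
             ["spelling", "correction", "needs", "context"]]
w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
print(w2v.wv.most_similar("spelling", topn=3))

# Pre-trained GloVe vectors come in fixed sizes (e.g. 50/100/200/300 dimensions).
glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar("color", topn=3))
```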

Samuel Sherman
  • 346
  • 2
  • 4
-1

You might want to look at this approach using LSTM neural networks.

For bots and readers who need more text content here, I'm linking to a post that goes through a detailed development of an LSTM-based spell correction algorithm. It would be silly for me to rewrite what is clearly spelled out in the link, so what follows is the post's own description of what it does. Brackets are mine.

In this article, I will use a bi-directional LSTM in the encoding layer and multi-head attention in the decoding layer [to write a program for spelling correction -- theory is explained and code is linked]. Basically, spelling correction in the natural language processing and information retrieval literature mostly relies on pre-defined lexicons to detect spelling errors. Firstly, I will explain some other model architectures which are also used in Natural Language Processing tasks like speech recognition, spelling correction, language translation, etc.
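For a rough idea of that architecture only, here is a minimal sketch, assuming PyTorch: a bi-directional LSTM encoder over the misspelled characters and a decoder whose outputs attend over the encoder states with multi-head attention. The dimensions and vocabulary are placeholders; treat the linked post's code as the reference implementation.

```python
# Skeleton of a bi-LSTM encoder + multi-head-attention decoder (illustrative only).
import torch
import torch.nn as nn

class SpellCorrector(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(embed_dim, 2 * hidden_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.embed(src))     # (batch, src_len, 2*hidden)
        dec_out, _ = self.decoder(self.embed(tgt))     # (batch, tgt_len, 2*hidden)
        ctx, _ = self.attn(dec_out, enc_out, enc_out)  # attend over encoder states
        return self.out(ctx)                           # per-step character logits

model = SpellCorrector(vocab_size=60)
src = torch.randint(0, 60, (2, 12))   # misspelled character ids (batch of 2)
tgt = torch.randint(0, 60, (2, 12))   # shifted target character ids
logits = model(src, tgt)              # (2, 12, 60)
```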