I have a dataset whose main column is a free-text description of a product. The task is to extract the product name from this column, even though the name is sometimes misspelled or otherwise altered. There are more than a thousand possible product names.
Currently I'm using regular expressions built from the list of product names to find and extract the product name in each row of the dataset, but this does not work well when the product name is misspelled.
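To make the problem concrete, here is a minimal sketch of the regex approach. The product names and the function name `extract_exact` are invented for illustration; the real list and code differ.

```python
import re

# Hypothetical product names; the real list has more than a thousand entries.
PRODUCT_NAMES = ["ALPHA BETAMAX", "GAMMA DELTOL"]

# One alternation pattern, longest names first so a longer name is tried
# before any shorter name that happens to be its prefix.
pattern = re.compile(
    "|".join(
        re.escape(name)
        for name in sorted(PRODUCT_NAMES, key=len, reverse=True)
    ),
    flags=re.IGNORECASE,
)


def extract_exact(description):
    """Return the first product name found verbatim (ignoring case), or None."""
    match = pattern.search(description)
    return match.group(0).upper() if match else None


# Correct spelling is found, a misspelled name is missed entirely.
print(extract_exact("Medical product used in cancer treatment alpha betamax"))
print(extract_exact("Medical product used in cancer treatment ALFA BETAMAXX"))
```

The first call finds `ALPHA BETAMAX`, while the second returns `None`, which is exactly the failure I am trying to fix.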
I already have more than 50,000 rows in which the product was extracted by hand (for the cases where the description contains a product from the list). I am wondering whether, after breaking each description into multiple pieces (tokenization), I could apply some classification or search method to detect whether a product is present in the description and, if so, which one.
Example: Medical product used in cancer treatment "PRODUCT XXXXYYYY" where the XXXX and YYYY are the first and second name of the product.
I think most of the description would go unused: as in the example above, knowing that something is a medical product does not help at all, since there are many such products. The misspelled words that identify a specific product, however, would be useful. I am open to suggestions on how to deal with these spelling errors.
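One direction I can imagine is fuzzy matching: split the description into short runs of words and compare each run against the product list with a string-similarity score. The following is only a rough sketch using `difflib` from the standard library. The product names, the helper names `candidate_spans` and `extract_fuzzy`, and the `cutoff` of 0.8 are assumptions for illustration, not tuned values.

```python
import difflib

# Hypothetical product names; the real list has more than a thousand entries.
PRODUCT_NAMES = ["ALPHA BETAMAX", "GAMMA DELTOL"]


def candidate_spans(description, max_words=2):
    """Yield every run of 1..max_words consecutive words, upper-cased."""
    words = description.upper().split()
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def extract_fuzzy(description, cutoff=0.8):
    """Return the product name most similar to any span, or None.

    Similarity is difflib's ratio in [0, 1]; spans scoring below
    `cutoff` against every name are ignored.
    """
    best_name, best_score = None, cutoff
    for span in candidate_spans(description):
        for name in PRODUCT_NAMES:
            score = difflib.SequenceMatcher(None, span, name).ratio()
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


print(extract_fuzzy("Medical product used in cancer treatment ALFA BETAMAXX"))
print(extract_fuzzy("Medical product used in cancer treatment"))
```

In this toy case the misspelled `ALFA BETAMAXX` is mapped back to `ALPHA BETAMAX`, and a description without any product gives `None`. Comparing every span with every name is slow for a thousand names and tens of thousands of rows, so I assume a real solution would need a faster matcher or some indexing, and the hand-labeled rows could be used to choose the cutoff.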