Regex in R as a list for Quanteda

Question

R newbie here. I'm doing some text analysis using the package quanteda. Basically, what I'm trying to do is collect all the words that match the regex pattern child|(care), i.e. capture any text that includes either of the words "child" or "care".
To do this, I can create a list and then use the dictionary function:
childcare_list <- c("child","care")
word_dict <- dictionary(list(childcare = childcare_list))

However, how could I incorporate regex and do this for other patterns which would be tedious to type up manually as in the first line? For example, I may want to capture something like
\bC\w?V\w?D\-19 which captures possible typos of "COVID-19" e.g. "CiVID-19", "CpVID-19".
I could of course do covid_list <- c("CiVID-19", "CpVID-19", ...) but that would be too manual. As well, it doesn't use the \b anchor.

Basically, I'm asking whether it's possible to build a list that contains every string a regex could match.

score 1 · Answer 1 · answered Apr 30 '21 at 02:57

This doesn't seem like a great task for regex: even your pattern would miss very close typos like COWID-19 or potential OCR mistakes like C0VID-I9. Instead, I'd suggest using the stringdist package to do fuzzy matching, perhaps stringdist::afind to find approximate matches of "COVID-19". You can read a bit about it here.
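
As a rough illustration of the afind idea, here is a minimal sketch. The sample texts and the cutoff of 2 are made up for the example, and it assumes a stringdist version recent enough to include afind. It slides a window the length of the pattern over each text and reports the best-matching window:

```r
library(stringdist)

# Hypothetical sample texts, not from the original question
texts <- c("Cases of CiVID-19 rose", "child care costs")

# afind() returns a list of matrices (location, distance, match),
# with one row per text and one column per pattern
hits <- afind(texts, "COVID-19", method = "osa")

# Treat a text as containing the term if its best window is within
# an assumed maximum distance of 2 edits
found <- hits$distance[, 1] <= 2

print(hits$match[, 1])  # best-matching window in each text
print(found)
```

The cutoff is a judgment call: too low misses OCR-style errors, too high starts matching unrelated words.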

This will let you select from a variety of string distance algorithms and set a maximum distance. You could then, e.g., correct matches to "COVID-19" and proceed with your analysis.
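
One way the "correct, then proceed" step might look is sketched below. It is an assumption-laden example, not the only approach: the helper correct_typos, the sample texts, the whitespace tokenization, and the choice of Levenshtein distance with a maximum of 2 are all illustrative.

```r
library(stringdist)

# Hypothetical sample texts, not from the original question
texts <- c("Cases of CiVID-19 rose", "New C0VID-I9 rules", "child care costs")

target   <- "COVID-19"
max_dist <- 2  # assumed tolerance; tune for your data

# Illustrative helper: replace any whitespace-separated word within
# max_dist Levenshtein edits of the target with the target itself
correct_typos <- function(text, target, max_dist) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  d <- stringdist(words, target, method = "lv")
  words[d <= max_dist] <- target
  paste(words, collapse = " ")
}

corrected <- vapply(texts, correct_typos, character(1),
                    target = target, max_dist = max_dist,
                    USE.NAMES = FALSE)
print(corrected)
```

The corrected character vector can then be passed on to quanteda, where a plain dictionary entry for "COVID-19" would pick up the normalized spellings.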
