I'm working with a bag of words in R:

library(tm)
corpus = VCorpus(textsource)
dtm = DocumentTermMatrix(corpus)
dtm = as.matrix(dtm)

I use the matrix dtm to train a lasso model.

Now I want to predict on new (unseen) text. The problem is that I need to generate a new dtm for prediction with exactly the same columns as the original dtm used for model training.

Essentially, I need to populate the original dtm (as used for training) with new text.

Example: "original text" would yield this dtm for training:

original | text
1          1

New (unseen) text, e.g. "new text", should then yield this dtm for prediction:

original | text
0          1

Q: What is the most efficient way to populate an existing document term matrix / bag of words with new (text) data in R?
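
One possible direction, sketched here rather than offered as a definitive answer: tm's `control` list accepts a `dictionary` entry that restricts the tabulated terms to a given character vector, so the training vocabulary can be passed in when building the matrix for the new text. The names `train_text`, `new_text`, `train_dtm`, `new_dtm` and `new_x` are hypothetical, and `VectorSource` is assumed as a stand-in for `textsource`.

```r
library(tm)

# Hypothetical inputs taken from the example in the question
train_text <- c("original text")
new_text   <- c("new text")

# Training matrix, built as in the question
train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(train_text)))

# Build the prediction matrix against the training vocabulary only:
# terms outside the dictionary (here "new") are dropped, and
# dictionary terms missing from the new text get a count of 0
new_dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(new_text)),
  control = list(dictionary = Terms(train_dtm))
)

# Convert to a dense matrix and force the training column order
new_x <- as.matrix(new_dtm)[, Terms(train_dtm), drop = FALSE]
```

Any preprocessing applied to the training corpus (lowercasing, stemming, stop word removal) would presumably need to be applied to the new corpus in the same way, otherwise the terms will not match the dictionary. For a large vocabulary, keeping the matrix sparse instead of calling `as.matrix` is likely to be cheaper.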

Peter