
I am new to machine learning and tried Doc2Vec on the Quora duplicate-questions dataset. new_dfx has columns 'question1' and 'question2', which contain the preprocessed questions. Here is a tagged-document sample:

input:

q_arr = np.append(new_dfx['question1'].values, new_dfx['question2'].values)
tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(q_arr)]
tagged_data1[50001]

output:

TaggedDocument(words=['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'], tags=['50001'])

Input:

model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
train_documents1  = utils.shuffle(tagged_data1)
model_dbow1.train(tagged_data1,total_examples=len(train_documents1), epochs=30)

-- to check if model trained right

model_dbow1.most_similar('senseless')

Error:

KeyError: "word 'senseless' not in vocabulary"

The data I gave the model as training input contains the word "senseless", so why this error? Other words do return output. Could anyone please help?

2 Answers


This is Doc2Vec, not Word2Vec, so I don't think you give a word to most_similar(). So instead of:

model_dbow1.most_similar('senseless')

I think you would do:

model_dbow1.docvecs.most_similar('50001')

Or, alternatively if you did want to search for a one-word sentence:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.most_similar([vector])

(The above is a bit of guesswork, based on the online docs for Doc2Vec and some tutorials (e.g. this one, and Sentence similarity using Doc2vec). Where possible, give a fully reproducible example that we can test against.)
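For intuition, most_similar (whether looking up a word, a tag, or a raw vector) ranks candidates by cosine similarity between vectors. A minimal sketch of that ranking in plain NumPy, using made-up toy vectors in place of trained document vectors (no gensim required):

```python
import numpy as np

def most_similar(query, vectors, topn=3):
    """Rank stored vectors by cosine similarity to the query vector."""
    sims = []
    for tag, vec in vectors.items():
        cos = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        sims.append((tag, float(cos)))
    # Highest cosine similarity first, truncated to topn results
    return sorted(sims, key=lambda pair: -pair[1])[:topn]

# Toy 2-d vectors standing in for trained document vectors
vectors = {
    '0': np.array([1.0, 0.0]),
    '1': np.array([0.9, 0.1]),
    '2': np.array([0.0, 1.0]),
}
print(most_similar(np.array([1.0, 0.0]), vectors))
```

Gensim's implementation does the same thing over its stored (and unit-normalized) vectors, just much faster.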

Darren Cook

The Doc2Vec API should be called like this:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])

Here is a complete working example:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from tqdm import tqdm

cores = 4
# A list of documents; here a single sentence for illustration
q_arr = [" ".join(['senseless', 'movi', 'like', 'dilwal', 'happi', 'new',
                   'year', 'earn', 'easi', '100', 'crore', 'india'])]

tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
                for i, _d in enumerate(q_arr)]

# min_count=1 keeps words that appear only once in this tiny corpus
model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, min_count=1,
                      workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
model_dbow1.train(tagged_data1, total_examples=len(tagged_data1), epochs=30)

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])

Brian Spiering