
I am new to machine learning and tried Doc2Vec on the Quora duplicate-questions dataset. new_dfx has columns 'question1' and 'question2', which contain the preprocessed questions. Here is a tagged-document sample:

input:

q_arr = np.append(new_dfx['question1'].values, new_dfx['question2'].values)
tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(q_arr)]
tagged_data1[50001]

output:

TaggedDocument(words=['senseless', 'movi', 'like', 'dilwal', 'happi', 'new', 'year', 'earn', 'easi', '100', 'crore', 'india'], tags=['50001'])

Input:

model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
train_documents1  = utils.shuffle(tagged_data1)
model_dbow1.train(tagged_data1,total_examples=len(train_documents1), epochs=30)

-- to check if model trained right

model_dbow1.most_similar('senseless')

Error:

KeyError: "word 'senseless' not in vocabulary"

The data I gave the model as training input contains the word "senseless", so why this error? Other words do return output. Could anyone please help?

2 Answers


This is Doc2Vec, not Word2Vec, so I don't think you give a word to most_similar(). So instead of:

model_dbow1.most_similar('senseless')

I think you would do:

model_dbow1.docvecs.most_similar('50001')

Or, alternatively if you did want to search for a one-word sentence:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.most_similar([vector])

(The above is a bit of guesswork, based on the online docs for Doc2Vec and some tutorials (e.g. this one, and Sentence similarity using Doc2vec). Where possible, give a fully reproducible example that we can test against.)
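For intuition, most_similar (whether looking up a word, a tag, or a raw vector) ranks candidates by cosine similarity between vectors. A minimal sketch of that ranking in plain NumPy, using made-up toy vectors in place of trained document vectors (no gensim required):

```python
import numpy as np

def most_similar(query, vectors, topn=3):
    """Rank stored vectors by cosine similarity to the query vector."""
    sims = []
    for tag, vec in vectors.items():
        cos = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        sims.append((tag, float(cos)))
    # Highest cosine similarity first, truncated to topn results
    return sorted(sims, key=lambda pair: -pair[1])[:topn]

# Toy 2-d vectors standing in for trained document vectors
vectors = {
    '0': np.array([1.0, 0.0]),
    '1': np.array([0.9, 0.1]),
    '2': np.array([0.0, 1.0]),
}
print(most_similar(np.array([1.0, 0.0]), vectors))
```

Gensim's implementation does the same thing over its stored (and unit-normalized) vectors, just much faster.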

Darren Cook

The Doc2Vec API should be called like this:

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])

Here is a complete working example:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from tqdm import tqdm

cores = 4
# A list of documents; here a single sentence for illustration
q_arr = [" ".join(['senseless', 'movi', 'like', 'dilwal', 'happi', 'new',
                   'year', 'earn', 'easi', '100', 'crore', 'india'])]

tagged_data1 = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)])
                for i, _d in enumerate(q_arr)]

# min_count=1 keeps words that appear only once in this tiny corpus
model_dbow1 = Doc2Vec(dm=1, vector_size=300, negative=5, min_count=1,
                      workers=cores)
model_dbow1.build_vocab([x for x in tqdm(tagged_data1)])
model_dbow1.train(tagged_data1, total_examples=len(tagged_data1), epochs=30)

vector = model_dbow1.infer_vector(["senseless"])
model_dbow1.wv.most_similar([vector])

Brian Spiering