
I'd like to compare how the same word is used in different sources, that is, how authors differ in their usage of ill-defined words such as "democracy".

A brief plan:

  1. Take the books mentioning the term "democracy" as plain text
  2. In each book, replace democracy with democracy_%AuthorName%
  3. Train a word2vec model on these books
  4. Calculate the distance between democracy_AuthorA, democracy_AuthorB, and other relabeled mentions of "democracy"

So each author's "democracy" gets its own vector, which is used for comparison.
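
For concreteness, here is a minimal sketch of steps 1-4 using gensim's Word2Vec; the file paths and author labels are hypothetical placeholders:

```python
import re
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical mapping from author name to a plain-text book file.
books = {"AuthorA": "authora_book.txt", "AuthorB": "authorb_book.txt"}

sentences = []
for author, path in books.items():
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = simple_preprocess(line)
            # Step 2: relabel every mention of "democracy" per author.
            tokens = [f"democracy_{author}" if t == "democracy" else t
                      for t in tokens]
            if tokens:
                sentences.append(tokens)

# Step 3: train word2vec on the relabeled corpus.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)

# Step 4: compare the per-author "democracy" vectors.
print(model.wv.similarity("democracy_AuthorA", "democracy_AuthorB"))
```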

But it seems that word2vec needs far more text than a handful of books to train reliable vectors, and each relabeled word occurs in only a subset of the books. The official page recommends datasets containing billions of words.

How large should one author's subset of books be to make such an inference with word2vec, or are there alternative tools better suited to this?


1 Answer


It sounds like Doc2Vec (or paragraph/context vectors) might be the right fit for this problem.

In a nutshell, in addition to the word vectors, you add a "context vector" (in your case, an embedding for the author) that is used to predict the center or context words.

This means you would benefit from all of the data about "democracy" across authors while also extracting a per-author embedding; combined, these should let you analyze each author's bias even with limited data per author.

You can use gensim's implementation; the documentation includes links to the source papers.
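
As an illustration, here is a minimal sketch with gensim's Doc2Vec, assuming one plain-text file per author (file names are hypothetical); each book is tagged with its author, so the model learns one context vector per author alongside the word vectors:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical mapping from author name to a plain-text book file.
books = {"AuthorA": "authora_book.txt", "AuthorB": "authorb_book.txt"}

corpus = []
for author, path in books.items():
    with open(path, encoding="utf-8") as f:
        tokens = simple_preprocess(f.read())
    # One TaggedDocument per book, tagged with the author.
    corpus.append(TaggedDocument(words=tokens, tags=[author]))

# PV-DM (dm=1): the author vector is used alongside context words to
# predict each center word, so it absorbs author-specific usage.
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=2,
                dm=1, epochs=20)

# Compare authors via their context vectors...
print(model.dv.similarity("AuthorA", "AuthorB"))
# ...or see which words sit closest to an author's vector.
print(model.wv.most_similar([model.dv["AuthorA"]], topn=10))
```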
