I read the Spark documentation, which says:
During the fitting process,
CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary.
Could anyone explain it to me more clearly?
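
To make the question concrete, here is a minimal pure-Python sketch of how the quoted passage could be read. It is not Spark's implementation: the function name fit_vocabulary, the sample documents, and the alphabetical tie-breaking are all assumptions made for illustration. The idea is that minDF first filters out terms that appear in too few documents, and vocabSize then keeps only the most frequent of the remaining terms.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Hypothetical sketch of the fitting step described in the quoted passage.

    docs       : list of tokenized documents (each a list of strings)
    vocab_size : maximum number of terms to keep
    min_df     : minimum number of documents a term must appear in;
                 a value below 1.0 is treated as a fraction of all documents
    """
    term_freq = Counter()  # total occurrences of each term across the corpus
    doc_freq = Counter()   # number of documents containing each term

    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))  # count each term once per document

    # Convert a fractional min_df into an absolute document count.
    threshold = min_df if min_df >= 1.0 else min_df * len(docs)

    # Step 1: drop terms that appear in too few documents.
    eligible = [t for t in term_freq if doc_freq[t] >= threshold]

    # Step 2: order by corpus-wide term frequency, highest first.
    # Breaking ties alphabetically is an assumption made here only to
    # keep the result deterministic.
    eligible.sort(key=lambda t: (-term_freq[t], t))

    # Step 3: keep at most vocab_size terms.
    return eligible[:vocab_size]


docs = [
    ["a", "b", "c"],
    ["a", "b", "b", "c", "a"],
    ["a", "d", "d", "d"],
]

print(fit_vocabulary(docs, vocab_size=3))              # ['a', 'b', 'd']
print(fit_vocabulary(docs, vocab_size=3, min_df=2))    # ['a', 'b', 'c']
print(fit_vocabulary(docs, vocab_size=2, min_df=0.5))  # ['a', 'b']
```

In this toy corpus "d" occurs three times but only in one document, so it makes the vocabulary with the default minDF and is dropped once minDF is 2 (or 0.5, which here means at least 1.5 of the 3 documents). Is this reading of the documentation correct?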