I have a huge dataset (more than 1 million data points) of text, and I am doing NER on it to identify a few entities. If I randomly choose 100 data points from the total dataset for each batch and train my model (an LSTM) for 20k such random batches, will this yield good results? Does this approximate the data properly, or do I need to run for more batches than the total number of data points?
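For concreteness, the sampling scheme described above might look roughly like the sketch below (names such as `corpus`, `load_corpus`, and `model.train_on_batch` are placeholders, not part of any specific library):

```python
import random

# Rough illustration of the scheme in the question: 100 examples drawn at
# random from the full corpus for each of 20,000 training steps, i.e.
# 20,000 x 100 = 2,000,000 draws in total.

def random_batches(corpus, batch_size=100, num_batches=20_000, seed=42):
    """Yield `num_batches` mini-batches, each sampled uniformly at random
    (without replacement within a batch) from the full corpus."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield rng.sample(corpus, batch_size)

# corpus = load_corpus(...)          # placeholder: list of (tokens, tags) pairs
# for batch in random_batches(corpus):
#     model.train_on_batch(batch)    # placeholder for the LSTM update step
```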
1 Answer
It depends entirely on your data. If your variables are mostly numerical, you can get by with small samples. If, however, you have a lot of categorical variables, you need to make sure that each category of every variable is well represented in the subsample. If they are all numerical, I would go for 1000 data points repeated 1000 times.
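A minimal sketch of that representation check, assuming each example is a (tokens, tags) pair with BIO-style NER tags (the function name and data layout are illustrative assumptions, not from the question):

```python
import random
from collections import Counter

# Check whether every entity tag that occurs in the full corpus is also
# represented in a random subsample of the given size.

def tag_coverage(corpus, sample_size=1000, seed=0):
    """Return the tag counts in a random subsample and the set of tags
    from the full corpus that the subsample misses entirely."""
    rng = random.Random(seed)
    sample = rng.sample(corpus, sample_size)

    full_tags = Counter(tag for _, tags in corpus for tag in tags)
    sample_tags = Counter(tag for _, tags in sample for tag in tags)

    missing = set(full_tags) - set(sample_tags)
    return sample_tags, missing

# sample_counts, missing = tag_coverage(corpus)
# if missing:
#     print("Entity tags absent from the subsample:", missing)
```

If rare entity types show up as missing, drawing the subsample per tag (a stratified draw) rather than purely uniformly is one way to keep every category represented.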