3

I have a huge dataset (more than 1 million data points). My dataset is text, and I am doing NER on it to identify a few entities. If I randomly choose 100 data points at a time from the total dataset and train my model (an LSTM), will this yield good results? I will be running 20k random batches. Does this approximate the data properly, or do I need to run more batches than the total number of data points?
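A minimal sketch of the sampling scheme described above, assuming a plain-Python corpus; `corpus`, `train_step`, and the placeholder contents are hypothetical stand-ins, not from the original post. Note the arithmetic: 20,000 batches of 100 points is 2,000,000 draws, i.e. roughly two passes over a 1M-point dataset in expectation.

```python
import random

# Stand-in corpus: in the question this would be >1M labeled sentences.
# Each item is a (tokens, labels) pair; the contents are placeholders.
corpus = [([f"tok{i}"], ["O"]) for i in range(1_000_000)]

BATCH_SIZE = 100
NUM_BATCHES = 20_000

def train_step(batch):
    # Placeholder for one LSTM gradient step on `batch`.
    pass

for step in range(NUM_BATCHES):
    # Each batch is an independent uniform draw of 100 points from the
    # full corpus (sampling without replacement within a batch, with
    # replacement across batches).
    batch = random.sample(corpus, BATCH_SIZE)
    train_step(batch)

# 20,000 batches * 100 points = 2,000,000 draws, so each of the 1M
# points is seen about twice on average -- roughly two epochs in
# expectation, though some points will be drawn more often by chance.
```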

rawwar

1 Answer

2

It depends entirely on your data. If your variables are mostly numerical, then you can get by with small samples. If, however, you have a lot of categorical variables, you need to make sure that each category of every variable is well represented in the subsample. If they are all numerical, I would go for 1,000 data points repeated 1,000 times.
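As a rough illustration of the point about categorical representation, here is a sketch of stratified subsampling that caps the draw per label so rare categories are not missed. The label names, example sentences, and counts are hypothetical, not from the original post.

```python
import random
from collections import defaultdict

# Hypothetical labeled examples: (text, entity_label) pairs with one
# deliberately rare category ("PER") to show why stratification helps.
data = ([("Acme Corp opened", "ORG")] * 500
        + [("Paris is lovely", "LOC")] * 300
        + [("Jane arrived", "PER")] * 50)

def stratified_sample(examples, n_per_class):
    """Draw up to n_per_class examples from each label so every
    category of the variable stays represented in the subsample."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    sample = []
    for label, items in by_label.items():
        k = min(n_per_class, len(items))
        sample.extend(random.sample(items, k))
    random.shuffle(sample)
    return sample

# A purely uniform 100-point draw could easily contain few or no "PER"
# examples; this subsample is guaranteed to include all 50 of them.
subsample = stratified_sample(data, n_per_class=100)
```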

user2974951