I have a huge dataset (more than 1 million data points) of text, and I am doing NER on it to identify a few entities. If I randomly choose 100 data points from the total dataset for each batch and train my model (an LSTM) for 20k such random batches, will this yield good results? Does this approximate the data properly, or do I need to run for more batches than the total number of data points?
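For concreteness, the sampling scheme described above might look roughly like the sketch below (names such as `corpus`, `load_corpus`, and `model.train_on_batch` are placeholders, not part of any specific library):

```python
import random

# Rough illustration of the scheme in the question: 100 examples drawn at
# random from the full corpus for each of 20,000 training steps, i.e.
# 20,000 x 100 = 2,000,000 draws in total.

def random_batches(corpus, batch_size=100, num_batches=20_000, seed=42):
    """Yield `num_batches` mini-batches, each sampled uniformly at random
    (without replacement within a batch) from the full corpus."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield rng.sample(corpus, batch_size)

# corpus = load_corpus(...)          # placeholder: list of (tokens, tags) pairs
# for batch in random_batches(corpus):
#     model.train_on_batch(batch)    # placeholder for the LSTM update step
```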
1 Answer
It depends entirely on your data. If your variables are mostly numerical, you can get by with small samples. If, however, you have a lot of categorical variables, you need to make sure that each category of every variable is well represented in the subsample. If they are all numerical, I would go for 1000 data points repeated 1000 times.
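A minimal sketch of that representation check, assuming each example is a (tokens, tags) pair with BIO-style NER tags (the function name and data layout are illustrative assumptions, not from the question):

```python
import random
from collections import Counter

# Check whether every entity tag that occurs in the full corpus is also
# represented in a random subsample of the given size.

def tag_coverage(corpus, sample_size=1000, seed=0):
    """Return the tag counts in a random subsample and the set of tags
    from the full corpus that the subsample misses entirely."""
    rng = random.Random(seed)
    sample = rng.sample(corpus, sample_size)

    full_tags = Counter(tag for _, tags in corpus for tag in tags)
    sample_tags = Counter(tag for _, tags in sample for tag in tags)

    missing = set(full_tags) - set(sample_tags)
    return sample_tags, missing

# sample_counts, missing = tag_coverage(corpus)
# if missing:
#     print("Entity tags absent from the subsample:", missing)
```

If rare entity types show up as missing, drawing the subsample per tag (a stratified draw) rather than purely uniformly is one way to keep every category represented.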