
I built an NLP sentence classifier that uses word-embedding vectors as features.

The training dataset is large (100k sentences), and every sentence has 930 features.

I found the best model using an automated machine learning library (auto-sklearn); training required 40 GB of RAM and 60 hours. The best model is an ensemble of the top N models found by the library.

Occasionally I need to add data to the training set and update the model. Since this AutoML library doesn't support incremental training, I have to retrain from scratch each time, which uses more and more memory and time.

How can I address this? How can I train incrementally? Should I stop using this library? Would parallelizing the training help with memory and time usage?

1 Answer


First of all, with auto-sklearn you can run

    import autosklearn.classification

    automl = autosklearn.classification.AutoSklearnClassifier()
    automl.fit(X_train, y_train, dataset_name='X_train',
               feat_type=feature_types)
    print(automl.show_models())

so you can extract the instance of the best model from the first fit. However, to learn incrementally, an sklearn model has to implement the partial_fit method. Naive Bayes variants and the other algorithms listed in scikit-learn's documentation on incremental learning support this. If none of those appear in the output of show_models, you are out of luck: in that case you would have to run your own automated model search restricted to estimators with partial_fit, as sketched below.
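For illustration, here is a minimal sketch of that approach with a single SGDClassifier rather than the auto-sklearn ensemble; X_train/y_train stand for the existing data and X_new/y_new for a later batch, and binary labels are assumed:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    classes = np.array([0, 1])   # all class labels must be declared up front
    clf = SGDClassifier()        # linear model trained with SGD, supports partial_fit

    # initial pass over the existing 100k-sentence training set
    clf.partial_fit(X_train, y_train, classes=classes)

    # later, when a new batch of labelled sentences arrives,
    # update the same model instead of retraining from scratch
    clf.partial_fit(X_new, y_new)

The trade-off is that you give up the ensemble auto-sklearn found, but each update only costs one pass over the new batch.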

An alternative is Spark, which has streaming (incremental learning) algorithms such as StreamingKMeans, StreamingLinearRegressionWithSGD, and StreamingLogisticRegressionWithSGD, the latter two built on StreamingLinearAlgorithm.
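As a rough sketch of that route (the directory path and the "label,features" line format are assumptions for the example, not part of your setup), a streaming logistic regression in PySpark could look like this:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="IncrementalSentenceClassifier")
    ssc = StreamingContext(sc, 10)  # new micro-batch every 10 seconds

    # each line of a new file: "<label>,<f1> <f2> ... <f930>"
    def parse(line):
        label, feats = line.split(",", 1)
        return LabeledPoint(float(label), [float(x) for x in feats.split()])

    train_stream = ssc.textFileStream("hdfs:///data/train/").map(parse)

    model = StreamingLogisticRegressionWithSGD(numIterations=10)
    model.setInitialWeights([0.0] * 930)  # 930 features per sentence
    model.trainOn(train_stream)           # weights updated as new batches arrive

    ssc.start()
    ssc.awaitTermination()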

To conclude, if these are your constraints I would not use auto-sklearn; instead, choose one of the alternatives that supports incremental (and parallel) training.

Noah Weber