4

I have a data set of 36K processes each with 7 features. Each set represents a task and I know how long it took to complete each of the tasks. I want to build a model which will be able to predict the duration of the future tasks.

I've only learned about Decision Tree (DT) and tried to apply it to my problem. The resulting accuracy score is 0.03. I believe DT is not suitable because the time is continuous and DT is meant for categorization.

Which algorithm is suitable for duration prediction?

My environment: Python with sklearn, if that matters.

Slawa
  • 149
  • 1
  • 3

3 Answers3

4

What others have said is accurate, that you need to build a regression model of some sort. Depending on the scale of the duration of your task, you will have to model it slightly differently.

In fact, there's an entire class of models that try to predict duration. These are called survival models. Here's a Python library on survival analysis.

But these models are fairly academic. A typical hack is to use Gamma, Poison, or Log-Normal regression which works out nicely because these models predict non-negative values.

-1

You might want to look into scikit's DecisionTreeRegressor() class. A decision tree regressor will predict a real number. The decision tree classifier, by contrast, will predict a discrete class for an observation. A more advanced version would be the RandomForestRegressor() class, which builds a random forest of regression trees. A third option to consider might be a GradientBoostedRegressor().

I would also recommend this flow chart from Scikit that can help you choose an estimator.

neelshiv
  • 119
  • 2
-1

You are facing a regression problem: your aim is to predict the value of a continuous variable given the value of a set of input variables (these input variables can be of any type: numbers, categories, etc.) A decision tree is usually applied to classification problems, in which you are aiming at predicting a discrete value.

There are several regression methods in sklearn that you could use. The simplest ones, and maybe the ones you should start from, are linear models: http://scikit-learn.org/stable/modules/linear_model.html.

Notice that you may need to transform your input data. For instance, if one of the features is categorical, you may need to transform it into a set of binary variables. Other processes, like data normalisation, may also be advisable in order to get better results.

Edit: as stated in the comments, Decision Trees can also be applied to regression problems. However, and in my own experience, the output curve that you obtain by means of this algorithm usually has a step-wise shape that may affect the final bias (see, for instance the example in the scikit-learn docs.). I would suggest not to constraint yourself and try different types of algorithms.

Pablo Suau
  • 1,809
  • 1
  • 14
  • 20