
I have a time series of data in the following form:

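| purchase_date | customer_id | num_purchases | churned |
|---------------|-------------|---------------|---------|
| 2018-10-31    | id1         | 39            | 0       |
| 2018-11-31    | id1         | 0             | 0       |
| 2019-01-31    | id1         | 6             | 1       |
| 2019-03-31    | id2         | 300           | 0       |
| 2018-04-31    | id2         | 2             | 1       |
| ...           |             |               |         |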

I grouped the data by month and summed num_purchases per month. The churned column marks the month in which a customer churned, so id1, for example, churned in January. Before this, to label who had churned, we sampled customers based on a 2-month inactivity period from the churn date. I need to predict whether a user is going to churn within 2 months from now, and I am not sure what the best approach is.

  • Q1: Should I be grouping customers on a monthly basis, as I am doing now, or do I have to group them on a 2-month basis since that is how they were labeled?
  • Q2: Also, how do I model this? Do I keep customer_id as a feature of the model or not? Is the gap in dates for each customer relevant, and if so, how should I deal with it? The dates repeat for different users: should I create an index out of the date, even though it won't be unique, or out of customer_id?
  • Q3: If I need to predict whether the user is going to churn by the end of the year, for example, or in the next 6 months, would that change how I group/arrange my data and model this?

I plan to add more features to this dataframe (both categorical and numerical).

Michael

1 Answer


OK, let me try to dissect your questions:

Q1: Should I be grouping customers on a monthly basis, as I am doing now, or do I have to group them on a 2-month basis since that is how they were labeled?

This depends a bit on what your goal is and what kind of features you want to use. With the limited information in your question, I would go for a daily forecast covering the next two months (that is, for each day you predict the following two months at daily resolution). The reason is that otherwise you might lose some inactivity "features" that are implicitly part of your data. Alternatively, you can keep grouping your data and create features that carry that information anyway, for example activity_in_last_5d, activity_in_last_10d, etc.
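To make the activity_in_last_Nd idea concrete, here is a minimal sketch with pandas. It assumes you still have the un-aggregated daily purchase log; the sample rows and the helper name add_activity_features are made up for illustration, only the column names come from your question.

```python
import pandas as pd

# Hypothetical daily purchase log (values invented for illustration).
daily = pd.DataFrame({
    "customer_id": ["id1", "id1", "id1", "id2", "id2"],
    "purchase_date": pd.to_datetime(
        ["2018-10-01", "2018-10-03", "2018-10-09", "2018-10-02", "2018-10-12"]
    ),
    "num_purchases": [2, 1, 4, 5, 3],
})


def add_activity_features(df, windows=(5, 10)):
    """Add one trailing-window purchase sum per customer for each window size."""
    # Sort so the rolling result lines up row by row with the frame.
    df = df.sort_values(["customer_id", "purchase_date"]).reset_index(drop=True)
    for w in windows:
        rolled = (
            df.set_index("purchase_date")
              .groupby("customer_id")["num_purchases"]
              .rolling(f"{w}D")  # time-based window that includes the current day
              .sum()
        )
        df[f"activity_in_last_{w}d"] = rolled.to_numpy()
    return df


features = add_activity_features(daily)
print(features)
```

The windows are time-based, so days without any purchase simply contribute nothing to the sum and you don't need to fill them in first.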

Q2: Also, how do I model this? Do I keep customer_id as a feature of the model or not?

customer_id doesn't look like a feature to me, but rather the key that identifies which features to feed into your model.

Is the gap in dates for each customer relevant, and if so, how should I deal with it?

You are basically asking about imputing your data (see this for more: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779). Gaps can cause problems in your modeling: some models (for example ARIMA for time series) won't work at all if gaps aren't handled. For your use case, I think carrying the last known value forward across a gap should work fine, since a gap means your customer didn't churn on that day.
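A rough sketch of the "last known value" approach on monthly data could look like the following. It assumes month-end dates (I used 2018-11-30 in the sample, since November has no 31st), and the helper name fill_monthly_gaps is made up for illustration.

```python
import pandas as pd

# Hypothetical monthly rows for one customer; December 2018 is missing.
monthly = pd.DataFrame({
    "customer_id": ["id1", "id1", "id1"],
    "purchase_date": pd.to_datetime(["2018-10-31", "2018-11-30", "2019-01-31"]),
    "num_purchases": [39, 0, 6],
    "churned": [0, 0, 1],
})


def fill_monthly_gaps(df):
    """Insert missing month-end rows per customer and carry the last values forward."""
    filled = []
    for customer, group in df.groupby("customer_id"):
        group = (
            group.drop(columns="customer_id")
                 .set_index("purchase_date")
                 .sort_index()
        )
        # Every month end between the customer's first and last observation.
        full_range = pd.date_range(
            group.index.min(), group.index.max(), freq=pd.offsets.MonthEnd()
        )
        group = group.reindex(full_range).ffill()
        # Reindexing introduces NaN, which turns the columns into floats.
        group[["num_purchases", "churned"]] = (
            group[["num_purchases", "churned"]].astype(int)
        )
        group["customer_id"] = customer
        filled.append(group.rename_axis("purchase_date").reset_index())
    return pd.concat(filled, ignore_index=True)


filled = fill_monthly_gaps(monthly)
print(filled)
```

If a missing month really means "no purchases", filling num_purchases with 0 instead of the previous value may be the more honest choice; that is a modeling decision I can't make from the question alone.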

The dates repeat for different users: should I create an index out of the date, even though it won't be unique, or out of customer_id?

I would probably index by customer_id and create features based on the date. Some seasonality features might be interesting as well (for example, how many people churned on the same day last year?).
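As an illustration, this is one way to use customer_id as the index and turn the date into features. The sample rows and the derived column names (month, days_since_prev) are my own choices, not something your data has to follow.

```python
import pandas as pd

# Hypothetical monthly rows (values invented for illustration).
df = pd.DataFrame({
    "customer_id": ["id1", "id1", "id1", "id2", "id2"],
    "purchase_date": pd.to_datetime(
        ["2018-10-31", "2018-11-30", "2019-01-31", "2019-03-31", "2019-04-30"]
    ),
    "num_purchases": [39, 0, 6, 300, 2],
})

df = df.sort_values(["customer_id", "purchase_date"])

# Calendar feature: month of the year, a simple handle on seasonality.
df["month"] = df["purchase_date"].dt.month

# Gap feature: days since the same customer's previous row (NaN for the first row).
df["days_since_prev"] = df.groupby("customer_id")["purchase_date"].diff().dt.days

# customer_id becomes the (non-unique) index, so it is a key rather than a model input.
features = df.set_index("customer_id")
X = features[["num_purchases", "month", "days_since_prev"]]
print(X)
```

Note that the gap between dates ends up as a feature of its own here, so the model can still see inactivity even though the date is not the index.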

Q3: If I need to predict whether the user is going to churn by the end of the year, for example, or in the next 6 months, would that change how I group/arrange my data and model this?

You would have to change the size of the period you are predicting for. Changing features, grouping, etc. might be useful as well, but that really depends on which features turn out to be important and how your model is going to work.
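One way to make the prediction period a parameter is to derive the target from the churn month, so that switching from 2 to 6 months only changes one argument. This is a sketch under my own assumptions: the helper name label_churn_within and the sample rows are hypothetical, and I treat "churn within N months" as "the churn month is 1 to N months after the row's month".

```python
import pandas as pd

# Hypothetical monthly rows; id1 churns in January 2019, id2 never churns.
monthly = pd.DataFrame({
    "customer_id": ["id1", "id1", "id1", "id2", "id2"],
    "purchase_date": pd.to_datetime(
        ["2018-10-31", "2018-11-30", "2019-01-31", "2019-03-31", "2019-04-30"]
    ),
    "churned": [0, 0, 1, 0, 0],
})


def label_churn_within(df, horizon_months):
    """Flag rows whose customer churns 1..horizon_months months after that row."""
    out = df.copy()
    # Running month counter, so month differences are plain subtraction.
    out["month_idx"] = (
        out["purchase_date"].dt.year * 12 + out["purchase_date"].dt.month
    )
    churn_month = (
        out.loc[out["churned"] == 1].groupby("customer_id")["month_idx"].min()
    )
    # NaN for customers who never churn, which between() treats as False.
    months_to_churn = out["customer_id"].map(churn_month) - out["month_idx"]
    out[f"churn_within_{horizon_months}m"] = (
        months_to_churn.between(1, horizon_months).astype(int)
    )
    return out.drop(columns="month_idx")


two = label_churn_within(monthly, 2)
six = label_churn_within(monthly, 6)
print(six)
```

The features can stay the same across horizons; only the target column changes, and you can then check whether longer windows need different features.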

Philipp