I am building a model to predict when a given customer is likely to purchase at our store again (a time-between-orders problem). My current approach is as follows:
- Targets are fortnightly bins, e.g. label 0 is 0 to 14 days, label 1 is 14 to 28 days, etc. Anything above 56 days is label 4. We don't need any more precision than that.
- I start by filtering out customers who didn't shop a minimum number of times with us, as we need their history for creating the target and for feature engineering (e.g. people who only have one single purchase with us don't have a gap between purchases for training).
- I am ranking their purchases in descending order of dates for each customer (rank 1 is a client's latest purchase, rank 2 the one before, etc.).
- The difference in days between ranks 1 and 2, binned into labels described above, is the target of the model.
- Features are then engineered using information from the rank 2 purchase and the client's history at that point in time. The model gets information about the rank 2 purchase itself, such as the number of items in the basket and the total spend of that purchase. It also gets cumulative features as of that point in time, i.e. ignoring the rank 1 row (for instance, the client's average total spend at the time of the rank 2 purchase, excluding the rank 1 spend to avoid data leakage). The average gap between purchases (again ignoring rank 1) is also included.
- The model is trained on the data described above, i.e. features of the rank 2 purchase plus cumulative features at that point in time. This data is split into train/test sets, and the model is trained and evaluated. In the inference pipeline, the model then uses the rank 1 purchase information plus cumulative features at that point in time to predict the actual next purchase date bin for each client.
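To make the setup above concrete, here is a minimal pandas sketch of the ranking, target binning and leakage-free cumulative features. The column names (`customer_id`, `order_date`, `total_spend`), the toy data and the `MIN_ORDERS` threshold are assumptions for illustration only, and the handling of gaps that fall exactly on a bin boundary is a guess.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2023-01-01", "2023-01-20", "2023-03-30",
        "2023-02-01", "2023-02-10", "2023-02-15",
    ]),
    "total_spend": [20.0, 35.0, 50.0, 10.0, 12.0, 99.0],
})

MIN_ORDERS = 2  # assumed minimum number of purchases per customer

# Keep only customers with enough purchase history.
n_orders = orders.groupby("customer_id")["order_date"].transform("size")
orders = orders[n_orders >= MIN_ORDERS].copy()

# Rank 1 = latest purchase, rank 2 = the one before, etc.
orders = orders.sort_values(["customer_id", "order_date"], ascending=[True, False])
orders["rank"] = orders.groupby("customer_id").cumcount() + 1

rank1 = orders[orders["rank"] == 1].set_index("customer_id")
rank2 = orders[orders["rank"] == 2].set_index("customer_id")

# Target: days between rank 2 and rank 1, binned in fortnights, capped at label 4.
# Assumption: a gap of exactly 14 days falls into label 1, 28 into label 2, etc.
gap_days = (rank1["order_date"] - rank2["order_date"]).dt.days
target = np.minimum(gap_days // 14, 4).astype(int)

# Cumulative feature as of rank 2: the rank 1 row is excluded to avoid leakage.
history = orders[orders["rank"] >= 2]
avg_spend = history.groupby("customer_id")["total_spend"].mean()

train = pd.DataFrame({
    "basket_spend": rank2["total_spend"],  # feature of the rank 2 purchase itself
    "avg_spend_to_date": avg_spend,        # cumulative feature, rank 1 excluded
    "target": target,
})
print(train)
```

For inference, the same feature code would be run with the rank 1 row as the reference purchase and the full history included, since nothing later exists to leak.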
I'm trying to improve the performance of this model, and features about recent behaviour could prove relevant, e.g. how many times someone visited our website in the last week, or in the week prior to their purchase. However, I am having trouble finding a way of including this type of feature given my approach above. The issue is:
- If we look at website visits last week, this would be a client-level feature that would be the same for both ranks 1 and 2. Inference performance seems much poorer than on the test dataset (tested by applying the approach to data that is 3 months old and comparing the predictions with the clients who actually returned).
- Using frequency of website visits the week before rank 2 purchase requires knowing the purchase date, so this wouldn't work for the inference process either as it would require knowledge of the target.
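For reference, the two candidate features differ only in the reference date used for the 7-day window. The sketch below assumes a hypothetical `visits` table with `customer_id` and `visit_date` columns; the helper `visits_in_window` and all dates are illustrative, not part of my actual pipeline.

```python
import pandas as pd

# Hypothetical website-visit log; names and values are assumptions.
visits = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "visit_date": pd.to_datetime(
        ["2023-01-15", "2023-01-18", "2023-03-28", "2023-02-09"]
    ),
})

def visits_in_window(visits, ref_dates, days=7):
    """Count visits per customer in the window [ref_date - days, ref_date).

    `ref_dates` is a Series of timestamps indexed by `customer_id`.
    """
    ref = ref_dates.rename("ref_date").reset_index()
    merged = visits.merge(ref, on="customer_id")
    in_window = (merged["visit_date"] < merged["ref_date"]) & (
        merged["visit_date"] >= merged["ref_date"] - pd.Timedelta(days=days)
    )
    counts = merged[in_window].groupby("customer_id").size()
    # Customers with no visits in the window get a count of 0.
    return counts.reindex(ref_dates.index, fill_value=0)

customers = pd.Index([1, 2], name="customer_id")

# Option 1: one snapshot date for everyone ("visits last week").
snapshot = pd.Series(pd.Timestamp("2023-04-01"), index=customers)
visits_last_week = visits_in_window(visits, snapshot)

# Option 2: window anchored on each customer's own purchase date
# (assumed rank 2 dates for illustration).
purchase_dates = pd.Series(
    pd.to_datetime(["2023-01-20", "2023-02-01"]), index=customers
)
visits_before_purchase = visits_in_window(visits, purchase_dates)
```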
Would anyone have suggestions on how to include these recent behaviour features in this model?