
I am working on a binary classification model (healthy/diseased) based on gene expression data from different patients. As a second task, I would like to stratify these patients and find subgroups. I expect that the summary pattern of different genes within an experiment will be the strongest predictor of the outcome (differential coexpression analysis). How do I account for the importance of this grouping in my ML model if I need to follow the rule of not including IDs (in my case, experiment IDs) in a model?

I also have repeated measures of the same patients, and I hope for significant differences between some patient groups. Does that mean I should just include the patient IDs as well, pre-define some groups, or use all patient characteristics that could be interesting as features?

This is how my data is currently organized:

| experiment ID | gene | expression | patient ID | label    |
|---------------|------|------------|------------|----------|
| 1             | A    | 11         | 1234       | healthy  |
| 1             | B    | 5          | 1234       | healthy  |
| 2             | A    | 3          | 4356       | diseased |
| 2             | B    | 9          | 4356       | diseased |
| 3             | A    | 13         | 1234       | healthy  |
| 3             | B    | 6          | 1234       | healthy  |

vhio
1 Answer


I started writing this as a comment, but I realized that I have too many things to say... I'm not sure that it's a proper answer either, but hopefully it's useful:

  • I'm not really sure that I understand what the "experiment id" represents here, but relying on it as a grouping variable doesn't seem like a good idea to me: it would be used by the model as a potential explanatory variable for the target, which is probably not what you want.
  • I would definitely advise formatting all the observations for one patient as one instance (a short reshaping sketch follows this list). The model assumes that the instances are independent of each other, so it cannot use the relation between two instances which share a patient id.
  • 30 different genes as features could be perfectly fine, but it depends on how many instances you have as training data. Too few instances and/or too many features can cause overfitting, i.e. the model treating details which happen by chance in the data as real patterns. In any case there are options for this kind of problem; feature selection would be the most obvious one (the second sketch after this list includes a basic feature selection step).
  • The repeated measures for one patient are not necessarily a problem, assuming the measures always include all/most of the genes: the different sets of measures can be used as different instances. However, in this case there might be some bias in the distribution, for example if multiple measures are more common for healthy patients. A workaround would be to always include N instances for every patient, repeating the same measures if needed when a patient doesn't have several sets of measures.
  • About the different patient groups: my first intuition would be to simply train a different model for every group; this way you can observe how the models (or their predictions) differ.
  • If the goal is to find the most important causative factors, I'd recommend training a simple decision tree model: decision trees are easy to inspect and interpret, with the most discriminative features at the top/root of the tree. Don't hesitate to restrict the parameters, in particular the depth of the tree, in order to keep the result readable (see the second sketch below).
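
To illustrate the "one instance per patient" point, here is a minimal sketch of the reshaping, assuming pandas and the column names from the example table in the question; `df_long` and `df_wide` are just illustrative names. Keeping the experiment ID in the index (rather than among the features) preserves the repeated measures as separate instances without feeding any ID to the model.

```python
import pandas as pd

# Long format, as in the question: one row per (experiment, gene) measurement
df_long = pd.DataFrame({
    "experiment_id": [1, 1, 2, 2, 3, 3],
    "gene":          ["A", "B", "A", "B", "A", "B"],
    "expression":    [11, 5, 3, 9, 13, 6],
    "patient_id":    [1234, 1234, 4356, 4356, 1234, 1234],
    "label":         ["healthy", "healthy", "diseased", "diseased", "healthy", "healthy"],
})

# Wide format: one row per patient + set of measures, one column per gene.
# Repeated measures of the same patient become separate instances.
df_wide = df_long.pivot_table(
    index=["patient_id", "experiment_id", "label"],
    columns="gene",
    values="expression",
).reset_index()

X = df_wide[["A", "B"]]   # gene expression features
y = df_wide["label"]      # target; patient/experiment IDs are NOT used as features
print(df_wide)
```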
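
And here is a rough sketch of the last two points (feature selection plus a shallow, readable decision tree), assuming scikit-learn and a wide table with one row per patient/set of measures and one column per gene, as produced above. The synthetic data, `SelectKBest` with `f_classif`, and `k=10` are all just illustrative choices to adapt to your data (k must not exceed the number of gene columns).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the wide table: 100 patients x 30 genes
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 30)),
                 columns=[f"gene_{i}" for i in range(30)])
y = rng.choice(["healthy", "diseased"], size=100)

# Univariate feature selection followed by a shallow tree, kept small so it stays readable
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),        # keep the 10 most informative genes (k is a guess to tune)
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
model.fit(X, y)

# Inspect the tree: the most discriminative genes appear near the root
kept = model.named_steps["select"].get_support(indices=True)
print(export_text(model.named_steps["tree"],
                  feature_names=[X.columns[i] for i in kept]))
```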
Erwan