Is it advisable to combine two datasets?

Question

I have two datasets of subjects' heart rates that were recorded in two different places (two different continents, to be exact). Both research experiments aimed to infer the subjects' emotions from how much their heart rate changes over time. I am using machine learning to predict the subjects' emotions, and I get acceptable results when training and testing on each dataset separately. However, I get even better results if I merge the two datasets.

However, I am not sure whether combining the two datasets is acceptable. Since the two datasets differ somewhat, will combining them introduce statistical bias? How should I report my findings in a journal paper?

score 4 · Answer 1 · answered Oct 01 '18 at 05:33

Although more data generally helps when training a machine learning model that generalises well, that may not be the case here.

Given that the two datasets were collected in completely different environments, they may have completely different distributions. In this case, training a model on the combined dataset may even reduce the performance of the model.

My advice would be to do some statistical analysis on each dataset independently: for example, find the mean and variance of each variable in each dataset and compare them. If the analysis shows that the two datasets have fairly similar distributions (I’ll leave the definition of "fairly similar" up to you), then it should be OK to combine them to train a model.
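As a rough illustration of that comparison, here is a minimal sketch using only the Python standard library. The heart-rate values and the names `site_a`, `site_b` and `compare_feature` are invented for the example; Welch's t statistic is just one possible way to put the difference in means on a common scale, not something prescribed above.

```python
from math import sqrt
from statistics import mean, variance


def compare_feature(a, b):
    """Summarise one feature as measured in two datasets."""
    mean_a, mean_b = mean(a), mean(b)
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1)
    # Welch's t statistic: difference in means divided by its standard
    # error, without assuming the two variances are equal.
    t = (mean_a - mean_b) / sqrt(var_a / len(a) + var_b / len(b))
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "var_a": var_a,
        "var_b": var_b,
        "variance_ratio": var_a / var_b,
        "welch_t": t,
    }


# Hypothetical heart rates (bpm), invented for illustration only.
site_a = [62, 70, 68, 75, 66, 71, 69, 73]
site_b = [64, 72, 67, 77, 65, 70, 71, 74]

summary = compare_feature(site_a, site_b)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A t statistic close to zero and a variance ratio close to one would be weak evidence that this feature is distributed similarly at both sites; you would repeat this for every feature and decide for yourself what counts as "fairly similar".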

score 3 · Accepted Answer · answered Sep 30 '18 at 17:42

If you add ‘continent’ or ‘location’ as a feature for the model, then you will be able to control for potential bias while still getting the benefit of the additional data.
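A minimal sketch of what that could look like, assuming each record is a simple dictionary. The field names (`hr_change`, `emotion`, `location`), the site labels and all values are hypothetical and only serve the illustration.

```python
def add_source(rows, location):
    """Return copies of the rows with a 'location' field attached."""
    return [{**row, "location": location} for row in rows]


# Hypothetical records; field names and values are invented for illustration.
dataset_1 = [
    {"hr_change": 12.0, "emotion": "calm"},
    {"hr_change": 25.5, "emotion": "stressed"},
]
dataset_2 = [
    {"hr_change": 9.5, "emotion": "calm"},
    {"hr_change": 30.0, "emotion": "stressed"},
]

combined = add_source(dataset_1, "site_1") + add_source(dataset_2, "site_2")

# One-hot encode the location so a numeric model can use it as a feature.
locations = sorted({row["location"] for row in combined})
features = [
    [row["hr_change"]] + [1.0 if row["location"] == loc else 0.0 for loc in locations]
    for row in combined
]
labels = [row["emotion"] for row in combined]

print(features)
print(labels)
```

The model then sees where each sample came from and can learn a site-specific offset instead of silently mixing the two populations.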

Super_John

Aditya · Answer 3 · 2018-10-01T03:57:15.143

Adding to what @Super_John said: if you add the continent as a feature, you can probably add at least two more features as well:

The Latitude
The Longitude

Also add a temporary column to indicate the source (e.g. 1 for the first DataFrame, 2 for the second, and so on), so that the points can be colored by source in a k-means plot.

Then we can run k-means clustering to see whether the values from the two sources overlap or not (i.e. look at it in an unsupervised way).

(The analogy is that you can cluster the time of day (24 hours) in a cyclic fashion, by mapping it to $\sin(x)$ and $\cos(x)$ and then clustering on those.)
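To make the clustering idea concrete, here is a small self-contained sketch. It uses a plain Lloyd's k-means written by hand with a deterministic farthest-point start (an assumption made so the example is reproducible; a library implementation would normally be used instead), and it counts points per cluster and source rather than plotting colors. All data points are invented.

```python
from collections import Counter
from math import dist


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from the centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment


# Hypothetical (baseline heart rate, heart-rate change) pairs, invented.
points = [
    (65, 10), (67, 12), (66, 11), (90, 30), (92, 28), (91, 31),  # source 1
    (64, 11), (68, 9), (89, 29), (93, 32),                       # source 2
]
source = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]

clusters = kmeans(points, k=2)
crosstab = Counter(zip(clusters, source))
for (cluster, src), count in sorted(crosstab.items()):
    print(f"cluster {cluster}, source {src}: {count} points")
```

If every cluster contains points from both sources, as in this toy data, the two datasets overlap in feature space. If the clusters line up with the source column instead, the datasets are separable by origin and merging them deserves more caution.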

Have a look at this answer, Features Selection, Extraction

score 2 · Answer 4 · answered Sep 30 '18 at 17:28

Yes, usually with ML, the more data you have, the better the results! Of course, mixing data from different populations is risky, but if it works you are on the right path.

Using more data helps the model generalise during training. So, if you are able to test your model on samples from both populations and you obtain good results, you can do it.

score 1 · Answer 5 · answered Oct 03 '18 at 07:24

To add to this discussion, a proper evaluation will tell you quite a bit, and can be used to present the work:

Create a test set for dataset 1.
Create a test set for dataset 2.
Train one model using only dataset 1, one using only dataset 2, and one using a combination of datasets 1 and 2, then evaluate their performance on both test sets.
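The three steps above can be sketched as follows. The "model" here is a toy nearest-centroid classifier on a single feature, chosen only to keep the example self-contained; the data, the helper names (`fit`, `predict`, `accuracy`, `split`) and the fixed split are all assumptions for illustration, and on real data you would shuffle before splitting and use your actual model and metric.

```python
from statistics import mean


def fit(train):
    """Nearest-centroid 'model': the mean feature value for each label."""
    labels = sorted({label for _, label in train})
    return {label: mean(x for x, y in train if y == label) for label in labels}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def accuracy(model, test):
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in test)


def split(data, n_test):
    """Hold out the first n_test rows as a test set (shuffle first on real data)."""
    return data[n_test:], data[:n_test]


# Hypothetical (heart-rate change, emotion) pairs, invented for illustration.
dataset_1 = [(10, "calm"), (28, "stressed"), (12, "calm"),
             (30, "stressed"), (9, "calm"), (27, "stressed")]
dataset_2 = [(14, "calm"), (33, "stressed"), (13, "calm"),
             (35, "stressed"), (15, "calm"), (31, "stressed")]

train_1, test_1 = split(dataset_1, n_test=2)
train_2, test_2 = split(dataset_2, n_test=2)

models = {
    "dataset 1 only": fit(train_1),
    "dataset 2 only": fit(train_2),
    "combined": fit(train_1 + train_2),
}
results = {
    name: {"test 1": accuracy(m, test_1), "test 2": accuracy(m, test_2)}
    for name, m in models.items()
}
for name, scores in results.items():
    print(name, scores)
```

The resulting 3 x 2 grid (three training regimes, two test sets) is what you would compare: the combined model is only worth reporting if it does at least as well as each single-dataset model on that dataset's own test set.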

If the combined model is significantly better than the separate models, you have something, and I think you can report as such in a possible publication. Of course, you will still have to motivate which machine learning model you use, your performance metric of interest, how you conduct cross-validation, ...

score 1 · Answer 6 · answered Oct 03 '18 at 09:14

Before I attempt to answer your questions, I will lay out what I have understood.

Scenario: Two datasets with heart rate of subjects recorded in two different continents are available.

Aim: Find the subjects' emotions based on how much their heart rate changes over time.

Objective: Classify subjects' emotions

Noted:

Results are acceptable when training and testing on each dataset separately.
Assume that results would improve upon combining the two datasets.

Questions:

Is combining the two datasets acceptable?

If the subjects on the two continents are comparable, then there shouldn't be a problem in combining the datasets. The set of emotions is pretty much the same across comparable subjects.

As you are combining two somehow different datasets, will it create statistical bias?

As long as the subjects in the two datasets are comparable, combining them should improve your results because of the additional data.

How should you report your findings in a journal paper?

You could perform a hypothesis test (e.g. ANOVA) on the two samples.
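As a sketch of what such a test computes, here is the one-way ANOVA F statistic written out by hand for one variable measured in both datasets. The function name `one_way_anova_f` and the heart-rate values are invented for the example; which variable to test is left open above, so treat this as one possible reading.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F statistic for one-way ANOVA: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical heart rates (bpm) from the two datasets, invented for illustration.
group_1 = [62, 70, 68, 75, 66, 71, 69, 73]
group_2 = [64, 72, 67, 77, 65, 70, 71, 74]

f_stat = one_way_anova_f(group_1, group_2)
print(f"F = {f_stat:.4f} with (1, {len(group_1) + len(group_2) - 2}) degrees of freedom")
```

Turning F into a p-value requires the F distribution with those degrees of freedom, which the standard library does not provide; `scipy.stats.f_oneway` returns both the statistic and the p-value if SciPy is available. With only two groups this is equivalent to a pooled two-sample t-test.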
