
I am trying to research a new topic using machine learning. The problem is that I have a very small dataset: only 5 months of data. I have been told that my dataset is too small. I want to know whether there is any way of making my dataset larger using an automated tool or algorithm that I can implement quickly, and, more importantly, whether doing so is a good idea.

My dataset also contains months with peculiar data (outliers), and its cases range from generic and standard to unexpected.

I should mention that I have already searched for similar datasets on the available websites. The dataset I am interested in is about personal budgets, incomes, expenses, and consumer behavior.

2 Answers


I know nothing about machine learning, so take the following with a pinch of salt. (I originally posted it as a comment and was encouraged to repost as an answer, so I guess it can't be terrible.)

You get more data by, well, collecting more data...

Simply using some algorithm to generate more data similar to the data you already have won't help, because it doesn't change the distribution of the data, so it doesn't do anything to address the problem that your data might not be representative of reality. For example, suppose I make a crappy painting (I can't paint at all) and I ask ten people from my family and close friends if it's any good. They all say, "Yes, David, it's wonderful" because they don't want to hurt my feelings. If you advise me to ask for more people's opinions, that's because you feel that I didn't get a proper variety of opinions. Just asking the same ten people again won't change anything: it'll just make me more confident in my bad data, because now "twenty" people have told me that my crappy painting is wonderful.

Ultimately, the same problem applies to any algorithmic approach: the best it can do is to generate more data like the data you already have, but that can't possibly have higher quality. And note that any algorithmic approach is, essentially, "use machine learning to generate more data like the data I already have, and then use machine learning to do X with all that data" so it can't be any better than just using machine learning on the original dataset.

David Richerby

There is no way around collecting more data if you want to enlarge your test set and thus make your benchmark results more meaningful.

If, however, you are worried that your training set is too small, there are two main ways to deal with that: data augmentation, which adds invariant transformations of the data (see e.g. Analysis and Optimization of Convolutional Neural Network Architectures, B.2 on page 80, for techniques used in computer vision), and using more restricted algorithms (e.g. smaller neural networks, linear regression).
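For tabular data like budgets and expenses, one simple form of augmentation is jittering: adding small, zero-mean noise to the numeric features. Here is a minimal sketch in Python; the function name, the noise scale, and the toy income/expense rows are all illustrative, and the approach only makes sense if small perturbations really are label-preserving for your problem:

```python
import numpy as np

def augment_with_jitter(X, y, n_copies=2, noise_scale=0.01, seed=0):
    """Create jittered copies of a numeric feature matrix.

    Adds zero-mean Gaussian noise scaled by each feature's standard
    deviation. This is only a reasonable augmentation if small
    perturbations preserve the label (a strong assumption).
    """
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X], [y]
    for _ in range(n_copies):
        noise = rng.normal(0.0, noise_scale * X.std(axis=0), size=X.shape)
        X_aug.append(X + noise)
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)

# Example: triple a tiny dataset of monthly (income, expenses) rows.
X = np.array([[3000.0, 2500.0], [3200.0, 2700.0], [2900.0, 3100.0]])
y = np.array([0, 0, 1])
X_big, y_big = augment_with_jitter(X, y)
print(X_big.shape, y_big.shape)  # (9, 2) (9,)
```

Note that this only resamples around the data you already have, so it inherits the representativeness problem described in the first answer.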

Also, k-fold cross-validation is a way to deal with very limited datasets.
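A minimal sketch of 5-fold cross-validation with scikit-learn; the synthetic dataset and the logistic-regression model are placeholders, so substitute your own features, labels, and estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data; replace with your own feature matrix and labels.
X, y = make_classification(n_samples=60, n_features=6, random_state=0)

# 5-fold CV: every sample is used for both training and validation,
# which squeezes more signal out of a small dataset than a single split.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```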

Martin Thoma