9

Which freely available datasets can I use to train a text classifier?

We are trying to increase user engagement by recommending the most relevant content to each user. Our idea was to classify our content against a predefined bag of words, then ask the user for feedback on a random sample of already-classified posts so that we can recommend engaging content to them.

We could then use that feedback to recommend pulses labeled with the classes the user liked. However, we found that if the predefined bag of words is not related to our content, the feature vectors will be full of zeros, and the categories may not be relevant to our content either. For those reasons we tried another approach: clustering our content instead of classifying it.
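To make the zero-vector problem concrete, here is a minimal sketch using only the Python standard library. The posts, the predefined word list, and the helper names (`tokenize`, `bag_of_words`, `zero_fraction`) are all made up for illustration and are not our real data or code. It compares a vocabulary unrelated to the content with one built from the content's own most frequent tokens.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def zero_fraction(vectors):
    """Share of entries that are zero across all vectors."""
    entries = [value for vector in vectors for value in vector]
    return sum(1 for value in entries if value == 0) / len(entries)


# Hypothetical posts standing in for the site's own content.
posts = [
    "New GPU drivers improve deep learning training speed",
    "Tips for tuning learning rate when training neural networks",
    "Benchmarking GPU memory usage during training",
]

# A predefined vocabulary that is mostly unrelated to the content (assumed).
predefined_vocab = ["football", "election", "recipe", "weather", "training"]

# A vocabulary built from the content itself: its five most frequent tokens.
token_counts = Counter(token for post in posts for token in tokenize(post))
content_vocab = [word for word, _ in token_counts.most_common(5)]

predefined_vectors = [bag_of_words(post, predefined_vocab) for post in posts]
content_vectors = [bag_of_words(post, content_vocab) for post in posts]

print("zeros with predefined vocabulary:", zero_fraction(predefined_vectors))
print("zeros with content vocabulary:", zero_fraction(content_vectors))
```

On real content the vocabulary would be much larger and would normally be filtered for stop words, but the same comparison can show how poorly a borrowed word list fits before any classifier is trained on it.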

Thanks :)

lsdr
Abdelmawla

4 Answers

14

Some standard datasets for text classification are 20 Newsgroups, Reuters (with 8 and 52 classes) and WebKB. You can find all of them here.

Debasis
7

This is one of the most widely used test collections for text categorization research (links below). I've used it many times. Enjoy your exploration :)

http://www.daviddlewis.com/resources/testcollections/reuters21578/ or http://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

Hammam
5

UC Irvine makes a number of datasets freely available here. Among them are a few dozen text datasets that might help with your task.

These are fairly generic datasets, so depending on your purpose they should not be the only data you train on; otherwise your model may work, but it will not produce quality results.

lsdr
1

Apart from the suggestions above, there is an extremely useful PDF, "Benchmarking Text Collections for Classification and Clustering Tasks", which describes various datasets along with benchmarks for testing models. It includes the 20NG collection, Reuters, and many of the datasets suggested above. I hope it helps!

Hima Varsha