I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the documents in my dataset, I'm looking for out-of-domain data that's a bit more "informal" than the news articles and journal entries that many of today's state-of-the-art named entity recognition systems are trained on.

Any recommendations? So far I've only been able to locate 50k tokens from Twitter published here.

Ethan
Madison May

3 Answers

As I understand it, these are the properties that you're seeking in a sample dataset:

  1. Text data
  2. It should be informal, i.e. contain typos and slang and generally not be professionally edited
  3. Something other than Twitter (I don't blame you; Twitter is a useful yet heavily overused example data source in text mining)

Here are some recommendations:

  1. Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
  2. microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
  3. Amazon Commerce reviews dataset from UCI
  4. Within the bag-o-words dataset, try using the Enron emails
  5. The Twenty Newsgroups dataset
  6. This nice collection of SMS spam
  7. You can always scrape (extract) your own text data from the Internet; I'm not sure which language or statistical package you're using, but XPath-based packages are available in both R (rvest, scrapeR, etc.) and Python to accomplish this
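
To make the last suggestion concrete, here is a minimal sketch of XPath-style extraction in Python. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and only parses well-formed markup. The sample page, the `comment` class name, and the `extract_comments` helper are all made up for illustration. Real pages are rarely well-formed XML, so in practice you would probably need a more lenient HTML parser (for example, a third-party package such as lxml).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a downloaded page.
SAMPLE_PAGE = """<html><body>
<div class="post"><p class="comment">omg cant wait 2 see u in nyc!!</p></div>
<div class="post"><p class="comment">lol john said hes moving to berlin</p></div>
<p class="footer">Copyright notice</p>
</body></html>"""


def extract_comments(markup):
    """Return the text of every <p class="comment"> element in the markup."""
    root = ET.fromstring(markup)
    # ElementTree's limited XPath: any descendant <p> whose class is "comment".
    nodes = root.findall(".//p[@class='comment']")
    # itertext() also collects text nested inside child tags such as <b>.
    return ["".join(node.itertext()).strip() for node in nodes]


comments = extract_comments(SAMPLE_PAGE)
for text in comments:
    print(text)
```

Text collected this way is unlabeled, so it would still need entity annotations before it could be used to train an NER model.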
Hack-R

Check these:

Repository of Test Domains for Information Extraction: http://www.isi.edu/info-agents/RISE/repository.html

DBpedia: http://wiki.dbpedia.org/Downloads32 (mirror)

Updated links:

http://www.isi.edu/integration/RISE/

https://github.com/dbpedia/extraction-framework/wiki/The-DBpedia-Data-Set

Franck Dernoncourt
Sreejithc321

Some of the sources that I have used:

I think these datasets will be of great help for your task.

Gyan Ranjan