Dataset for Named Entity Recognition on Informal Text

Question

I'm looking for labeled datasets to train a model that extracts named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the documents in my dataset, I need out-of-domain data that's a bit more "informal" than the news articles and journal entries that many of today's state-of-the-art named entity recognition systems are trained on.

Any recommendations? So far I've only been able to locate 50k tokens from Twitter published here.

score 6 · Answer 1 · answered Mar 26 '15 at 12:44

As I understand it, these are the properties that you're seeking in a sample dataset:

Text data
It should be informal, i.e. contain typos and slang, and generally not be professionally edited
Something other than Twitter (I don't blame you; Twitter is a useful yet heavily overused example data source in text mining)

Here are some recommendations:

Emails from the SpamAssassin corpus -- note that both "ham" (non-spam) and spam datasets are available
microblogPCU data set from UCI, which is data scraped from the microblogs of Sina Weibo users -- note, the raw text data is a mix of Chinese and English (you could perform machine translation of the Chinese, filter to only English, or use it as-is)
Amazon Commerce reviews dataset from UCI
Within the bag-o-words dataset, try using the Enron emails
The Twenty Newsgroups dataset
This nice collection of SMS spam
You can always scrape (extract) your own text data from the Internet; I'm not sure which language or statistical package you're using, but XPath-based packages are available in both R (rvest, scrapeR, etc.) and Python to accomplish this
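
To make the scraping suggestion above concrete, here is a minimal sketch of XPath-style extraction in Python using only the standard library. The markup, the `post` class name, and the `extract_posts` helper are all made up for illustration. `xml.etree.ElementTree` supports only a limited XPath subset and requires well-formed markup, so for real web pages you would most likely need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """
<html><body>
  <div class="post"><p>omg just saw john at starbucks lol</p></div>
  <div class="ad"><p>Buy now!</p></div>
  <div class="post"><p>anyone going 2 the lakers game tmrw?</p></div>
</body></html>
"""


def extract_posts(markup):
    """Return the text of every <p> directly inside a <div class="post">."""
    root = ET.fromstring(markup.strip())
    # Attribute predicates like [@class='post'] are part of the XPath
    # subset that ElementTree understands.
    paragraphs = root.findall(".//div[@class='post']/p")
    return [p.text.strip() for p in paragraphs if p.text]


posts = extract_posts(PAGE)
for post in posts:
    print(post)
```

Text collected this way is unlabeled, so it would still need entity annotations before it could be used to train an NER model.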

score 3 · Answer 2 · edited May 23 '17 at 02:57


Check these:

Repository of Test Domains for Information Extraction: http://www.isi.edu/info-agents/RISE/repository.html

DBpedia: http://wiki.dbpedia.org/Downloads32 (mirror)

Updated links:

http://www.isi.edu/integration/RISE/

https://github.com/dbpedia/extraction-framework/wiki/The-DBpedia-Data-Set

edited May 23 '17 at 02:57

Franck Dernoncourt


answered Nov 27 '14 at 07:21

Sreejithc321


Gyan Ranjan · Answer 3 · 2018-08-26T18:38:16.793

0

Some of the sources that I have used:

The classic CoNLL corpus: CoNLL Dataset
One Kaggle source that is worth a try: Kaggle NER Corpus
OntoNotes Release 5.0: OntoNotes
Bio entity recognition task: Bio Entities
Another email-related dataset: Enron Email Dataset

I think these datasets will be of great help for your task.
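
Several of these corpora are distributed in a CoNLL-style column format: one token per line, the entity tag in the last column, and a blank line between sentences. Below is a rough sketch of a reader for that layout. The `read_conll` helper and the sample lines are invented for illustration and are not taken from any of the datasets above; check each corpus's documentation for its exact columns and tagging scheme.

```python
def read_conll(lines):
    """Group (token, tag) pairs into sentences from CoNLL-style lines."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: the token is the first column, the NER tag the last.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample in a two-column layout (token, tag).
SAMPLE = """\
-DOCSTART- -X- -X- O

lol O
john B-PER
is O
in O
new B-LOC
york I-LOC

c O
u O
there O
"""

sentences = read_conll(SAMPLE.splitlines())
for sentence in sentences:
    print(sentence)
```

Taking the first and last columns means the same reader also works on files that carry extra columns (such as part-of-speech tags) between the token and the entity tag.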

edited Aug 26 '18 at 18:38

answered Aug 26 '18 at 17:54

Gyan Ranjan

