
I'm new to all this and am putting together a learning project. I've decided on finding similarities between users in a data set such as the Enron Corpus (http://en.wikipedia.org/wiki/Enron_Corpus). After doing a bit of research, I also came across Dataset for Named Entity Recognition on Informal Text. So I'm not short of data or a goal; what I need is to understand the high-level techniques to get there.

One valuable comment noted that this question appears too broad. What I was hoping to find with this question was the breadth of techniques I should focus research on, not answers that are immediately implementable. Please consider vague answers as entirely appropriate!!

Expanding on the goal, I am hoping to discover which authors might have affinity toward each other, or conversely do not care much for each other. So I will definitely need to start with Named Entity Recognition and build a means to organize the documents against those entities. Beyond that, I am not so sure.

What high-level concepts should I be looking at? Thanks!

1 Answer


Since you accept vague answers: the Stanford NLP tools (NER, perhaps the POS tagger, the parsers, etc.) are strong for this kind of thing. For the machine learning itself, I would look at WEKA. It has a lot of filters, classifiers and clustering methods, including the StringToWordVector filter, which is, in my opinion, fundamental to text classification. Mostly, the topics you should be searching for are Text Categorization, Natural Language Processing and even Sentiment Analysis, if you will.
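To make the word-vector idea a bit more concrete, here is a rough sketch in plain Python of what a filter like StringToWordVector does conceptually: turn each author's text into word counts, then compare authors with cosine similarity. This is not WEKA's implementation; the function names (`word_vector`, `cosine_similarity`, `author_similarities`) and the sample messages are made up for illustration, and a real pipeline would add things like stop-word removal and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Lowercase the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def author_similarities(docs_by_author):
    """Build one vector per author and score every pair of authors."""
    vectors = {
        author: word_vector(" ".join(docs))
        for author, docs in docs_by_author.items()
    }
    authors = sorted(vectors)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(authors)
        for b in authors[i + 1:]
    }


# Made-up sample messages, purely for illustration (not from the Enron Corpus).
sample = {
    "alice": ["Please review the gas contract today", "The contract needs a review"],
    "bob": ["I will review the contract this afternoon"],
    "carol": ["Lunch on Friday? Pizza sounds great"],
}

scores = author_similarities(sample)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Authors who use similar vocabulary score closer to 1, and authors with no words in common score 0. Note that this only measures topical overlap; whether two authors like or dislike each other is closer to the Sentiment Analysis side of things.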

Henrique Nader