Understanding text conversion into SVM input

Question

In Support Vector Machines, when used for sentiment analysis, text gets converted into a set of data points. How does this happen, usually?

score 2 · Answer 1 · answered May 04 '15 at 00:22

Text can be converted to data via concept clusters (after stemming and stop-word removal), or to counts (frequencies) via n-grams. Character n-grams are tabulations of how often each character sequence of length n occurs in a document: 1-gram counts for the single letters (a through z), 2-gram counts (aa to zz), 3-gram counts (aaa through zzz), and so on up to about 5-grams (aaaaa through zzzzz). Beyond 5-grams, the data tends to be sparse and less informative. A dataset can then be constructed in which rows represent documents and columns represent n-grams. Each data value is the total number of occurrences of that n-gram in that document.
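To make the document-by-n-gram table concrete, here is a minimal sketch in plain Python. The function names (`char_ngram_counts`, `build_matrix`), the two toy documents, and the choice to count n-grams only within words (not across spaces) are my own assumptions for illustration, not a standard API.

```python
import re
from collections import Counter

def char_ngram_counts(text, max_n=3):
    """Count character n-grams (n = 1..max_n) over the letters a-z.

    Assumption: n-grams are taken inside each word, so they never
    span a space or punctuation mark.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

def build_matrix(documents, max_n=3):
    """Return (vocabulary, matrix): one row per document, one column per n-gram."""
    per_doc = [char_ngram_counts(doc, max_n) for doc in documents]
    vocabulary = sorted(set().union(*per_doc))
    # A Counter returns 0 for n-grams that a document does not contain.
    matrix = [[counts[gram] for gram in vocabulary] for counts in per_doc]
    return vocabulary, matrix

# Hypothetical toy documents, for illustration only.
docs = ["good movie", "bad movie"]
vocab, X = build_matrix(docs, max_n=2)
for doc, row in zip(docs, X):
    print(doc, dict(zip(vocab, row)))
```

Each row of `X` is one data point. In practice the vocabulary grows quickly with n, which is the sparsity issue mentioned above.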

FYI - character n-grams have proven to be a very effective technique for identifying which language a document is written in.

Regarding SVMs, focus on the SVM literature.

score 2 · Answer 2 · answered May 22 '15 at 22:39

Well, the text itself doesn't get converted into data points. Let's say we are doing sentence-level opinion mining. Features are extracted from each sentence, and which features to use depends on the case. A common choice is the bag-of-words model, in which the features are the distinct words in a sentence and the value of each feature is the number of times that word occurs in the sentence. Those frequencies are your data points.
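A rough sketch of the bag-of-words idea in plain Python, assuming a very simple lowercase tokenizer. The names `tokenize` and `bag_of_words`, the example sentences, and the labels are all hypothetical and only there to show the shape of the data.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens (simplistic)."""
    return re.findall(r"[a-z']+", sentence.lower())

def bag_of_words(sentences):
    """Return (vocabulary, vectors): one word-frequency vector per sentence."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocabulary = sorted(set().union(*counts))
    # Words missing from a sentence get a frequency of 0.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors

# Hypothetical labelled sentences: +1 = positive, -1 = negative.
sentences = ["good good film", "bad film"]
labels = [1, -1]

vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)  # ['bad', 'film', 'good']
print(vectors)     # [[0, 1, 2], [1, 1, 0]]
```

The rows of `vectors`, paired with `labels`, are what you would hand to an SVM trainer, for example `fit(X, y)` on scikit-learn's `sklearn.svm.LinearSVC` if you use that library.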
