0

In Support Vector Machines, when used for sentiment analysis, text gets converted into a set of data points. How does this happen, usually?

logc

2 Answers

2

Text can be converted to data via concept clusters (after stemming and stop-word removal), or to counts (frequencies) via n-grams. Character n-grams are tabulations of how often each 1-gram (the letters a through z) occurs in each document, together with counts of 2-grams (aa through zz), 3-grams (aaa through zzz), and so on up to about 5-grams (aaaaa through zzzzz). Beyond 5-grams, the data become sparse and less informative. A dataset can thus be constructed in which rows represent documents and columns represent n-grams; each value is the total number of occurrences of that n-gram in that document.
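As a rough illustration of the document-by-n-gram table described above, here is a minimal sketch using only the Python standard library. The function names `char_ngram_counts` and `build_matrix`, the sample documents, and the choice to keep n-grams inside word boundaries are all assumptions made for this example, not part of the original answer.

```python
from collections import Counter

def char_ngram_counts(text, max_n=3):
    """Count character n-grams (n = 1..max_n) over the letters a-z in one document."""
    # Lowercase and replace anything outside a-z with a space, so that
    # n-grams never span word boundaries (an assumption of this sketch).
    words = "".join(c if "a" <= c <= "z" else " " for c in text.lower()).split()
    counts = Counter()
    for word in words:
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

def build_matrix(documents, max_n=3):
    """Return (columns, rows): one row per document, one column per observed n-gram."""
    per_doc = [char_ngram_counts(doc, max_n) for doc in documents]
    # Only n-grams that actually occur get a column; unseen ones would be all zeros.
    columns = sorted(set().union(*per_doc))
    rows = [[counts[gram] for gram in columns] for counts in per_doc]
    return columns, rows

# Hypothetical two-document corpus.
docs = ["good movie", "bad movie"]
columns, rows = build_matrix(docs, max_n=2)
```

Each entry of `rows` is the count vector for one document, aligned with `columns`. Raising `max_n` toward 5 adds columns quickly, which is where the sparsity mentioned above comes from.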

FYI: character n-grams have proven to be one of the most effective techniques for identifying which language a text is written in.

Regarding SVMs, focus on the SVM literature.

2

Well, strictly speaking the text itself doesn't get converted into data points. Let's say we are doing sentence-level opinion mining: features are extracted from each sentence. Which features to use depends on the case. A common choice is the bag-of-words model, in which each distinct word is a feature and the value of that feature is the number of times the word occurs in the sentence. Those frequencies are your data points.
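To make the bag-of-words idea concrete, here is a small sketch in plain Python. The function name `bag_of_words`, the simple regex tokenizer, and the two sample sentences are assumptions for illustration only; a real pipeline would likely use a more careful tokenizer.

```python
import re
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, vectors): one word-frequency vector per sentence."""
    # Naive tokenizer (an assumption of this sketch): lowercase runs of letters/apostrophes.
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    # Every distinct word across all sentences becomes one feature.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Feature value = number of times the word occurs in this sentence.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical example sentences.
sentences = ["The film was good, really good.", "The film was bad."]
vocabulary, vectors = bag_of_words(sentences)
# vocabulary == ['bad', 'film', 'good', 'really', 'the', 'was']
# vectors    == [[0, 1, 2, 1, 1, 1], [1, 1, 0, 0, 1, 1]]
```

Each vector, paired with a sentiment label for its sentence, is one training example that could then be handed to an SVM implementation of your choice.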

user242782