
Given a sentence like:

Complimentary gym access for two for the length of stay ($12 value per person per day)

What general approach can I take to identify the word gym or gym access?

William Falcon

3 Answers


A shallow Natural Language Processing (NLP) technique can be used to extract concepts from a sentence.

-------------------------------------------

Shallow NLP technique steps:

  1. Convert the sentence to lowercase

  2. Remove stopwords (these are common words in a language; words like for, very, and, of, are, etc. are common stop words)

  3. Extract n-grams, i.e., contiguous sequences of n items from the given text (by increasing n, the model can store more context)

  4. Assign a syntactic label (noun, verb, etc.) to each token

  5. Extract knowledge from the text through a semantic/syntactic analysis approach, i.e., try to retain the words that carry more weight in a sentence, such as nouns and verbs
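Steps 1–3 can be sketched in plain Python with no external libraries. Note that the stopword list below is a small illustrative subset (in practice you would use a full list from a library such as NLTK):

```python
import re

# Illustrative subset of English stopwords, not a complete list
STOPWORDS = {"for", "very", "and", "of", "are", "the"}

def extract_ngrams(sentence, n=1):
    # Step 1: convert to lowercase
    text = sentence.lower()
    # Tokenize on letters, digits, and "$" (crudely strips other punctuation)
    tokens = re.findall(r"[a-z0-9$]+", text)
    # Step 2: remove stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 3: extract contiguous n-grams
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s = "Complimentary gym access for two for the length of stay ($12 value per person per day)"
print(extract_ngrams(s, 1))
# ['complimentary', 'gym', 'access', 'two', 'length', 'stay', '$12',
#  'value', 'per', 'person', 'per', 'day']
print(extract_ngrams(s, 2))
```

Steps 4–5 (tagging and filtering) still require a PoS tagger, as shown further below.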

-------------------------------------------

Let's examine the results of applying the above steps to your given sentence: Complimentary gym access for two for the length of stay ($12 value per person per day).

1-gram Results: gym, access, length, stay, value, person, day

Summary of step 1 through 4 of shallow NLP:

1-gram         PoS Tag   Stopword?   PoS Tag Description
Complimentary  NNP       No          Proper noun, singular
gym            NN        No          Noun, singular or mass
access         NN        No          Noun, singular or mass
for            IN        Yes         Preposition or subordinating conjunction
two            CD        No          Cardinal number
for            IN        Yes         Preposition or subordinating conjunction
the            DT        Yes         Determiner
length         NN        No          Noun, singular or mass
of             IN        Yes         Preposition or subordinating conjunction
stay           NN        No          Noun, singular or mass
($12           CD        No          Cardinal number
value          NN        No          Noun, singular or mass
per            IN        No          Preposition or subordinating conjunction
person         NN        No          Noun, singular or mass
per            IN        No          Preposition or subordinating conjunction
day)           NN        No          Noun, singular or mass

Step 5: Retaining only the nouns/verbs, we end up with: gym, access, length, stay, value, person, day
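Step 5 can be sketched as a simple filter over (token, tag) pairs. The pairs below are copied from the tagger output shown in the table above; proper nouns (NNP) are excluded here to match the result list:

```python
# (token, tag) pairs from the PoS tagger output above
tagged = [
    ("complimentary", "NNP"), ("gym", "NN"), ("access", "NN"),
    ("for", "IN"), ("two", "CD"), ("the", "DT"),
    ("length", "NN"), ("of", "IN"), ("stay", "NN"), ("$12", "CD"),
    ("value", "NN"), ("per", "IN"), ("person", "NN"), ("day", "NN"),
]

def keep(tag):
    # Common nouns (NN, NNS) and any verb tag (VB, VBD, VBG, ...)
    return tag in ("NN", "NNS") or tag.startswith("VB")

concepts = [tok for tok, tag in tagged if keep(tag)]
print(concepts)
# ['gym', 'access', 'length', 'stay', 'value', 'person', 'day']
```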

Let's increase n to store more context, and remove stopwords.

2-gram Results: complimentary gym, gym access, length stay, stay value

Summary of step 1 through 4 of shallow NLP:

2-gram             PoS Tag
access two         NN CD
complimentary gym  NNP NN
gym access         NN NN
length stay        NN NN
per day            IN NN
per person         IN NN
person per         NN IN
stay value         NN NN
two length         CD NN
value per          NN IN

Step 5: Retaining only the noun/verb combinations, we end up with: complimentary gym, gym access, length stay, stay value
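For n-grams, the same filter applies per token: a 2-gram survives only if every token in it is tagged as a noun or verb. A minimal sketch, using the (bigram, tags) pairs from the summary above:

```python
# (bigram, tags) pairs from the 2-gram summary above
bigrams = [
    ("access two", "NN CD"), ("complimentary gym", "NNP NN"),
    ("gym access", "NN NN"), ("length stay", "NN NN"),
    ("per day", "IN NN"), ("per person", "IN NN"),
    ("person per", "NN IN"), ("stay value", "NN NN"),
    ("two length", "CD NN"), ("value per", "NN IN"),
]

def noun_or_verb(tag):
    # Penn Treebank noun tags start with "NN", verb tags with "VB"
    return tag.startswith("NN") or tag.startswith("VB")

concepts = [bg for bg, tags in bigrams
            if all(noun_or_verb(t) for t in tags.split())]
print(concepts)
# ['complimentary gym', 'gym access', 'length stay', 'stay value']
```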

3-gram Results: complimentary gym access, length stay value, person per day

Summary of step 1 through 4 of shallow NLP:

3-gram                    PoS Tag
access two length         NN CD NN
complimentary gym access  NNP NN NN
gym access two            NN NN CD
length stay value         NN NN NN
per person per            IN NN IN
person per day            NN IN NN
stay value per            NN NN IN
two length stay           CD NN NN
value per person          NN IN NN

Step 5: Retaining only the noun/verb combinations, we end up with: complimentary gym access, length stay value, person per day

Things to remember:

  • Refer to the Penn Treebank to understand the PoS tag descriptions
  • Depending on your data and the business context, you can decide the value of n for extracting n-grams from the sentence
  • Adding domain-specific stop words will increase the quality of concept/theme extraction
  • Deep NLP techniques will give better results, i.e., rather than using n-grams, detect relationships within the sentences and represent/express them as complex constructions to retain the context. For additional info, see this
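To illustrate the domain-specific stop-word point above: if these sentences describe hotel amenities, words like "complimentary" or "per" carry little meaning and can be filtered alongside the general stopwords. The word lists here are made up for the example:

```python
# General English stopwords (illustrative subset)
GENERAL_STOPWORDS = {"for", "the", "of"}
# Domain-specific stopwords for a hotel-amenities context (hypothetical)
DOMAIN_STOPWORDS = {"complimentary", "value", "per", "person", "day"}

tokens = ["complimentary", "gym", "access", "two", "length", "stay",
          "value", "per", "person", "per", "day"]

filtered = [t for t in tokens
            if t not in GENERAL_STOPWORDS | DOMAIN_STOPWORDS]
print(filtered)
# ['gym', 'access', 'two', 'length', 'stay']
```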

Tools:

You can consider using OpenNLP or StanfordNLP for part-of-speech tagging. Most programming languages have supporting libraries for OpenNLP/StanfordNLP, so you can choose the language based on your comfort. Below is the sample R code I used for PoS tagging.

Sample R code:

Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7') # for 32-bit version
library(rJava)
require("openNLP")
require("NLP")

s <- paste("Complimentary gym access for two for the length of stay $12 value per person per day")

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

tagged_str <- tagPOS(s)
tagged_str

#$POStagged
#[1] "Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN per/IN day/NN"

#$POStags
# [1] "NNP" "NN"  "NN"  "IN"  "CD"  "IN"  "DT"  "NN"  "IN"  "NN"  "$"   "CD"
#[13] "NN"  "IN"  "NN"  "IN"  "NN"

Additional readings on Shallow & Deep NLP:

Zephyr

You need to analyze the sentence structure and extract the corresponding syntactic categories of interest (in this case, I think it would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.

In regard to available software tools for implementing the above-mentioned approach and beyond, I would suggest considering either NLTK (if you prefer Python) or the StanfordNLP software (if you prefer Java). For many other NLP frameworks, libraries, and programming-language support, see the corresponding (NLP) sections in this excellent curated list.

Aleksandr Blekh

If you're an R user, there is a lot of good practical information here; look at their text-mining examples.
Also, take a look at the tm package.
This is also a good aggregation site.

Zephyr