Questions tagged [text]

Text is a type of data often used in data science projects involving natural language processing.

162 questions
37
votes
6 answers

Sentence similarity prediction

I'm looking to solve the following problem: I have a set of sentences as my dataset, and I want to be able to type a new sentence, and find the sentence that the new one is the most similar to in the dataset. An example would look like: New…
lte__
25
votes
3 answers

How do you apply SMOTE on text classification?

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced dataset problems. So far I have an idea of how to apply it to generic, structured data. But is it possible to apply it to a text classification problem?…
catris25
10
votes
1 answer

How to use TFIDF vectors with multinomial naive bayes?

Say we have used the TFIDF transform to encode documents into continuous-valued features. How would we now use this as input to a Naive Bayes classifier? Bernoulli naive-bayes is out, because our features aren't binary anymore. Seems like we can't…
dhrumeel
9
votes
2 answers

How to implement hierarchical labeling classification?

I am currently working on the task of eCommerce product name classification, so I have categories and subcategories in product data. I noticed that using subcategories as labels delivers worse results (84% acc) than categories (94% acc). But…
chacid
9
votes
1 answer

Which type of autoencoder gives the best results for text?

I did a couple of examples of autoencoders for images and they worked fine. Now I want to build an autoencoder for text that takes a sentence as input and returns the same sentence. But when I try to use the same autoencoders as the ones I used for…
sspp
7
votes
2 answers

Data transformations in hierarchical classification

I am building a hierarchical text classifier using the Local Classifier Per Parent Node (LCPN) approach with the 'siblings' policy, as described in "A survey of hierarchical classification across different application domains". E.g. if we have the…
matentzn
6
votes
2 answers

What is the minimum number of times a word needs to appear in word2vec training corpus for quality results?

When training a word2vec model with, e.g., gensim, you can specify the minimum number of times a word needs to be seen (with the parameter min_count). The default value for this seems to be 5. Are there any theoretical considerations for selecting a threshold…
user1253952
5
votes
1 answer

Doc2vec to calculate cosine similarity - absolutely inaccurate

I'm trying to modify the Doc2vec tutorial to calculate cosine similarity and take Pandas dataframes instead of .txt documents. I want to find the most similar sentence to a new sentence I put in from my data. However, after training, even if I give…
lte__
5
votes
2 answers

Text similarity using RNN

The data set contains records of short text, typically a sentence. The goal is to find duplicated records and similar records. So far, I have tried the R package 'text2vec', the GloVe word vectors, and the similarity APIs provided by the package. There is…
user28251
5
votes
3 answers

How can I group texts with similar content together?

I need to find a solution to group a corpus of texts according to document similarity. Given that I have no experience in ML - only a few readings - I'd like to know whether calculating the tf-idf on each text is the right approach. I've read something…
Max
5
votes
1 answer

How to evaluate the similarity of two columns containing strings?

I am new to text processing and stuck on a problem to identify the similarity of columns. To detail the problem, consider we have two columns with string values: Column A | Column B ------------------------------- abcd | …
Rachit Tayal
4
votes
1 answer

Encoding of text data in NLP

I'm getting data using web scraping to create a dataset. I have a 'company' column that contains the names of the companies. I would like to encode this column, but I don't know how to find the strings that represent the same company. For…
Lydia
4
votes
3 answers

Bidirectional Encoder Representations from Transformers in R

Can anybody suggest where I can find example code in R for using the BERT neural network in text mining tasks? All I can see are Python examples, and I need…
Kogan
4
votes
1 answer

How does the Multinomial Naive Bayes alpha parameter affect the text classification task?

I would like to know how the alpha parameter in Multinomial Naive Bayes affects the text classification task. I know that this parameter is related to the algorithm's ability to classify words that were not seen during training. How changes text…
Simone
4
votes
4 answers

Extract 2 pieces of information from a string - what to use?

First of all, I am a complete newbie with regard to data science, and I am not asking for the complete solution but for some guidance as to what I should read up on to achieve my task (what algorithms, techniques, etc. are used to tackle similar problems). I…
kyriakos