Highest Voted Questions - Data Science Stack Exchange

10

votes

2 answers

Extract canonical string from a list of noisy strings

I have thousands of lists of strings, and each list has about 10 strings. Most strings in a given list are very similar, though some strings are (rarely) completely unrelated to the others and some strings contain irrelevant words. They can be…

nlp similarity information-retrieval

asked Aug 22 '14 at 15:59

lacton

201
1
5
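
One possible reading of the canonical-string question above is "pick the list member that is most similar to all the others", that is, a medoid. The sketch below assumes that reading and uses the standard library's `difflib.SequenceMatcher` as the similarity measure; the function name `canonical_string` and the sample data are hypothetical, not taken from the question.

```python
from difflib import SequenceMatcher


def canonical_string(strings):
    """Return the string with the highest total similarity to the others (a medoid)."""

    def similarity(a, b):
        # Ratio in [0, 1]; 1.0 means the lowercased strings are identical.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def total_similarity(candidate_index):
        candidate = strings[candidate_index]
        return sum(
            similarity(candidate, other)
            for j, other in enumerate(strings)
            if j != candidate_index
        )

    # On ties, max() keeps the earliest candidate in the list.
    best_index = max(range(len(strings)), key=total_similarity)
    return strings[best_index]


# Hypothetical noisy list: near-duplicates, an extra word, and one outlier.
noisy = [
    "Acme Corporation",
    "Acme Corporation Ltd",
    "ACME Corporation",
    "Acme Corp.",
    "Husky",
]
print(canonical_string(noisy))
```

The pairwise comparison is quadratic in the list length, which is cheap for lists of about 10 strings. The unrelated entry scores low against everything, so it cannot win.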

10

votes

2 answers

Machine Learning Steps

Which of the options below is the correct set of steps when creating a predictive model? Option 1: First eliminate the most obviously bad predictors, and preprocess the remaining if needed, then train various models with cross-validation, pick…

machine-learning predictive-modeling

asked Feb 04 '16 at 08:43

A K

103
4

9

votes

4 answers

How to combine PCA and MCA on mixed data?

Suppose I have mixed data and (python) code which is capable of doing PCA (principal component analysis) on continuous predictors and MCA (multiple correspondence analysis) on nominal predictors. Is it possible to combine results from PCA and MCA…

python categorical-data

asked Jan 19 '16 at 09:03

Boycott OpenAI sellouts

191
1
1
4

9

votes

1 answer

What tokenizer does OpenAI's GPT3 API use?

I'm building an application for the API, but I would like to be able to count the number of tokens my prompt will use, before I submit an API call. Currently I often submit prompts that yield a 'too-many-tokens' error. The closest I got to an answer…

python-3.x tokenization gpt

asked Jul 08 '21 at 18:07

Herman Autore

93
1
3

9

votes

3 answers

Export weights (formula) from Random Forest Regressor in Scikit-Learn

I trained a prediction model with Scikit Learn in Python (Random Forest Regressor) and I want to extract somehow the weights of each feature to create an excel tool for manual prediction. The only thing that I found is the model.feature_importances_…

python predictive-modeling regression random-forest scikit-learn

asked Jan 08 '16 at 11:57

Tasos

3,960
5
25
54

9

votes

2 answers

Ethical consequences of non-deterministic learning processes?

Most advanced supervised learning techniques are non-deterministic by construction. The final output of the model usually depends on some random parts of the learning process. (Random weight initialization for Neural Networks or variable selection /…

model-selection methodology ethical-ai

asked Jun 23 '21 at 11:39

Lucas Morin

2,775
5
25
47

9

votes

1 answer

Where does the name 'LSTM' come from?

Long short-term memory is a recurrent neural network architecture introduced in the paper Long short-term memory. Can you please tell me where the name comes from? ("Memory", as the network can store information because of the recurrence - but where…

machine-learning neural-network terminology

asked Dec 25 '15 at 08:08

Martin Thoma

19,540
36
98
170

9

votes

1 answer

Properties for building a Multilayer Perceptron Neural Network using Keras?

I am trying to build and train a multilayer perceptron neural network that correctly predicts what president won in what county for the first time. I have the following information for training data: Total population, Median age, % BachelorsDeg or…

neural-network keras

asked Dec 24 '15 at 00:54

pr338

385
2
7

9

votes

1 answer

How to customise cost function in Scikit learn's model?

For example, when I have a problem that false negative should be penalised more, how can I incorporate that requirement in the algorithm such as SVM?

machine-learning python scikit-learn

asked Dec 19 '15 at 07:52

Ghostintheshell

451
1
5
7
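
For the cost-function question above, one common way to penalise false negatives more heavily is to weight the loss per class. scikit-learn's `SVC` exposes a `class_weight` parameter that scales the penalty `C` per class. The plain-Python sketch below only illustrates the idea on a hinge loss; the function name, the weights, and the sample scores are assumptions for illustration, not scikit-learn internals.

```python
def weighted_hinge_loss(y_true, scores, weight_pos=5.0, weight_neg=1.0):
    """Mean hinge loss where errors on positive examples cost more.

    y_true: labels in {-1, +1}; scores: raw decision values f(x).
    """
    total = 0.0
    for y, score in zip(y_true, scores):
        # Assumed weighting: mistakes on the positive class are 5x as costly.
        weight = weight_pos if y == 1 else weight_neg
        total += weight * max(0.0, 1.0 - y * score)
    return total / len(y_true)


labels = [1, 1, -1, -1]
missed_positive = [-0.5, 2.0, -2.0, -2.0]  # one false negative
false_alarm = [2.0, 2.0, 0.5, -2.0]        # one false positive

print(weighted_hinge_loss(labels, missed_positive))  # 1.875
print(weighted_hinge_loss(labels, false_alarm))      # 0.375
```

Both mistakes sit equally far on the wrong side of the margin, yet the false negative contributes five times the loss of the false positive.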

9

votes

1 answer

Dimensions of Transformer - dmodel and depth

Trying to understand the dimensions of the Multihead Attention component in Transformer, referring to the following tutorial: https://www.tensorflow.org/tutorials/text/transformer#setup There are 2 unknown dimensions - depth and d_model which I don't…

deep-learning neural-network keras tensorflow transformer

asked Apr 30 '21 at 08:05

data_person

265
1
3
11
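
For the Transformer question above: in multi-head attention, `d_model` is the width of each token's representation, and `depth` is the slice of that width handled by a single head, so `depth = d_model // num_heads`. The helper below is a hypothetical sketch of the resulting shapes, not code from the tutorial; the example values (512 and 8) are the base configuration from the original Transformer paper.

```python
def split_heads_shape(batch_size, seq_len, d_model, num_heads):
    """Shape of a (batch_size, seq_len, d_model) tensor after splitting into heads."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    depth = d_model // num_heads  # per-head feature width
    return (batch_size, num_heads, seq_len, depth)


print(split_heads_shape(batch_size=64, seq_len=40, d_model=512, num_heads=8))
# (64, 8, 40, 64)
```

Concatenating the heads again restores the last dimension to `num_heads * depth`, which equals `d_model`.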

9

votes

1 answer

What is the difference between affinity matrix eigenvectors and graph Laplacian eigenvectors in the context of spectral clustering?

In spectral clustering, it's standard practice to solve the eigenvector problem $$L v = \lambda v$$ where $L$ is the graph Laplacian, $v$ is the eigenvector related to eigenvalue $\lambda$. My question: why bother taking the graph Laplacian?…

machine-learning clustering graphs

asked Dec 12 '15 at 13:35

felipeduque

201
1
2
5
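
For reference on the spectral clustering question above, the unnormalized graph Laplacian is usually defined from the affinity matrix as follows. The symbols $A$ (affinity matrix) and $D$ (diagonal degree matrix) are notation assumed here; the excerpt does not name them.

$$L = D - A, \qquad D_{ii} = \sum_j A_{ij}$$

Every row of $L$ therefore sums to zero, so the constant vector is always an eigenvector with eigenvalue $0$:

$$L \mathbf{1} = D \mathbf{1} - A \mathbf{1} = 0$$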

9

votes

7 answers

Python library that can compute the confusion matrix for multi-label classification

I'm looking for a Python library that can compute the confusion matrix for multi-label classification. (FYI: scikit-learn doesn't support multi-label for confusion matrix.) Related: What is the difference between Multiclass and Multilabel Problem

python software-recommendation multilabel-classification

asked Dec 11 '15 at 02:54

Franck Dernoncourt

5,862
12
44
80
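
For the multi-label question above, a common convention is to compute one 2x2 confusion matrix per label, treating each label as its own binary problem. Later scikit-learn releases added `sklearn.metrics.multilabel_confusion_matrix`, which follows this convention. The dependency-free sketch below shows the same idea; the function name and the sample data are hypothetical.

```python
def multilabel_confusion(y_true, y_pred):
    """Return one [[tn, fp], [fn, tp]] matrix per label.

    y_true, y_pred: lists of equal-length 0/1 indicator rows, one row per sample.
    """
    n_labels = len(y_true[0])
    matrices = [[[0, 0], [0, 0]] for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for label, (t, p) in enumerate(zip(true_row, pred_row)):
            # Row index is the true value, column index is the predicted value.
            matrices[label][t][p] += 1
    return matrices


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
for label, matrix in enumerate(multilabel_confusion(y_true, y_pred)):
    print(label, matrix)
# 0 [[1, 0], [0, 2]]
# 1 [[1, 0], [1, 1]]
# 2 [[1, 1], [1, 0]]
```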

9

votes

6 answers

Which cross-validation type best suits to binary classification problem

Data set looks like: 25000 observations; up to 15 predictors of different types (numeric, multi-class categorical, binary); target variable is binary. Which cross validation method is typical for this type of problems? By default I'm using K-Fold. How…

classification cross-validation

asked Aug 06 '14 at 08:41

IgorS

5,474
11
34
43
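
For the cross-validation question above, a frequently used alternative to plain K-Fold on a binary target is stratified K-Fold, which keeps the class ratio roughly equal in every fold (scikit-learn offers this as `StratifiedKFold`). The dependency-free sketch below illustrates the splitting logic; the function name and the 20/80 toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) with class ratios preserved per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            index for j, fold in enumerate(folds) if j != i for index in fold
        )
        yield train, test


# Hypothetical imbalanced target: 20 positives, 80 negatives.
labels = [1] * 20 + [0] * 80
for train, test in stratified_k_fold(labels, k=5):
    positives_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), positives_in_test)  # 80 20 4 on every fold
```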

9

votes

1 answer

Is a multi-layer perceptron exactly the same as a simple fully connected neural network?

I've been learning a little about StyleGans lately and somebody told me that a Multi-Layer Perceptron, MLP, is used in parts of the architecture for transforming noise. When I saw this person's code, it just looked like a normal 8-layer fully…

deep-learning neural-network gan mlp

asked Mar 25 '21 at 09:23

zipline86

399
1
5
13

9

votes

1 answer

When do I have to use aucPR instead of auROC? (and vice versa)

I'm wondering whether, to validate a model, it is sometimes better to use aucPR instead of auROC. Do these cases only depend on the "domain & business understanding"? Especially, I'm thinking about the "unbalanced class problem" where, it seems…

machine-learning data-mining cross-validation model-evaluations

asked Nov 24 '15 at 11:50

jmvllt

629
2
8
15
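
For the aucPR versus auROC question above, a small hand-built example can show how the two metrics diverge on unbalanced classes. The sketch below computes ROC AUC as the fraction of correctly ordered positive/negative pairs and approximates the area under the precision-recall curve with average precision. Both helpers and the toy scores are assumptions for illustration, and the average-precision helper assumes no tied scores and at least one positive.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Hypothetical data: 98 negatives with scores 0.00..0.97, and 2 positives.
y_true = [0] * 98 + [1, 1]
scores = [i / 100 for i in range(98)] + [0.945, 0.925]

print(round(roc_auc(y_true, scores), 3))            # 0.959
print(round(average_precision(y_true, scores), 3))  # 0.268
```

Only five negatives outrank a positive, which barely dents the ROC AUC because there are 196 positive/negative pairs. Those same five negatives make up most of the top-ranked predictions, so precision falls sharply.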
