Questions tagged [python]

Use for data science questions related to the programming language Python. Not intended for general coding questions (which should be asked on Stack Overflow).

Python is a general-purpose, dynamic, strongly typed language with many third-party libraries for data science applications. Python 2 was sunset on 2020-01-01, so Python 3 is the only version under active development.

Python syntax is relatively easy to comprehend compared to other languages. For example:

numbers = [1, 2, 5, 8, 9]
for number in numbers:
    print("Hello world #", number)

Python has a clean look because whitespace is part of its syntax. While seemingly restrictive, this makes all Python code look similar, so reading unfamiliar code is more predictable. The code block that follows a loop or conditional statement (for, while, if, etc.) must be indented.
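The indentation rule described above can be illustrated with a small sketch. The list reuses the numbers from the earlier example; the `labels` name is introduced here purely for illustration.

```python
numbers = [1, 2, 5, 8, 9]
labels = []  # hypothetical name, used only for this illustration

for number in numbers:
    # Indented one level: this block belongs to the for loop.
    if number % 2 == 0:
        # Indented two levels: runs only when the condition holds.
        labels.append("even")
    else:
        labels.append("odd")

# Back at the left margin: runs once, after the loop has finished.
print(labels)
```

Removing or misaligning the indentation inside the loop would raise an `IndentationError` or change which statements belong to which block.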

Popular scientific and data science packages include:

NumPy - A fast, N-dimensional array library; the foundation of most scientific Python packages.
SciPy - Numerical analysis built on NumPy. Provides optimization, linear algebra, Fourier transforms and much else.
pandas (from "panel data") - A fast and flexible package that is very useful for data exploration. It handles missing (NaN) data well, offers fast indexing, and reads a wide variety of external data types and file formats.
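As a rough sketch of what two of the packages listed above look like in use, the snippet below shows a NumPy array operation and pandas handling of a missing value. It assumes NumPy and pandas are installed; the variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic on an array, with no explicit loop.
values = np.array([1.0, 2.0, 5.0, 8.0, 9.0])
doubled = values * 2

# pandas: a labeled Series containing one missing value (NaN).
readings = pd.Series([1.0, np.nan, 5.0], index=["a", "b", "c"])

missing_count = int(readings.isna().sum())  # number of NaN entries
mean_reading = readings.mean()              # NaN is skipped by default
by_label = readings.loc["c"]                # label-based indexing

print(doubled, missing_count, mean_reading, by_label)
```

The point of the sketch is only that missing data does not need special-case code for common aggregations, and that values can be looked up by label rather than by position.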

6630 questions

250

votes

10 answers

What's the difference between fit and fit_transform in scikit-learn models?

I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data? What does it mean, fitting a model on training data and transforming to test data? Does it…

python scikit-learn

asked Jun 21 '16 at 10:05

Kaggle

196

votes

2 answers

Difference between isna() and isnull() in pandas

I have been using pandas for quite some time. But, I don't understand what's the difference between isna() and isnull(). And, more importantly, which one to use when identifying missing values in a dataframe. What is the basic underlying difference…

python pandas dataframe

asked Sep 06 '18 at 10:14

Vaibhav Thakur

151

votes

13 answers

Why do people prefer Pandas to SQL?

I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle. The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This…

python pandas sql

asked Jul 12 '18 at 09:25

vy32

150

votes

17 answers

Best python library for neural networks

I'm using Neural Networks to solve different Machine learning problems. I'm using Python and pybrain but this library is almost discontinued. Are there other good alternatives in Python?

machine-learning python neural-network

asked Jul 07 '14 at 19:17

marcodena

129

votes

14 answers

Python vs R for machine learning

I'm just starting to develop a machine learning application for academic purposes. I'm currently using R and training myself in it. However, in a lot of places, I have seen people using Python. What are people using in academia and industry, and…

machine-learning r python

asked Jun 12 '14 at 06:04

user721

125

votes

2 answers

Training an RNN with examples of different lengths in Keras

I am trying to get started learning about RNNs and I'm using Keras. I understand the basic premise of vanilla RNN and LSTM layers, but I'm having trouble understanding a certain technical point for training. In the keras documentation, it says the…

python keras rnn training

asked Jan 06 '18 at 23:41

Tac-Tics

119

votes

12 answers

SVM using scikit learn runs endlessly and never completes execution

I am trying to run SVR using scikit-learn (python) on a training dataset that has 595605 rows and 5 columns (features) while the test dataset has 397070 rows. The data has been pre-processed and regularized. I am able to successfully run the test…

python svm scikit-learn

asked Aug 18 '14 at 10:46

tejaskhot

votes

10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…

python scikit-learn pandas random-forest python-3.x

asked May 26 '16 at 04:13

Edamame

votes

6 answers

strings as features in decision tree/random forest

I am doing some problems on an application of decision tree/random forest. I am trying to fit a problem which has numbers as well as strings (such as country name) as features. Now the library, scikit-learn takes only numbers as parameters, but I…

machine-learning python scikit-learn random-forest decision-trees

asked Feb 25 '15 at 01:07

user3001408

votes

10 answers

How to clone Python working environment on another machine?

I developed a machine learning model with Python (Anaconda + Flask) on my workstation and all goes well. Later, I tried to ship this program onto another machine where of course I tried to set up the same environment, but the program fails to run. I…

python anaconda

asked Oct 26 '17 at 12:36

Hendrik

votes

7 answers

Open source Anomaly Detection in Python

Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to my best understanding of IT space). These log files are time-series data, organized into hundreds/thousands of rows of…

machine-learning python data-mining anomaly-detection library

asked Jul 22 '15 at 14:26

ximiki

votes

4 answers

What is the use of torch.no_grad in pytorch?

I am new to pytorch and started with this github code. I do not understand the comment in line 60-61 in the code "because weights have requires_grad=True, but we don't need to track this in autograd". I understood that we mention requires_grad=True…

python pytorch

asked Jun 05 '18 at 08:21

mausamsion

votes

9 answers

Clustering geo location coordinates (lat,long pairs)

What is the right approach and clustering algorithm for geolocation clustering? I'm using the following code to cluster geolocation coordinates: import numpy as np import matplotlib.pyplot as plt from scipy.cluster.vq import kmeans2,…

machine-learning python clustering k-means geospatial

asked Jul 17 '14 at 09:50

rokpoto.com

votes

9 answers

Tools and protocol for reproducible data science using Python

I am working on a data science project using Python. The project has several stages. Each stage consists of taking a data set, using Python scripts, auxiliary data, configuration and parameters, and creating another data set. I store the code in…

python tools version-control

asked Jul 16 '14 at 20:09

Yuval F

votes

4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official documentation of scikit-learn after going through a book on ML and came across the following thing: the documentation describes sklearn.preprocessing.OrdinalEncoder() whereas in the book it was given…

machine-learning python scikit-learn preprocessing encoding

asked Oct 07 '18 at 18:55

Saurabh Singh
