Questions tagged [python]

Use for data science questions related to the programming language Python. Not intended for general coding questions (which should be asked on Stack Overflow).

Python is a general-purpose, dynamic, strongly typed language with many third-party libraries for data science applications. As of 2020-01-01, Python 3 is the only actively developed version; Python 2 has been sunset.

Python syntax is relatively easy to comprehend compared to other languages. For example:

numbers = [1, 2, 5, 8, 9]
for number in numbers:
    print("Hello world #", number)

Python has a clean look because whitespace is part of its syntax. While seemingly restrictive, this makes all Python code look similar, so reading unfamiliar code is much more predictable. The code block that follows any loop or conditional statement (for, while, if, etc.) must be indented.
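As a small illustration of the indentation rule, here is a sketch that extends the loop above with a conditional. The `labels` list and the even/odd check are made up for the example; they are not part of the original snippet.

```python
numbers = [1, 2, 5, 8, 9]
labels = []  # hypothetical list, used only for this illustration

for number in numbers:
    # Everything indented under the for statement is the loop body.
    if number % 2 == 0:
        # Indented one level further: runs only when the condition holds.
        labels.append("even")
    else:
        labels.append("odd")

# Back at the left margin: this runs once, after the loop has finished.
print(labels)
```

Removing the indentation from the body of the `for` or `if` statement raises an `IndentationError` rather than silently changing the meaning of the code.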

Popular scientific and data science packages include:

  • NumPy - A fast, N-dimensional array library; the foundation for most of the scientific Python stack.
  • SciPy - Numerical analysis built on NumPy. Provides optimization, linear algebra, Fourier transforms and much else.
  • Pandas (PANel DAta) - A fast and extremely flexible package that is very useful for data exploration. It handles missing (NaN) data well and supports fast indexing. It also reads and writes a wide variety of external data types and file formats.
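To make the NaN handling and indexing mentioned above a little more concrete, here is a minimal sketch. It assumes NumPy and pandas are installed; the city names and temperatures are invented sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: temperature readings with one missing value.
readings = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Cairo"], "temp_c": [4.5, np.nan, 31.0]}
).set_index("city")

missing_count = int(readings["temp_c"].isna().sum())  # count the NaN entries
mean_temp = readings["temp_c"].mean()                 # NaN is skipped by default
cairo_temp = readings.loc["Cairo", "temp_c"]          # label-based index lookup

print(missing_count, mean_temp, cairo_temp)
```

Note that `mean()` skips the missing value instead of returning NaN, which is often what you want during exploration but is worth keeping in mind when missing values carry meaning.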
6630 questions
250 votes, 10 answers

What's the difference between fit and fit_transform in scikit-learn models?

I do not understand the difference between the fit and fit_transform methods in scikit-learn. Can anybody explain simply why we might need to transform data? What does it mean, fitting a model on training data and transforming to test data? Does it…
Kaggle
196 votes, 2 answers

Difference between isna() and isnull() in pandas

I have been using pandas for quite some time. But, I don't understand what's the difference between isna() and isnull(). And, more importantly, which one to use when identifying missing values in a dataframe. What is the basic underlying difference…
Vaibhav Thakur
151 votes, 13 answers

Why do people prefer Pandas to SQL?

I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle. The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This…
vy32
150 votes, 17 answers

Best python library for neural networks

I'm using Neural Networks to solve different Machine learning problems. I'm using Python and pybrain but this library is almost discontinued. Are there other good alternatives in Python?
marcodena
129 votes, 14 answers

Python vs R for machine learning

I'm just starting to develop a machine learning application for academic purposes. I'm currently using R and training myself in it. However, in a lot of places, I have seen people using Python. What are people using in academia and industry, and…
user721
125 votes, 2 answers

Training an RNN with examples of different lengths in Keras

I am trying to get started learning about RNNs and I'm using Keras. I understand the basic premise of vanilla RNN and LSTM layers, but I'm having trouble understanding a certain technical point for training. In the keras documentation, it says the…
Tac-Tics
119 votes, 12 answers

SVM using scikit learn runs endlessly and never completes execution

I am trying to run SVR using scikit-learn (python) on a training dataset that has 595605 rows and 5 columns (features) while the test dataset has 397070 rows. The data has been pre-processed and regularized. I am able to successfully run the test…
tejaskhot
96 votes, 10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…
Edamame
88 votes, 6 answers

strings as features in decision tree/random forest

I am doing some problems on an application of decision tree/random forest. I am trying to fit a problem which has numbers as well as strings (such as country name) as features. Now the library, scikit-learn takes only numbers as parameters, but I…
77 votes, 10 answers

How to clone Python working environment on another machine?

I developed a machine learning model with Python (Anaconda + Flask) on my workstation and all goes well. Later, I tried to ship this program onto another machine where of course I tried to set up the same environment, but the program fails to run. I…
Hendrik
74 votes, 7 answers

Open source Anomaly Detection in Python

Problem Background: I am working on a project that involves log files similar to those found in the IT monitoring space (to my best understanding of IT space). These log files are time-series data, organized into hundreds/thousands of rows of…
ximiki
71 votes, 4 answers

What is the use of torch.no_grad in pytorch?

I am new to pytorch and started with this github code. I do not understand the comment in line 60-61 in the code "because weights have requires_grad=True, but we don't need to track this in autograd". I understood that we mention requires_grad=True…
mausamsion
67 votes, 9 answers

Clustering geo location coordinates (lat,long pairs)

What is the right approach and clustering algorithm for geolocation clustering? I'm using the following code to cluster geolocation coordinates: import numpy as np import matplotlib.pyplot as plt from scipy.cluster.vq import kmeans2,…
rokpoto.com
63 votes, 9 answers

Tools and protocol for reproducible data science using Python

I am working on a data science project using Python. The project has several stages. Each stage consists of taking a data set, using Python scripts, auxiliary data, configuration and parameters, and creating another data set. I store the code in…
Yuval F
63 votes, 4 answers

Difference between OrdinalEncoder and LabelEncoder

I was going through the official documentation of scikit-learn after going through a book on ML and came across the following thing: In the Documentation it is given about sklearn.preprocessing.OrdinalEncoder() whereas in the book it was given…