Questions tagged [pandas]

pandas is a Python library for panel data (PAN-el DA-ta) manipulation and analysis, e.g. multidimensional time series and cross-sectional data sets commonly found in statistics, experimental science results, econometrics, or finance. pandas is implemented primarily using NumPy and Cython; it is intended to integrate easily with other NumPy-based scientific libraries, such as statsmodels.

Main Features:

- Data structures for 1-, 2-, and 3-dimensional labeled data sets (respectively Series, DataFrame, and Panel; Panel was removed in pandas 0.25). Some of their main features include:
  - Automatic data alignment and interpolation
  - Handling of missing observations in calculations
  - Convenient slicing and reshaping ("reindexing") functions
  - Categorical data types
  - 'Group by' aggregation and transformation functionality
  - Tools for merging / joining data sets
  - Simple matplotlib integration for plotting
- Date tools: objects for expressing date offsets or generating date ranges; some functionality is similar to scikits.timeseries. Dates can be localized to a specific time zone and converted / compared at will.
- Statistical models: convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series / cross-sectional regressions, originally intended as a starting point for implementing other models (these were later removed from pandas in favor of statsmodels).
- Intelligent Cython offloading: complex computations run quickly because performance-critical code is compiled with Cython.
- Static and moving statistical tools: mean, standard deviation, correlation, covariance
- Rich user documentation, built with Sphinx
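
A few of the features listed above can be illustrated with a minimal sketch. The data, column names (`store`, `amount`, `region`), and variable names are made up for illustration and are not taken from the tag description; the calls themselves are standard pandas API.

```python
import pandas as pd

# Automatic alignment: arithmetic matches values by index label, not position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])
total = a + b  # "x" has no partner in b, so the result there is NaN

# Missing observations are skipped by default in reductions such as mean().
mean_total = total.mean()  # averages only the non-missing values

# 'Group by' aggregation on a small, hypothetical DataFrame.
sales = pd.DataFrame({"store": ["A", "A", "B"], "amount": [5, 7, 11]})
per_store = sales.groupby("store")["amount"].sum()

# Merging / joining two data sets on a shared key column.
regions = pd.DataFrame({"store": ["A", "B"], "region": ["north", "south"]})
merged = sales.merge(regions, on="store", how="left")

# Date tools and a moving statistic: a daily range and a 2-day rolling mean.
days = pd.date_range("2020-01-01", periods=4, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0], index=days)
rolling_mean = daily.rolling(window=2).mean()  # first value is NaN (window not full)
```

Here `how="left"` keeps every row of `sales` even if a store had no match in `regions`; other join types (`inner`, `outer`, `right`) would be chosen depending on which rows should survive the merge.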

1341 questions

196

votes

2 answers

Difference between isna() and isnull() in pandas

I have been using pandas for quite some time. But, I don't understand what's the difference between isna() and isnull(). And, more importantly, which one to use when identifying missing values in a dataframe. What is the basic underlying difference…

python pandas dataframe

asked Sep 06 '18 at 10:14

Vaibhav Thakur

151

votes

13 answers

Why do people prefer Pandas to SQL?

I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle. The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This…

python pandas sql

asked Jul 12 '18 at 09:25

vy32

votes

10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…

python scikit-learn pandas random-forest python-3.x

asked May 26 '16 at 04:13

Edamame

votes

4 answers

Convert a list of lists into a Pandas Dataframe

I am trying to convert a list of lists which looks like the following into a Pandas Dataframe [['New York Yankees ', '"Acevedo Juan" ', 900000, ' Pitcher\n'], ['New York Yankees ', '"Anderson Jason"', 300000, ' Pitcher\n'], ['New York Yankees ',…

pandas

asked Jan 05 '18 at 18:40

Aravind Veluchamy

votes

9 answers

How do I compare columns in different data frames?

I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.

pandas dataframe

asked Jun 12 '18 at 22:34

a_a_a

votes

5 answers

Opening a 20GB file for analysis with pandas

I am currently trying to open a file with pandas and python for machine learning purposes it would be ideal for me to have them all in a DataFrame. Now The file is 18GB large and my RAM is 32 GB but I keep getting memory errors. From your experience…

python bigdata pandas anaconda

asked Feb 13 '18 at 14:03

Hari Prasad

votes

3 answers

Calculation and Visualization of Correlation Matrix with Pandas

I have a pandas data frame with several entries, and I want to calculate the correlation between the income of some type of stores. There are a number of stores with income data, classification of area of activity (theater, cloth stores, food ...)…

python statistics visualization pandas

asked Mar 01 '16 at 05:56

gdlm

votes

4 answers

Is pandas now faster than data.table?

Here is the GitHub link to the most recent data.table benchmark. The data.table benchmarks has not been updated since 2014. I heard somewhere that Pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used…

python r pandas data data-table

asked Oct 25 '17 at 02:43

xiaodai

votes

6 answers

How to fill missing value based on other columns in Pandas dataframe?

Suppose I have a 5*3 data frame in which third column contains missing value 1 2 3 4 5 NaN 7 8 9 3 2 NaN 5 6 NaN I hope to generate value for missing value based rule that first product second column 1 2 3 4 5 20 <--4*5 7 8 9 3 2 6 <-- 3*2 5 6 30…

pandas

asked Mar 22 '17 at 12:57

KyL

votes

8 answers

How to count the number of missing values in each row in Pandas dataframe?

How can I get the number of missing value in each row in Pandas dataframe. I would like to split dataframe to different dataframes which have same number of missing values in each row. Any suggestion?

python pandas

asked Jul 07 '16 at 10:26

Kaggle

votes

6 answers

make seaborn heatmap bigger

I create a corr() df out of an original df. The corr() df came out 70 X 70 and it is impossible to visualize the heatmap... sns.heatmap(df). If I try to display the corr = df.corr(), the table doesn't fit the screen and I can see all the…

visualization pandas plotting

asked Mar 12 '17 at 18:32

redeemefy

votes

3 answers

How to sum values grouped by two columns in pandas

I have a Pandas DataFrame like this: df = pd.DataFrame({ 'Date': ['2017-1-1', '2017-1-1', '2017-1-2', '2017-1-2', '2017-1-3'], 'Groups': ['one', 'one', 'one', 'two', 'two'], 'data': range(1, 6)}) Date Groups data 0 …

python pandas dataframe

asked Jul 10 '17 at 15:47

Kevin

votes

4 answers

Is there a straightforward way to run pandas.DataFrame.isin in parallel?

I have a modeling and scoring program that makes heavy use of the DataFrame.isin function of pandas, searching through lists of facebook "like" records of individual users for each of a few thousand specific pages. This is the most time-consuming…

performance python pandas parallel

asked May 19 '14 at 23:59

Therriault

votes

3 answers

Pandas Dataframe to DMatrix

I am trying to run xgboost in scikit learn. And I am only using Pandas to load the data into a dataframe. How am I supposed to use pandas df with xgboost? I am confused by the DMatrix routine required to run the xgboost algorithm.

scikit-learn pandas xgboost

asked Jul 15 '16 at 13:48

Ghostintheshell

votes

4 answers

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of gather(), separate() and spread(), which can be found in R tidyr tool.

r python dataset data-cleaning pandas

asked Mar 02 '16 at 08:54

cpumar
