Questions tagged [pandas]

pandas is a Python library for Panel Data manipulation and analysis, e.g. multidimensional time series and cross-sectional data sets commonly found in statistics, experimental science results, econometrics, or finance.

The name comes from PAN-el DA-ta. pandas is implemented primarily using NumPy and Cython; it is intended to integrate easily with other NumPy-based scientific libraries, such as statsmodels.

Main Features:

  • Data structures: for 1-, 2-, and 3-dimensional labeled data sets (respectively Series, DataFrames and Panels; Panel has since been removed from current pandas releases). Some of their main features include:
    • Automatic data alignment and interpolation
    • Handling missing observations in calculations
    • Convenient slicing and reshaping ("reindexing") functions
    • Categorical data types
    • 'Group by' aggregation and transformation functionality
    • Tools for merging / joining together data sets
    • Simple matplotlib integration for plotting
  • Date tools: objects for expressing date offsets or generating date ranges; some functionality similar to scikits.timeseries. Dates can be aligned to a specific timezone and converted or compared at will
  • Statistical models: convenient ordinary least squares and panel OLS implementations for in-sample or rolling time series / cross-sectional regressions. These will hopefully be the starting point for implementing other models
  • Intelligent Cython offloading; complex computations are performed rapidly due to these optimizations.
  • Static and moving statistical tools: mean, standard deviation, correlation, covariance
  • Rich user documentation, built with Sphinx
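
A few of the features listed above can be illustrated with a minimal sketch. The data, labels and variable names below are made up for illustration; the calls used (Series arithmetic, `groupby`, `merge`, `date_range`, `rolling`) are standard pandas API, but this is only one way to exercise them, not a canonical example from the pandas documentation.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on index labels; "x" has no match in b, so it becomes NaN.
total = a + b

# Reductions skip missing values by default (skipna=True).
total_sum = total.sum()

# 'Group by' aggregation on a small DataFrame.
df = pd.DataFrame({"group": ["one", "one", "two"], "value": [1, 2, 3]})
group_sums = df.groupby("group")["value"].sum()

# Left join against a hypothetical lookup table.
labels = pd.DataFrame({"group": ["one", "two"], "label": ["first", "second"]})
merged = df.merge(labels, on="group", how="left")

# Date range plus a moving statistic (2-period rolling mean).
dates = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)
rolling_mean = ts.rolling(window=2).mean()

print(total, total_sum, group_sums, merged, rolling_mean, sep="\n\n")
```

The first value of the rolling mean is missing because a 2-period window is not yet full at that point.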

1341 questions
196 votes, 2 answers

Difference between isna() and isnull() in pandas

I have been using pandas for quite some time. But, I don't understand what's the difference between isna() and isnull(). And, more importantly, which one to use when identifying missing values in a dataframe. What is the basic underlying difference…
Vaibhav Thakur

151 votes, 13 answers

Why do people prefer Pandas to SQL?

I've been using SQL since 1996, so I may be biased. I've used MySQL and SQLite 3 extensively, but have also used Microsoft SQL Server and Oracle. The vast majority of the operations I've seen done with Pandas can be done more easily with SQL. This…
vy32

96 votes, 10 answers

ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I got ValueError when predicting test data using a RandomForest model. My code: clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2) clf.fit(X_fit, y_fit) df_test.fillna(df_test.mean()) X_test = df_test.values y_pred =…
Edamame

77 votes, 4 answers

Convert a list of lists into a Pandas Dataframe

I am trying to convert a list of lists which looks like the following into a Pandas Dataframe [['New York Yankees ', '"Acevedo Juan" ', 900000, ' Pitcher\n'], ['New York Yankees ', '"Anderson Jason"', 300000, ' Pitcher\n'], ['New York Yankees ',…
Aravind Veluchamy

55 votes, 9 answers

How do I compare columns in different data frames?

I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.
a_a_a

49 votes, 5 answers

Opening a 20GB file for analysis with pandas

I am currently trying to open a file with pandas and Python for machine learning purposes; it would be ideal for me to have it all in a DataFrame. Now the file is 18GB large and my RAM is 32 GB, but I keep getting memory errors. From your experience…
Hari Prasad

38 votes, 3 answers

Calculation and Visualization of Correlation Matrix with Pandas

I have a pandas data frame with several entries, and I want to calculate the correlation between the income of some type of stores. There are a number of stores with income data, classification of area of activity (theater, cloth stores, food ...)…
gdlm

32 votes, 4 answers

Is pandas now faster than data.table?

Here is the GitHub link to the most recent data.table benchmark. The data.table benchmarks have not been updated since 2014. I heard somewhere that Pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used…
xiaodai

31 votes, 6 answers

How to fill missing value based on other columns in Pandas dataframe?

Suppose I have a 5*3 data frame in which the third column contains missing values:
1 2 3
4 5 NaN
7 8 9
3 2 NaN
5 6 NaN
I want to fill each missing value with the product of the first and second columns:
1 2 3
4 5 20 <-- 4*5
7 8 9
3 2 6 <-- 3*2
5 6 30…
KyL

31 votes, 8 answers

How to count the number of missing values in each row in Pandas dataframe?

How can I get the number of missing values in each row of a Pandas dataframe? I would like to split the dataframe into separate dataframes whose rows all have the same number of missing values. Any suggestions?
Kaggle

28 votes, 6 answers

make seaborn heatmap bigger

I create a corr() df out of an original df. The corr() df came out 70 X 70 and it is impossible to visualize the heatmap... sns.heatmap(df). If I try to display the corr = df.corr(), the table doesn't fit the screen and I can see all the…
redeemefy

27 votes, 3 answers

How to sum values grouped by two columns in pandas

I have a Pandas DataFrame like this: df = pd.DataFrame({ 'Date': ['2017-1-1', '2017-1-1', '2017-1-2', '2017-1-2', '2017-1-3'], 'Groups': ['one', 'one', 'one', 'two', 'two'], 'data': range(1, 6)}) Date Groups data 0 …
Kevin

27 votes, 4 answers

Is there a straightforward way to run pandas.DataFrame.isin in parallel?

I have a modeling and scoring program that makes heavy use of the DataFrame.isin function of pandas, searching through lists of facebook "like" records of individual users for each of a few thousand specific pages. This is the most time-consuming…
Therriault

25 votes, 3 answers

Pandas Dataframe to DMatrix

I am trying to run xgboost in scikit learn. And I am only using Pandas to load the data into a dataframe. How am I supposed to use pandas df with xgboost? I am confused by the DMatrix routine required to run the xgboost algorithm.
Ghostintheshell

25 votes, 4 answers

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of gather(), separate() and spread(), which can be found in R tidyr tool.
cpumar