Questions tagged [dataframe]

A data frame is a tabular data structure. Usually, it contains data where rows are observations and columns are variables of various types. While "data frame" or "dataframe" is the term used for this concept in several languages (R, Apache Spark, deedle, Maple, the pandas library in Python and the DataFrames library in Julia), "table" is the term used in MATLAB and SQL.

Each section below covers one language that uses this term and is written for readers familiar only with that language.

data.frame in R

Data frames (object class data.frame) are one of the basic tabular data structures in the R language, alongside matrices. Unlike matrices, each column can be a different data type. In terms of implementation, a data frame is a list of equal-length column vectors.

Type ?data.frame for help constructing a data frame. An example:

data.frame(
  x = letters[1:5], 
  y = 1:5, 
  z = (1:5) > 3
)
#   x y     z
# 1 a 1 FALSE
# 2 b 2 FALSE
# 3 c 3 FALSE
# 4 d 4  TRUE
# 5 e 5  TRUE

Related functions include is.data.frame, which tests whether an object is a data.frame, and as.data.frame, which coerces many other data structures to data.frame (through S3 dispatch, see ?S3). Several R packages extend or modify the base data.frame to create new data structures. For further reading, see the paragraph on data frames in the CRAN manual An Introduction to R.


DataFrame in Python's pandas library

The pandas library in Python is the canonical tabular data framework on the SciPy stack, and the DataFrame is its two-dimensional data object. It is basically a rectangular array like a 2D numpy ndarray, but with associated indices on each axis which can be used for alignment. As in R, from an implementation perspective, columns are somewhat prioritized over rows: the DataFrame resembles a dictionary with column names as keys and Series (pandas' one-dimensional data structure) as values.

After importing numpy and pandas under the usual aliases (import numpy as np, import pandas as pd), we can construct a DataFrame in several ways, such as passing a dictionary of column names and values:

>>> pd.DataFrame({"x": list("abcde"), "y": range(1,6), "z": np.arange(1,6) > 3})
   x  y      z
0  a  1  False
1  b  2  False
2  c  3  False
3  d  4   True
4  e  5   True
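
The dictionary-like structure and index alignment described above can be illustrated with a minimal sketch. The Series names a and b and their labels are made up for this example and do not come from any particular dataset:

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the example above.
df = pd.DataFrame(
    {"x": list("abcde"), "y": range(1, 6), "z": np.arange(1, 6) > 3}
)

# Dictionary-like access: a column name maps to a Series.
y = df["y"]
print(type(y).__name__)  # Series
print(list(df.keys()))   # ['x', 'y', 'z']

# Index alignment: arithmetic matches values by label, not by position.
# Labels present in only one operand produce NaN in the result.
a = pd.Series([1, 2, 3], index=["p", "q", "r"])  # hypothetical data
b = pd.Series([10, 20, 30], index=["q", "r", "s"])  # hypothetical data
total = a + b
print(total)
```

Here only the labels q and r appear in both Series, so only those two positions of total hold sums; p and s are missing values.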

DataFrame in Apache Spark

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. (source)


DataFrame in Maple

A DataFrame is one of the basic data structures in Maple. A data frame is a list of variables, known as DataSeries, which are displayed in a rectangular grid. Every column (variable) in a DataFrame has the same length, but each variable can have a different type, such as integer, float, string, name, or boolean.

When printed, data frames resemble matrices in that they are viewed as a rectangular grid, but a key difference is that the first row corresponds to the column (variable) names, and the first column corresponds to the row (individual) names. This row and column are treated as header meta-information and are not part of the data. Moreover, the data stored in a DataFrame can be accessed using these header names, as well as by the standard numbered index. For more details, see the Guide to DataFrames in the online Maple Programming Help.

349 questions
196
votes
2 answers

Difference between isna() and isnull() in pandas

I have been using pandas for quite some time. But, I don't understand what's the difference between isna() and isnull(). And, more importantly, which one to use when identifying missing values in a dataframe. What is the basic underlying difference…
Vaibhav Thakur
  • 2,403
  • 3
  • 13
  • 9
55
votes
9 answers

How do I compare columns in different data frames?

I would like to compare one column of a df with other df's. The columns are names and last names. I'd like to check if a person in one data frame is in another one.
a_a_a
  • 837
  • 2
  • 8
  • 11
27
votes
3 answers

How to sum values grouped by two columns in pandas

I have a Pandas DataFrame like this: df = pd.DataFrame({ 'Date': ['2017-1-1', '2017-1-1', '2017-1-2', '2017-1-2', '2017-1-3'], 'Groups': ['one', 'one', 'one', 'two', 'two'], 'data': range(1, 6)}) Date Groups data 0 …
Kevin
  • 543
  • 2
  • 5
  • 12
18
votes
2 answers

How to plot two columns of single DataFrame on Y axis

I have two data frames (Action, Comedy). Action contains two columns (year, rating) ratings columns contains average rating with respect to year. The Comedy data frame contains the same two columns with different mean values. I merged both data…
Bilal Butt
  • 291
  • 1
  • 2
  • 4
18
votes
4 answers

One hot encoding alternatives for large categorical values

I have a data frame with large categorical values over 1600 categories. Is there any way I can find alternatives so that I don't have over 1600 columns? I found this interesting link. But they are converting to class/object which I don't want. I…
17
votes
2 answers

How to remove rows from a dataframe that are identical to another dataframe?

I have two data frames df1 and df2. For my analysis, I need to remove rows from df1 that have identical column values (Email) in df2? >>df1 First Last Email 0 Adam Smith email@email.com 1 John Brown email2@email.com 2 Joe Max …
a_a_a
  • 837
  • 2
  • 8
  • 11
14
votes
3 answers

after grouping to minimum value in pandas, how to display the matching row result entirely along min() value

The dataframe contains >> df A B C A 196512 196512 1325 12.9010511000000 196512 196512 114569 12.9267705000000 196512 196512 118910 12.8983353775637 196512 196512 100688 12.9505091000000 196795 196795 …
Sam Joe
  • 173
  • 1
  • 1
  • 10
13
votes
2 answers

Delete/Drop only the rows which has all values as NaN in pandas

I have a Dataframe, i need to drop the rows which has all the values as NaN. ID Age Gender 601 21 M 501 NaN F NaN NaN NaN The resulting data frame should look like. Id Age Gender 601 21 M 501 …
Harshith
  • 303
  • 2
  • 5
  • 16
13
votes
5 answers

How to Write Multiple Data Frames in an Excel Sheet

I have multiple data frames with same column names. I want to write them together to an excel sheet stacked vertically on top of each other. And between each, there will be a text occupying a row. This is what I have in mind. I tried the…
Della
  • 345
  • 1
  • 3
  • 9
13
votes
2 answers

dataframe.columns.difference() use

I am trying to find the working of dataframe.columns.difference() but couldn't find a satisfactory explanation about it. Can anyone explain the working of this method in detail?
Parth S.
  • 133
  • 1
  • 1
  • 6
13
votes
3 answers

Find the consecutive zeros in a DataFrame and do a conditional replacement

I have a dataset like this: Sample Dataframe import pandas as pd df = pd.DataFrame({ 'names': ['A','B','C','D','E','F','G','H','I','J','K','L'], 'col1': [0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0], 'col2': [0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
Kevin
  • 543
  • 2
  • 5
  • 12
11
votes
2 answers

Pandas merge column duplicate and sum value

How to merge duplicate column and sum their value? What I have A 30 A 40 B 50 What I need A 70 B 50 DF for this example d = {'address': ["A", "A", "B"], 'balances': [30, 40, 50]} df = pd.DataFrame(data=d) df
11
votes
2 answers

How to rename columns that have the same name?

I would like to rename the column names, but the Data Frame contains similar column names. How do I rename them? df.columns Output: Index([ 'Goods', 'Durable goods','Services','Exports', 'Goods', 'Services', 'Imports', 'Goods',…
Antony Naveen
  • 121
  • 1
  • 1
  • 5
10
votes
1 answer

How to find the count of consecutive same string values in a pandas dataframe?

Assume that we have the following pandas dataframe: df = pd.DataFrame({'col1':['A>G','C>T','C>T','G>T','C>T', 'A>G','A>G','A>G'],'col2':['TCT','ACA','TCA','TCA','GCT', 'ACT','CTG','ATG'],…
burcak
  • 203
  • 1
  • 2
  • 4
10
votes
2 answers

Mapping column values of one DataFrame to another DataFrame using a key with different header names

I have two data frames df1 and df2 which look something like this. cat1 cat2 cat3 0 10 25 12 1 11 22 14 2 12 30 15 all_cats cat_codes 0 10 A 1 11 B 2 12 C 3 25 …
Danny
  • 1,166
  • 1
  • 8
  • 16