pyspark.pandas.DataFrame

class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)[source]

pandas-on-Spark DataFrame that corresponds logically to a pandas DataFrame. It holds a Spark DataFrame internally.

Variables

_internal – an internal immutable Frame to manage metadata.

Parameters
data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame,
Spark DataFrame, pandas-on-Spark DataFrame or pandas-on-Spark Series

Dict can contain Series, arrays, constants, or list-like objects.

index : Index or array-like

Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.

columns : Index or array-like

Column labels to use for the resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided.

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer.

copy : boolean, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input.

.. versionchanged:: 3.4.0

Since 3.4.0, data and index are handled as follows:

1. When data is a distributed dataset (internal DataFrame, Spark DataFrame, pandas-on-Spark DataFrame, or pandas-on-Spark Series), it first parallelizes the index if necessary, and then tries to combine the data and index. Note that if data and index do not have the same anchor, compute.ops_on_diff_frames must be turned on.

2. When data is a local dataset (pandas DataFrame, numpy ndarray, list, etc.), it first collects the index to the driver if necessary, and then applies the pandas.DataFrame(…) creation internally.

Examples

Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4

Constructing DataFrame from pandas DataFrame:

>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> import numpy as np
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...     columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
0  1  2  3  4  5
1  6  7  8  9  0

Constructing DataFrame from numpy ndarray with pandas index:

>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...     index=pd.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0

Constructing DataFrame from numpy ndarray with pandas-on-Spark index:

>>> import numpy as np
>>> import pandas as pd
>>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...     index=ps.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
1  1  2  3  4  5
4  6  7  8  9  0

Constructing DataFrame from pandas DataFrame with pandas index:

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...     columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=pd.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN

Constructing DataFrame from pandas DataFrame with pandas-on-Spark index:

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
...     columns=['a', 'b', 'c', 'd', 'e'])
>>> ps.DataFrame(data=pdf, index=ps.Index([1, 4]))
     a    b    c    d    e
1  6.0  7.0  8.0  9.0  0.0
4  NaN  NaN  NaN  NaN  NaN

Constructing DataFrame from Spark DataFrame with pandas index:

>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.

Enable ‘compute.ops_on_diff_frames’ to combine a Spark DataFrame and a pandas index:

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=pd.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN

Constructing DataFrame from Spark DataFrame with pandas-on-Spark index:

>>> import pandas as pd
>>> sdf = spark.createDataFrame([("Data", 1), ("Bricks", 2)], ["x", "y"])
>>> ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
Traceback (most recent call last):
  ...
ValueError: Cannot combine the series or dataframe...'compute.ops_on_diff_frames' option.

Enable ‘compute.ops_on_diff_frames’ to combine a Spark DataFrame and a pandas-on-Spark index:

>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.DataFrame(data=sdf, index=ps.Index([0, 1, 2]))
        x    y
0    Data  1.0
1  Bricks  2.0
2    None  NaN

Methods

abs()

Return a Series/DataFrame with absolute numeric value of each element.

add(other)

Get Addition of dataframe and other, element-wise (binary operator +).

add_prefix(prefix)

Prefix labels with string prefix.

add_suffix(suffix)

Suffix labels with string suffix.

agg(func)

Aggregate using one or more operations over the specified axis.

aggregate(func)

Aggregate using one or more operations over the specified axis.

align(other[, join, axis, copy])

Align two objects on their axes with the specified join method.

all([axis, bool_only, skipna])

Return whether all elements are True.

any([axis, bool_only])

Return whether any element is True.

append(other[, ignore_index, …])

Append rows of other to the end of caller, returning a new object.

apply(func[, axis, args])

Apply a function along an axis of the DataFrame.

applymap(func)

Apply a function to a Dataframe elementwise.

assign(**kwargs)

Assign new columns to a DataFrame.
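
pandas-on-Spark mirrors the pandas `assign` semantics; a minimal sketch using plain pandas to illustrate (the column names and values are made up for illustration):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'temp_c': [0.0, 100.0]}, index=['freezing', 'boiling'])

# assign returns a new frame; callables receive the frame being assigned to,
# so derived columns can be computed without mutating df in place
out = df.assign(temp_f=lambda d: d.temp_c * 9 / 5 + 32)
```

The original frame is left untouched; `out` carries the extra `temp_f` column.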

astype(dtype)

Cast a pandas-on-Spark object to a specified dtype dtype.

at_time(time[, asof, axis])

Select values at particular time of day (example: 9:30AM).

backfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

between_time(start_time, end_time[, …])

Select values between particular times of the day (example: 9:00-9:30 AM).

bfill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

bool()

Return the bool of a single element in the current object.

boxplot(**kwds)

Make a box plot of the Series columns.

clip([lower, upper])

Trim values at input threshold(s).

combine_first(other)

Update null elements with value in the same location in other.

copy([deep])

Make a copy of this object’s indices and data.

corr([method, min_periods])

Compute pairwise correlation of columns, excluding NA/null values.

corrwith(other[, axis, drop, method])

Compute pairwise correlation.

count([axis, numeric_only])

Count non-NA cells for each column.

cov([min_periods, ddof])

Compute pairwise covariance of columns, excluding NA/null values.

cummax([skipna])

Return cumulative maximum over a DataFrame or Series axis.

cummin([skipna])

Return cumulative minimum over a DataFrame or Series axis.

cumprod([skipna])

Return cumulative product over a DataFrame or Series axis.

cumsum([skipna])

Return cumulative sum over a DataFrame or Series axis.

describe([percentiles])

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

diff([periods, axis])

First discrete difference of element.

div(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

divide(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

dot(other)

Compute the matrix multiplication between the DataFrame and others.

drop([labels, axis, index, columns])

Drop specified labels from columns.

drop_duplicates([subset, keep, inplace, …])

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

droplevel(level[, axis])

Return DataFrame with requested index / column level(s) removed.

dropna([axis, how, thresh, subset, inplace])

Remove missing values.

duplicated([subset, keep])

Return boolean Series denoting duplicate rows, optionally only considering certain columns.
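
The two deduplication APIs above follow pandas semantics; a small sketch with plain pandas (toy columns `a` and `b`):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})

# duplicated flags rows that repeat an earlier row across all columns
dups = df.duplicated()

# drop_duplicates keeps the first occurrence of each row by default
deduped = df.drop_duplicates()
```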

eq(other)

Compare if the current value is equal to the other.

equals(other)

Compare if the current value is equal to the other.

eval(expr[, inplace])

Evaluate a string describing operations on DataFrame columns.

ewm([com, span, halflife, alpha, …])

Provide exponentially weighted window transformations.

expanding([min_periods])

Provide expanding transformations.

explode(column[, ignore_index])

Transform each element of a list-like to a row, replicating index values.

ffill([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

fillna([value, method, axis, inplace, limit])

Fill NA/NaN values.
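
A minimal `fillna` sketch using plain pandas, whose semantics pandas-on-Spark mirrors (the column name is made up):

```python
import numpy as np
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

# replace NaNs with a scalar; method='ffill' would instead propagate
# the last valid observation forward
filled = df.fillna(0)
```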

filter([items, like, regex, axis])

Subset rows or columns of dataframe according to labels in the specified index.

first(offset)

Select first periods of time series data based on a date offset.

first_valid_index()

Retrieves the index of the first valid value.

floordiv(other)

Get Integer division of dataframe and other, element-wise (binary operator //).

from_dict(data[, orient, dtype, columns])

Construct DataFrame from dict of array-like or dicts.

from_records(data[, index, exclude, …])

Convert structured or recorded ndarray to DataFrame.

ge(other)

Compare if the current value is greater than or equal to the other.

get(key[, default])

Get item from object for given key (DataFrame column, Panel slice, etc.).

get_dtype_counts()

Return counts of unique dtypes in this object.

groupby(by[, axis, as_index, dropna])

Group DataFrame or Series using one or more columns.
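
Grouping follows the pandas split-apply-combine pattern; a minimal sketch with plain pandas (toy columns `g` and `v`):

```python
import pandas as pd  # pandas-on-Spark follows the same grouping API

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 10]})

# split rows by the 'g' column, then sum 'v' within each group
sums = df.groupby('g')['v'].sum()
```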

gt(other)

Compare if the current value is greater than the other.

head([n])

Return the first n rows.

hist([bins])

Draw one histogram of the DataFrame’s columns.

idxmax([axis])

Return index of first occurrence of maximum over requested axis.

idxmin([axis])

Return index of first occurrence of minimum over requested axis.

info([verbose, buf, max_cols])

Print a concise summary of a DataFrame.

insert(loc, column, value[, allow_duplicates])

Insert column into DataFrame at specified location.

interpolate([method, limit, …])

Fill NaN values using an interpolation method.

isin(values)

Whether each element in the DataFrame is contained in values.
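
`isin` is an elementwise membership test; a quick sketch in plain pandas (toy data):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# True wherever the cell value appears in the given list
mask = df.isin([1, 4])
```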

isna()

Detects missing values for items in the current Dataframe.

isnull()

Detects missing values for items in the current Dataframe.

items()

Iterator over (column name, Series) pairs.

iteritems()

This is an alias of items.

iterrows()

Iterate over DataFrame rows as (index, Series) pairs.

itertuples([index, name])

Iterate over DataFrame rows as namedtuples.

join(right[, on, how, lsuffix, rsuffix])

Join columns of another DataFrame.

kde([bw_method, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

keys()

Return alias for columns.

kurt([axis, skipna, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

kurtosis([axis, skipna, numeric_only])

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

last(offset)

Select final periods of time series data based on a date offset.

last_valid_index()

Return index for last non-NA/null value.

le(other)

Compare if the current value is less than or equal to the other.

lt(other)

Compare if the current value is less than the other.

mad([axis])

Return the mean absolute deviation of values.

mask(cond[, other])

Replace values where the condition is True.

max([axis, skipna, numeric_only])

Return the maximum of the values.

mean([axis, skipna, numeric_only])

Return the mean of the values.

median([axis, skipna, numeric_only, accuracy])

Return the median of the values for the requested axis.

melt([id_vars, value_vars, var_name, value_name])

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
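
A minimal wide-to-long `melt` sketch using plain pandas, whose semantics pandas-on-Spark mirrors (column names `id`, `A`, `B` and the `metric`/`val` output names are made up):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

wide = pd.DataFrame({'id': [1, 2], 'A': [10, 20], 'B': [30, 40]})

# unpivot columns 'A' and 'B' into (metric, val) pairs, repeating 'id'
long = wide.melt(id_vars='id', value_vars=['A', 'B'],
                 var_name='metric', value_name='val')
```

Each of the two value columns contributes one row per original row, so the result has 4 rows and 3 columns.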

merge(right[, how, on, left_on, right_on, …])

Merge DataFrame objects with a database-style join.
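
A minimal database-style join sketch using plain pandas (the `key`, `x`, `y` columns are made up for illustration):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

# an inner join keeps only keys present on both sides ('b' here)
inner = left.merge(right, on='key', how='inner')
```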

min([axis, skipna, numeric_only])

Return the minimum of the values.

mod(other)

Get Modulo of dataframe and other, element-wise (binary operator %).

mode([axis, numeric_only, dropna])

Get the mode(s) of each element along the selected axis.

mul(other)

Get Multiplication of dataframe and other, element-wise (binary operator *).

multiply(other)

Get Multiplication of dataframe and other, element-wise (binary operator *).

ne(other)

Compare if the current value is not equal to the other.

nlargest(n, columns[, keep])

Return the first n rows ordered by columns in descending order.

notna()

Detects non-missing values for items in the current Dataframe.

notnull()

Detects non-missing values for items in the current Dataframe.

nsmallest(n, columns[, keep])

Return the first n rows ordered by columns in ascending order.

nunique([axis, dropna, approx, rsd])

Return number of unique elements in the object.

pad([axis, inplace, limit])

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

pct_change([periods])

Percentage change between the current and a prior element.

pipe(func, *args, **kwargs)

Apply func(self, *args, **kwargs).

pivot([index, columns, values])

Return reshaped DataFrame organized by given index / column values.

pivot_table([values, index, columns, …])

Create a spreadsheet-style pivot table as a DataFrame.
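
A minimal pivot-table sketch using plain pandas, whose semantics pandas-on-Spark mirrors (the `city`/`year`/`amount` columns are invented for illustration):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

sales = pd.DataFrame({'city': ['NY', 'NY', 'SF'],
                      'year': [2020, 2021, 2020],
                      'amount': [5, 7, 3]})

# one row per city, one column per year; amounts in each cell are summed
pt = sales.pivot_table(values='amount', index='city',
                       columns='year', aggfunc='sum')
```

Cells with no matching (city, year) pair, such as SF in 2021, come out as NaN.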

pop(item)

Return item and drop from frame.

pow(other)

Get Exponential power of series of dataframe and other, element-wise (binary operator **).

prod([axis, skipna, numeric_only, min_count])

Return the product of the values.

product([axis, skipna, numeric_only, min_count])

Return the product of the values.

quantile([q, axis, numeric_only, accuracy])

Return value at the given quantile.

query(expr[, inplace])

Query the columns of a DataFrame with a boolean expression.
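
`query` evaluates a string expression against the columns by name; a short sketch in plain pandas (toy columns `a` and `b`):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# keep rows satisfying the boolean expression; columns are referenced by name
sub = df.query('a > 1 and b < 30')
```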

radd(other)

Get Addition of dataframe and other, element-wise (binary operator +).

rank([method, ascending, numeric_only])

Compute numerical data ranks (1 through n) along axis.

rdiv(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

reindex([labels, index, columns, axis, …])

Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.

reindex_like(other[, copy])

Return a DataFrame with matching indices as other object.

rename([mapper, index, columns, axis, …])

Alter axes labels.

rename_axis([mapper, index, columns, axis, …])

Set the name of the axis for the index or columns.

replace([to_replace, value, inplace, limit, …])

Returns a new DataFrame replacing a value with another value.

resample(rule[, closed, label, on])

Resample time-series data.

reset_index([level, drop, inplace, …])

Reset the index, or a level of it.

rfloordiv(other)

Get Integer division of dataframe and other, element-wise (binary operator //).

rmod(other)

Get Modulo of dataframe and other, element-wise (binary operator %).

rmul(other)

Get Multiplication of dataframe and other, element-wise (binary operator *).

rolling(window[, min_periods])

Provide rolling transformations.

round([decimals])

Round a DataFrame to a variable number of decimal places.

rpow(other)

Get Exponential power of dataframe and other, element-wise (binary operator **).

rsub(other)

Get Subtraction of dataframe and other, element-wise (binary operator -).

rtruediv(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

sample([n, frac, replace, random_state, …])

Return a random sample of items from an axis of object.

select_dtypes([include, exclude])

Return a subset of the DataFrame’s columns based on the column dtypes.

sem([axis, skipna, ddof, numeric_only])

Return unbiased standard error of the mean over requested axis.

set_index(keys[, drop, append, inplace])

Set the DataFrame index (row labels) using one or more existing columns.
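
A minimal `set_index` sketch using plain pandas (the `k` and `v` columns are made up):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})

# promote column 'k' to the row index; by default it is dropped
# from the columns (drop=True)
indexed = df.set_index('k')
```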

shift([periods, fill_value])

Shift DataFrame by desired number of periods.

skew([axis, skipna, numeric_only])

Return unbiased skew normalized by N-1.

sort_index([axis, level, ascending, …])

Sort object by labels (along an axis).

sort_values(by[, ascending, inplace, …])

Sort by the values along either axis.

squeeze([axis])

Squeeze 1 dimensional axis objects into scalars.

stack()

Stack the prescribed level(s) from columns to index.
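
`stack` and its inverse `unstack` (listed further below) move labels between the column and row axes; a round-trip sketch in plain pandas (toy data):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])

# stack moves the column labels into an inner index level,
# producing a Series with a (row, column) MultiIndex ...
stacked = df.stack()

# ... and unstack moves them back, restoring the original layout
restored = stacked.unstack()
```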

std([axis, skipna, ddof, numeric_only])

Return sample standard deviation.

sub(other)

Get Subtraction of dataframe and other, element-wise (binary operator -).

subtract(other)

Get Subtraction of dataframe and other, element-wise (binary operator -).

sum([axis, skipna, numeric_only, min_count])

Return the sum of the values.

swapaxes(i, j[, copy])

Interchange axes and swap values axes appropriately.

swaplevel([i, j, axis])

Swap levels i and j in a MultiIndex on a particular axis.

tail([n])

Return the last n rows.

take(indices[, axis])

Return the elements in the given positional indices along an axis.

to_clipboard([excel, sep])

Copy object to the system clipboard.

to_csv([path, sep, na_rep, columns, header, …])

Write object to a comma-separated values (csv) file.

to_delta(path[, mode, partition_cols, index_col])

Write the DataFrame out as a Delta Lake table.

to_dict([orient, into])

Convert the DataFrame to a dictionary.
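
A minimal `to_dict` sketch using plain pandas; the `orient` parameter controls the shape of the result (toy data):

```python
import pandas as pd  # pandas-on-Spark follows the same API shown here

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# orient='list' maps each column name to a plain list of its values
d = df.to_dict(orient='list')
```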

to_excel(excel_writer[, sheet_name, na_rep, …])

Write object to an Excel sheet.

to_html([buf, columns, col_space, header, …])

Render a DataFrame as an HTML table.

to_json([path, compression, num_files, …])

Convert the object to a JSON string.

to_latex([buf, columns, col_space, header, …])

Render an object to a LaTeX tabular environment table.

to_markdown([buf, mode])

Print Series or DataFrame in Markdown-friendly format.

to_numpy()

A NumPy ndarray representing the values in this DataFrame or Series.

to_orc(path[, mode, partition_cols, index_col])

Write a DataFrame to the ORC format.

to_pandas()

Return a pandas DataFrame.

to_parquet(path[, mode, partition_cols, …])

Write the DataFrame out as a Parquet file or directory.

to_records([index, column_dtypes, index_dtypes])

Convert DataFrame to a NumPy record array.

to_spark([index_col])

Spark related features.

to_spark_io([path, format, mode, …])

Write the DataFrame out to a Spark data source.

to_string([buf, columns, col_space, header, …])

Render a DataFrame to a console-friendly tabular output.

to_table(name[, format, mode, …])

Write the DataFrame into a Spark table.

transform(func[, axis])

Call func on self producing a Series with transformed values and that has the same length as its input.

transpose()

Transpose index and columns.

truediv(other)

Get Floating division of dataframe and other, element-wise (binary operator /).

truncate([before, after, axis, copy])

Truncate a Series or DataFrame before and after some index value.

unstack()

Pivot the (necessarily hierarchical) index labels.

update(other[, join, overwrite])

Modify in place using non-NA values from another DataFrame.

var([axis, ddof, numeric_only])

Return unbiased variance.

where(cond[, other, axis])

Replace values where the condition is False.

xs(key[, axis, level])

Return cross-section from the DataFrame.

Attributes

T

Transpose index and columns.

at

Access a single value for a row/column label pair.

axes

Return a list representing the axes of the DataFrame.

columns

The column labels of the DataFrame.

dtypes

Return the dtypes in the DataFrame.

empty

Returns true if the current DataFrame is empty.

iat

Access a single value for a row/column pair by integer position.

iloc

Purely integer-location based indexing for selection by position.

index

The index (row labels) of the DataFrame.

loc

Access a group of rows and columns by label(s) or a boolean Series.

ndim

Return an int representing the number of array dimensions.

shape

Return a tuple representing the dimensionality of the DataFrame.

size

Return an int representing the number of elements in this object.

style

Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.

values

Return a Numpy representation of the DataFrame or the Series.