pyspark.pandas.groupby.GroupBy.sum#

GroupBy.sum(numeric_only=False, min_count=0)[source]#

Compute the sum of group values.

New in version 3.3.0.

Parameters
numeric_onlybool, default False

Include only float, int, and boolean columns.

New in version 3.4.0.

Changed in version 4.0.0.

min_countint, default 0

The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 3.4.0.

Notes

There is a behavior difference between pandas-on-Spark and pandas:

  • when there is a non-numeric aggregation column, it will be ignored even if numeric_only is False.

Examples

>>> df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True],
...                    "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})
>>> df.groupby("A").sum().sort_index()
   B  C   D
A
1  1  6  ab
2  1  8  aa
>>> df.groupby("D").sum().sort_index()
   A  B   C
D
a  5  2  11
b  1  0   3
>>> df.groupby("D").sum(min_count=3).sort_index()
     A    B     C
D
a  5.0  2.0  11.0
b  NaN  NaN   NaN
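The examples above do not exercise numeric_only=True. Since pandas-on-Spark mirrors pandas semantics for this parameter, its effect (dropping non-numeric columns such as D) can be sketched with plain pandas; this is a minimal illustration using the same data, not pandas-on-Spark itself:

```python
import pandas as pd

# Same data as the examples above, built with plain pandas.
df = pd.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True],
                   "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})

# With numeric_only=True, only float, int, and boolean columns are
# aggregated; the string column D is dropped from the result.
result = df.groupby("A").sum(numeric_only=True).sort_index()
print(result)
```

With numeric_only left at its default of False, the string column D would instead appear in the result, as shown in the examples above.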