pyspark.pandas.Series.diff

Series.diff(periods=1)

First discrete difference of each element.

Calculates the difference between each Series element and another element in the same Series (by default, the element in the previous row).

Note

The current implementation of diff uses Spark's Window without a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method with very large datasets.

Parameters

periods : int, default 1
    Periods to shift for calculating the difference; accepts negative values.

Returns

diffed : Series

Examples

>>> df = ps.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]}, columns=['a', 'b', 'c'])
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36

>>> df.b.diff()
0    NaN
1    0.0
2    1.0
3    1.0
4    2.0
5    3.0
Name: b, dtype: float64

Difference with the value 3 rows earlier

>>> df.c.diff(periods=3)
0     NaN
1     NaN
2     NaN
3    15.0
4    21.0
5    27.0
Name: c, dtype: float64

Difference with following value

>>> df.c.diff(periods=-1)
0    -3.0
1    -5.0
2    -7.0
3    -9.0
4   -11.0
5     NaN
Name: c, dtype: float64
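
The arithmetic behind the examples above can be checked with a minimal plain-Python sketch. It is only an illustration of the semantics, not how pandas-on-Spark implements diff: the helper name diff_sketch is hypothetical, it works on an ordinary list rather than a Series, and None stands in for NaN.

```python
from typing import List, Optional, Sequence


def diff_sketch(values: Sequence[float], periods: int = 1) -> List[Optional[float]]:
    """Hypothetical helper: values[i] - values[i - periods], or None if that row does not exist."""
    result: List[Optional[float]] = []
    for i, current in enumerate(values):
        j = i - periods  # index of the row to compare against
        if 0 <= j < len(values):
            result.append(float(current - values[j]))
        else:
            result.append(None)  # stands in for NaN in the Series output
    return result


# Same data as columns 'b' and 'c' in the examples above.
b = [1, 1, 2, 3, 5, 8]
c = [1, 4, 9, 16, 25, 36]

print(diff_sketch(b))              # [None, 0.0, 1.0, 1.0, 2.0, 3.0]
print(diff_sketch(c, periods=3))   # [None, None, None, 15.0, 21.0, 27.0]
print(diff_sketch(c, periods=-1))  # [-3.0, -5.0, -7.0, -9.0, -11.0, None]
```

A positive periods compares each row with an earlier row, so the leading rows have no partner; a negative periods compares with a later row, so the trailing rows have none.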