pyspark.sql.GroupedData.pivot#

GroupedData.pivot(pivot_col, values=None)[source]#

Pivots a column of the current DataFrame and performs the specified aggregation.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters

pivot_colstr

Name of the column to pivot.

valueslist, optional

List of values that will be translated to columns in the output DataFrame.

If values is not provided, Spark will eagerly compute the distinct values in pivot_col so it can determine the resulting schema of the transformation. To avoid any eager computations, provide an explicit list of values.

Examples

>>> from pyspark.sql import Row
>>> df1 = spark.createDataFrame([
...     Row(course="dotNET", year=2012, earnings=10000),
...     Row(course="Java", year=2012, earnings=20000),
...     Row(course="dotNET", year=2012, earnings=5000),
...     Row(course="dotNET", year=2013, earnings=48000),
...     Row(course="Java", year=2013, earnings=30000),
... ])
>>> df1.show()
+------+----+--------+
|course|year|earnings|
+------+----+--------+
|dotNET|2012|   10000|
|  Java|2012|   20000|
|dotNET|2012|    5000|
|dotNET|2013|   48000|
|  Java|2013|   30000|
+------+----+--------+
>>> df2 = spark.createDataFrame([
...     Row(training="expert", sales=Row(course="dotNET", year=2012, earnings=10000)),
...     Row(training="junior", sales=Row(course="Java", year=2012, earnings=20000)),
...     Row(training="expert", sales=Row(course="dotNET", year=2012, earnings=5000)),
...     Row(training="junior", sales=Row(course="dotNET", year=2013, earnings=48000)),
...     Row(training="expert", sales=Row(course="Java", year=2013, earnings=30000)),
... ])  
>>> df2.show()  
+--------+--------------------+
|training|               sales|
+--------+--------------------+
|  expert|{dotNET, 2012, 10...|
|  junior| {Java, 2012, 20000}|
|  expert|{dotNET, 2012, 5000}|
|  junior|{dotNET, 2013, 48...|
|  expert| {Java, 2013, 30000}|
+--------+--------------------+

Compute the sum of earnings for each year by course with each course as a separate column

>>> df1.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").show()
+----+------+-----+
|year|dotNET| Java|
+----+------+-----+
|2012| 15000|20000|
|2013| 48000|30000|
+----+------+-----+

Or without specifying column values (less efficient)

>>> df1.groupBy("year").pivot("course").sum("earnings").show()
+----+-----+------+
|year| Java|dotNET|
+----+-----+------+
|2012|20000| 15000|
|2013|30000| 48000|
+----+-----+------+
>>> df2.groupBy("sales.year").pivot("sales.course").sum("sales.earnings").show()
... 
+----+-----+------+
|year| Java|dotNET|
+----+-----+------+
|2012|20000| 15000|
|2013|30000| 48000|
+----+-----+------+