pyspark.sql.GroupedData.pivot
- GroupedData.pivot(pivot_col, values=None)
Pivots a column of the current DataFrame and performs the specified aggregation.

New in version 1.6.0.

Changed in version 3.4.0: Supports Spark Connect.
- Parameters
  - pivot_col : str
    Name of the column to pivot.
  - values : list, optional
    List of values that will be translated to columns in the output DataFrame.
    If values is not provided, Spark will eagerly compute the distinct values in pivot_col so it can determine the resulting schema of the transformation. To avoid any eager computation, provide an explicit list of values.
Examples
>>> from pyspark.sql import Row
>>> df1 = spark.createDataFrame([
...     Row(course="dotNET", year=2012, earnings=10000),
...     Row(course="Java", year=2012, earnings=20000),
...     Row(course="dotNET", year=2012, earnings=5000),
...     Row(course="dotNET", year=2013, earnings=48000),
...     Row(course="Java", year=2013, earnings=30000),
... ])
>>> df1.show()
+------+----+--------+
|course|year|earnings|
+------+----+--------+
|dotNET|2012|   10000|
|  Java|2012|   20000|
|dotNET|2012|    5000|
|dotNET|2013|   48000|
|  Java|2013|   30000|
+------+----+--------+
>>> df2 = spark.createDataFrame([
...     Row(training="expert", sales=Row(course="dotNET", year=2012, earnings=10000)),
...     Row(training="junior", sales=Row(course="Java", year=2012, earnings=20000)),
...     Row(training="expert", sales=Row(course="dotNET", year=2012, earnings=5000)),
...     Row(training="junior", sales=Row(course="dotNET", year=2013, earnings=48000)),
...     Row(training="expert", sales=Row(course="Java", year=2013, earnings=30000)),
... ])
>>> df2.show()
+--------+--------------------+
|training|               sales|
+--------+--------------------+
|  expert|{dotNET, 2012, 10...|
|  junior| {Java, 2012, 20000}|
|  expert|{dotNET, 2012, 5000}|
|  junior|{dotNET, 2013, 48...|
|  expert| {Java, 2013, 30000}|
+--------+--------------------+
Compute the sum of earnings for each year by course, with each course as a separate column:
>>> df1.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").show()
+----+------+-----+
|year|dotNET| Java|
+----+------+-----+
|2012| 15000|20000|
|2013| 48000|30000|
+----+------+-----+
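A value listed in values does not have to occur in the data; any listed value that is absent still becomes a column, filled with nulls. A minimal sketch reusing df1, where "Scala" is a hypothetical course missing from the data (the exact rendering of null in show() varies across Spark versions):

>>> df1.groupBy("year").pivot("course", ["dotNET", "Scala"]).sum("earnings").show()
+----+------+-----+
|year|dotNET|Scala|
+----+------+-----+
|2012| 15000| NULL|
|2013| 48000| NULL|
+----+------+-----+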
Or without specifying column values (less efficient):
>>> df1.groupBy("year").pivot("course").sum("earnings").show()
+----+-----+------+
|year| Java|dotNET|
+----+-----+------+
|2012|20000| 15000|
|2013|30000| 48000|
+----+-----+------+
>>> df2.groupBy("sales.year").pivot("sales.course").sum("sales.earnings").show()
+----+-----+------+
|year| Java|dotNET|
+----+-----+------+
|2012|20000| 15000|
|2013|30000| 48000|
+----+-----+------+
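Because pivot returns a GroupedData, it also composes with GroupedData.agg, so several aggregations can be computed per pivoted value in one pass; Spark names the resulting columns by appending each aggregation's alias to the pivot value. A short sketch reusing df1 (the aliases s and m are illustrative, not part of the API):

>>> from pyspark.sql import functions as sf
>>> df1.groupBy("year").pivot("course").agg(
...     sf.sum("earnings").alias("s"), sf.min("earnings").alias("m")).show()
+----+------+------+--------+--------+
|year|Java_s|Java_m|dotNET_s|dotNET_m|
+----+------+------+--------+--------+
|2012| 20000| 20000|   15000|    5000|
|2013| 30000| 30000|   48000|   48000|
+----+------+------+--------+--------+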