pyspark.sql.DataFrameWriter.partitionBy

DataFrameWriter.partitionBy(*cols)

Partitions the output by the given columns on the file system.

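If specified, the output is laid out on the file system similarly to Hive’s partitioning scheme: rows that share the same partition column values are written under the same subdirectory, named column=value.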
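The Hive-style layout can be illustrated with a minimal sketch in plain Python. It is not Spark's implementation: `partition_path` and `split_row` are hypothetical helpers introduced only for illustration, and the sketch assumes values need no escaping and are never null, cases that Spark handles itself.

```python
def partition_path(base, row, partition_cols):
    """Hypothetical helper: build a Hive-style directory path for one row."""
    # One "column=value" path segment per partition column, in the given order.
    segments = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base, *segments])


def split_row(row, partition_cols):
    """Hypothetical helper: return the columns that stay in the data file."""
    # Partition columns are encoded in the directory name, not in the file.
    return {k: v for k, v in row.items() if k not in partition_cols}


row = {"age": 100, "name": "Hyukjin Kwon"}
print(partition_path("out", row, ["name"]))
print(split_row(row, ["name"]))
```

Under these assumptions the row lands in a directory such as `out/name=Hyukjin Kwon`, and only `age` remains in the data file, which is why reading a single partition directory directly returns no `name` column.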

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
cols : str or list

Names of the columns to partition by.
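Because `cols` is documented as `str or list`, column names can presumably be passed either as separate arguments or as a single list. The sketch below shows one way such arguments could be normalized; `normalize_partition_cols` is a hypothetical helper written for illustration, not part of the PySpark API.

```python
def normalize_partition_cols(*cols):
    """Hypothetical helper: accept names as varargs or as one list/tuple."""
    # A single list or tuple argument is unpacked into its elements.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return list(cols[0])
    return list(cols)


# Both call styles yield the same ordered list of column names.
print(normalize_partition_cols("year", "month"))
print(normalize_partition_cols(["year", "month"]))
```

With this reading, `partitionBy("year", "month")` and `partitionBy(["year", "month"])` would describe the same partitioning, with directories nested in the order the columns are given.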

Examples

Write a DataFrame as Parquet files partitioned by the name column, then read it back.

>>> import tempfile
>>> import os
>>> with tempfile.TemporaryDirectory(prefix="partitionBy") as d:
...     # Write a DataFrame into a Parquet file in a partitioned manner.
...     spark.createDataFrame(
...         [{"age": 100, "name": "Hyukjin Kwon"}, {"age": 120, "name": "Ruifeng Zheng"}]
...     ).write.partitionBy("name").mode("overwrite").format("parquet").save(d)
...
...     # Read the Parquet file as a DataFrame.
...     spark.read.parquet(d).sort("age").show()
...
...     # Read one partition as a DataFrame.
...     spark.read.parquet(f"{d}{os.path.sep}name=Hyukjin Kwon").show()
+---+-------------+
|age|         name|
+---+-------------+
|100| Hyukjin Kwon|
|120|Ruifeng Zheng|
+---+-------------+
+---+
|age|
+---+
|100|
+---+