pyspark.sql.functions.sum_distinct
pyspark.sql.functions.sum_distinct(col)
Aggregate function: returns the sum of distinct values in the expression.
New in version 3.2.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
col : Column or str
    target column to compute on.
Returns
Column
    the column for computed results.
Examples
Example 1: Using sum_distinct function on a column with all distinct values
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["numbers"])
>>> df.select(sf.sum_distinct('numbers')).show()
+---------------------+
|sum(DISTINCT numbers)|
+---------------------+
|                   10|
+---------------------+
Example 2: Using sum_distinct function on a column with all duplicate values
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(1,), (1,), (1,), (1,)], ["numbers"])
>>> df.select(sf.sum_distinct('numbers')).show()
+---------------------+
|sum(DISTINCT numbers)|
+---------------------+
|                    1|
+---------------------+
Example 3: Using sum_distinct function on a column with null and duplicate values
>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(None,), (1,), (1,), (2,)], ["numbers"])
>>> df.select(sf.sum_distinct('numbers')).show()
+---------------------+
|sum(DISTINCT numbers)|
+---------------------+
|                    3|
+---------------------+
Example 4: Using sum_distinct function on a column with all None values
>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.types import StructType, StructField, IntegerType
>>> schema = StructType([StructField("numbers", IntegerType(), True)])
>>> df = spark.createDataFrame([(None,), (None,), (None,), (None,)], schema=schema)
>>> df.select(sf.sum_distinct('numbers')).show()
+---------------------+
|sum(DISTINCT numbers)|
+---------------------+
|                 NULL|
+---------------------+