pyspark.sql.functions.approx_count_distinct#

pyspark.sql.functions.approx_count_distinct(col, rsd=None)[source]#

This aggregate function returns a new Column that estimates the number of distinct elements in a specified column or a group of columns.

New in version 2.1.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
col : Column or str

The label of the column to count distinct values in.

rsd : float, optional

The maximum allowed relative standard deviation (default = 0.05). If rsd < 0.01, it would be more efficient to use count_distinct().

Returns
Column

A new Column object representing the approximate distinct count.

Examples

Example 1: Counting distinct values in a single column DataFrame representing integers

>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.createDataFrame([1,2,2,3], "int")
>>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
+---------------+
|distinct_values|
+---------------+
|              3|
+---------------+

Example 2: Counting distinct values in a single column DataFrame representing strings

>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])
>>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+---------------+
|distinct_fruits|
+---------------+
|              3|
+---------------+

Example 3: Counting distinct values in a DataFrame with multiple columns

>>> from pyspark.sql.functions import approx_count_distinct, struct
>>> df = spark.createDataFrame([("Alice", 1),
...                             ("Alice", 2),
...                             ("Bob", 3),
...                             ("Bob", 3)], ["name", "value"])
>>> df = df.withColumn("combined", struct("name", "value"))
>>> df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()
+--------------+
|distinct_pairs|
+--------------+
|             3|
+--------------+

Example 4: Counting distinct values with a specified relative standard deviation

>>> from pyspark.sql.functions import approx_count_distinct
>>> df = spark.range(100000)
>>> df.agg(approx_count_distinct("id").alias('with_default_rsd'),
...        approx_count_distinct("id", 0.1).alias('with_rsd_0.1')).show()
+----------------+------------+
|with_default_rsd|with_rsd_0.1|
+----------------+------------+
|           95546|      102065|
+----------------+------------+