Spark 3.0.1 ScalaDoc < Back

Packages

package root

Definition Classes
root
package org

Definition Classes
root
package apache

Definition Classes
org
package spark
Core Spark functionality.
Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.
In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)] through implicit conversions.
Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.
Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.
Classes and methods marked with Developer API are intended for advanced users want to extend Spark through lower level interfaces. These are subject to changes or removal in minor releases.

Definition Classes
apache
package sql
Allows the execution of relational queries, including those expressed in SQL using Spark.
Allows the execution of relational queries, including those expressed in SQL using Spark.

Definition Classes
spark
package api
Contains API classes that are specific to a single language (i.e.
Contains API classes that are specific to a single language (i.e. Java).

Definition Classes
sql
package avro

Definition Classes
sql
package catalog

Definition Classes
sql
package connector

Definition Classes
sql
package expressions

Definition Classes
sql
package hive
Support for running Spark SQL queries using functionality from Apache Hive (does not require an existing Hive installation).
Support for running Spark SQL queries using functionality from Apache Hive (does not require an existing Hive installation). Supported Hive features include:
- Using HiveQL to express queries.
- Reading metadata from the Hive Metastore using HiveSerDes.
- Hive UDFs, UDAs, UDTs
Definition Classes
sql
package jdbc

Definition Classes
sql
package sources
A set of APIs for adding data sources to Spark SQL.
A set of APIs for adding data sources to Spark SQL.

Definition Classes
sql
package streaming

Definition Classes
sql
package types
Contains a type system for attributes produced by relations, including complex types like structs, arrays and maps.
Contains a type system for attributes produced by relations, including complex types like structs, arrays and maps.

Definition Classes
sql
package util

Definition Classes
sql
package vectorized

Definition Classes
sql
AnalysisException
Column
ColumnName
CreateTableWriter
DataFrameNaFunctions
DataFrameReader
DataFrameStatFunctions
DataFrameWriter
DataFrameWriterV2
Dataset
DatasetHolder
Encoder
Encoders
ExperimentalMethods
ForeachWriter
KeyValueGroupedDataset
LowPrioritySQLImplicits
RelationalGroupedDataset
Row
RowFactory
RuntimeConfig
SQLContext
SQLImplicits
SaveMode
SparkSession
SparkSessionExtensions
TypedColumn
UDFRegistration
WriteConfigMethods
functions

c

org.apache.spark.sql

KeyValueGroupedDataset

class KeyValueGroupedDataset[K, V] extends Serializable

A Dataset has been logically grouped by a user specified grouping key. Users should not construct a KeyValueGroupedDataset directly, but should instead call groupByKey on an existing Dataset.

Source: KeyValueGroupedDataset.scala
Since: 2.0.0

Linear Supertypes

Serializable, Serializable, AnyRef, Any

Ordering

Alphabetic
By Inheritance

Inherited

KeyValueGroupedDataset
Serializable
Serializable
AnyRef
Any

Hide All
Show All

Visibility

Public
All

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def agg[U1, U2, U3, U4, U5, U6, U7, U8](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6], col7: TypedColumn[V, U7], col8: TypedColumn[V, U8]): Dataset[(K, U1, U2, U3, U4, U5, U6, U7, U8)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
3.0.0
def agg[U1, U2, U3, U4, U5, U6, U7](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6], col7: TypedColumn[V, U7]): Dataset[(K, U1, U2, U3, U4, U5, U6, U7)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
3.0.0
def agg[U1, U2, U3, U4, U5, U6](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5], col6: TypedColumn[V, U6]): Dataset[(K, U1, U2, U3, U4, U5, U6)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
3.0.0
def agg[U1, U2, U3, U4, U5](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4], col5: TypedColumn[V, U5]): Dataset[(K, U1, U2, U3, U4, U5)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
3.0.0
def agg[U1, U2, U3, U4](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3], col4: TypedColumn[V, U4]): Dataset[(K, U1, U2, U3, U4)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
1.6.0
def agg[U1, U2, U3](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2], col3: TypedColumn[V, U3]): Dataset[(K, U1, U2, U3)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
1.6.0
def agg[U1, U2](col1: TypedColumn[V, U1], col2: TypedColumn[V, U2]): Dataset[(K, U1, U2)]
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.
Computes the given aggregations, returning a Dataset of tuples for each unique key and the result of computing these aggregations over all elements in the group.

Since
1.6.0
def agg[U1](col1: TypedColumn[V, U1]): Dataset[(K, U1)]
Computes the given aggregation, returning a Dataset of tuples for each unique key and the result of computing this aggregation over all elements in the group.
Computes the given aggregation, returning a Dataset of tuples for each unique key and the result of computing this aggregation over all elements in the group.

Since
1.6.0
def aggUntyped(columns: TypedColumn[_, _]*): Dataset[_]
Internal helper function for building typed aggregations that return tuples.
Internal helper function for building typed aggregations that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

Attributes
protected
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( ... ) @native()
def cogroup[U, R](other: KeyValueGroupedDataset[K, U], f: CoGroupFunction[K, V, U, R], encoder: Encoder[R]): Dataset[R]
(Java-specific) Applies the given function to each cogrouped data.
(Java-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.

Since
1.6.0
def cogroup[U, R](other: KeyValueGroupedDataset[K, U])(f: (K, Iterator[V], Iterator[U]) ⇒ TraversableOnce[R])(implicit arg0: Encoder[R]): Dataset[R]
(Scala-specific) Applies the given function to each cogrouped data.
(Scala-specific) Applies the given function to each cogrouped data. For each unique group, the function will be passed the grouping key and 2 iterators containing all elements in the group from Dataset this and other. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.

Since
1.6.0
def count(): Dataset[(K, Long)]
Returns a Dataset that contains a tuple with each key and the number of items present for that key.
Returns a Dataset that contains a tuple with each key and the number of items present for that key.

Since
1.6.0
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def finalize(): Unit

Attributes
protected[lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def flatMapGroups[U](f: FlatMapGroupsFunction[K, V, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data.
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.

Since
1.6.0
def flatMapGroups[U](f: (K, Iterator[V]) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data.
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.

Since
1.6.0
def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], outputMode: OutputMode, stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
outputMode
The output mode of the function.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See Encoder for more details on what types are encodable to Spark SQL.

Since
2.2.0
def flatMapGroupsWithState[S, U](outputMode: OutputMode, timeoutConf: GroupStateTimeout)(func: (K, Iterator[V], GroupState[S]) ⇒ Iterator[U])(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
outputMode
The output mode of the function.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.

Since
2.2.0
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
Annotations
@native()
def hashCode(): Int

Definition Classes
AnyRef → Any
Annotations
@native()
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
def keyAs[L](implicit arg0: Encoder[L]): KeyValueGroupedDataset[L, V]
Returns a new KeyValueGroupedDataset where the type of the key has been mapped to the specified type.
Returns a new KeyValueGroupedDataset where the type of the key has been mapped to the specified type. The mapping of key columns to the type follows the same rules as as on Dataset.

Since
1.6.0
def keys: Dataset[K]
Returns a Dataset that contains each unique key.
Returns a Dataset that contains each unique key. This is equivalent to doing mapping over the Dataset to extract the keys and then running a distinct operation on those.

Since
1.6.0
def mapGroups[U](f: MapGroupsFunction[K, V, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data.
(Java-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.

Since
1.6.0
def mapGroups[U](f: (K, Iterator[V]) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data.
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.

Since
1.6.0
def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], outputEncoder: Encoder[U], timeoutConf: GroupStateTimeout): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See Encoder for more details on what types are encodable to Spark SQL.

Since
2.2.0
def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], outputEncoder: Encoder[U]): Dataset[U]
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Java-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group.
stateEncoder
Encoder for the state type.
outputEncoder
Encoder for the output type. See Encoder for more details on what types are encodable to Spark SQL.

Since
2.2.0
def mapGroupsWithState[S, U](timeoutConf: GroupStateTimeout)(func: (K, Iterator[V], GroupState[S]) ⇒ U)(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
timeoutConf
Timeout configuration for groups that do not receive data for a while. See Encoder for more details on what types are encodable to Spark SQL.
func
Function to be called on every group.

Since
2.2.0
def mapGroupsWithState[S, U](func: (K, Iterator[V], GroupState[S]) ⇒ U)(implicit arg0: Encoder[S], arg1: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state.
(Scala-specific) Applies the given function to each group of data, while maintaining a user-defined per-group state. The result Dataset will represent the objects returned by the function. For a static batch Dataset, the function will be invoked once per group. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations. See org.apache.spark.sql.streaming.GroupState for more details.
S
The type of the user-defined state. Must be encodable to Spark SQL types.
U
The type of the output objects. Must be encodable to Spark SQL types.
func
Function to be called on every group. See Encoder for more details on what types are encodable to Spark SQL.

Since
2.2.0
def mapValues[W](func: MapFunction[V, W], encoder: Encoder[W]): KeyValueGroupedDataset[K, W]
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data.
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
```
// Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>
Dataset<Tuple2<String, Integer>> ds = ...;
KeyValueGroupedDataset<String, Integer> grouped =
  ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT());
```
Since
2.1.0
def mapValues[W](func: (V) ⇒ W)(implicit arg0: Encoder[W]): KeyValueGroupedDataset[K, W]
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data.
Returns a new KeyValueGroupedDataset where the given function func has been applied to the data. The grouping key is unchanged by this.
```
// Create values grouped by key from a Dataset[(K, V)]
ds.groupByKey(_._1).mapValues(_._2) // Scala
```
Since
2.1.0
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
Annotations
@native()
final def notifyAll(): Unit

Definition Classes
AnyRef
Annotations
@native()
val queryExecution: QueryExecution
def reduceGroups(f: ReduceFunction[V]): Dataset[(K, V)]
(Java-specific) Reduces the elements of each group of data using the specified binary function.
(Java-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.

Since
1.6.0
def reduceGroups(f: (V, V) ⇒ V): Dataset[(K, V)]
(Scala-specific) Reduces the elements of each group of data using the specified binary function.
(Scala-specific) Reduces the elements of each group of data using the specified binary function. The given function must be commutative and associative or the result may be non-deterministic.

Since
1.6.0
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
KeyValueGroupedDataset → AnyRef → Any
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... ) @native()

Inherited from Serializable

Inherited from Serializable

Inherited from AnyRef

Inherited from Any

Ungrouped