public class GroupedData
extends java.lang.Object
A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy.
Modifier | Constructor and Description |
---|---|
protected | GroupedData(DataFrame df, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, org.apache.spark.sql.GroupedData.GroupType groupType) |
Modifier and Type | Method and Description |
---|---|
DataFrame | agg(Column expr, Column... exprs) Compute aggregates by specifying a series of aggregate columns. |
DataFrame | agg(Column expr, scala.collection.Seq<Column> exprs) Compute aggregates by specifying a series of aggregate columns. |
DataFrame | agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs) (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. |
DataFrame | agg(java.util.Map<java.lang.String,java.lang.String> exprs) (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. |
DataFrame | agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr, scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs) (Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods. |
static GroupedData | apply(DataFrame df, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, org.apache.spark.sql.GroupedData.GroupType groupType) |
DataFrame | avg(scala.collection.Seq<java.lang.String> colNames) Compute the mean value for each numeric column for each group. |
DataFrame | avg(java.lang.String... colNames) Compute the mean value for each numeric column for each group. |
DataFrame | count() Count the number of rows for each group. |
DataFrame | max(scala.collection.Seq<java.lang.String> colNames) Compute the max value for each numeric column for each group. |
DataFrame | max(java.lang.String... colNames) Compute the max value for each numeric column for each group. |
DataFrame | mean(scala.collection.Seq<java.lang.String> colNames) Compute the average value for each numeric column for each group. |
DataFrame | mean(java.lang.String... colNames) Compute the average value for each numeric column for each group. |
DataFrame | min(scala.collection.Seq<java.lang.String> colNames) Compute the min value for each numeric column for each group. |
DataFrame | min(java.lang.String... colNames) Compute the min value for each numeric column for each group. |
DataFrame | sum(scala.collection.Seq<java.lang.String> colNames) Compute the sum for each numeric column for each group. |
DataFrame | sum(java.lang.String... colNames) Compute the sum for each numeric column for each group. |
protected GroupedData(DataFrame df, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, org.apache.spark.sql.GroupedData.GroupType groupType)
public static GroupedData apply(DataFrame df, scala.collection.Seq<org.apache.spark.sql.catalyst.expressions.Expression> groupingExprs, org.apache.spark.sql.GroupedData.GroupType groupType)
public DataFrame agg(Column expr, Column... exprs)
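Compute aggregates by specifying a series of aggregate columns. By default, the grouping columns are retained in the output; to drop them, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.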
// Selects the age of the oldest employee and the aggregate expense for each department
// Scala:
import org.apache.spark.sql.functions._
df.groupBy("department").agg(max("age"), sum("expense"))
// Java:
import static org.apache.spark.sql.functions.*;
df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.
// Scala, 1.3.x:
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// Java, 1.3.x:
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
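The effect of spark.sql.retainGroupColumns can be checked by comparing the result columns before and after changing the setting. The following is a minimal sketch, assuming Spark 1.4 running in local mode; the object name, the application name, and the sample rows are hypothetical and exist only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{max, sum}

// Hypothetical driver object, not part of the Spark API.
object RetainGroupColumnsSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode context; master and app name are illustrative choices.
    val conf = new SparkConf().setAppName("retain-group-columns-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Made-up sample data: (department, age, expense).
    val df = sqlContext.createDataFrame(Seq(
      ("eng", 34, 120.0),
      ("eng", 45, 80.0),
      ("ops", 29, 60.0)
    )).toDF("department", "age", "expense")

    // Default setting: the grouping column is expected in the result.
    val retained = df.groupBy("department").agg(max("age"), sum("expense"))
    println(retained.columns.mkString(", "))

    // Switch to the pre-1.4 behavior, then aggregate again.
    sqlContext.setConf("spark.sql.retainGroupColumns", "false")
    val dropped = df.groupBy("department").agg(max("age"), sum("expense"))
    println(dropped.columns.mkString(", "))

    sc.stop()
  }
}
```

With the setting at false, the grouping column has to be passed to agg explicitly if it is still wanted in the output, as in the 1.3.x examples.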
expr - (undocumented)
exprs - (undocumented)
public DataFrame mean(java.lang.String... colNames)
Compute the average value for each numeric column for each group. This is an alias for avg.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the average values for them.
colNames - (undocumented)
public DataFrame max(java.lang.String... colNames)
Compute the max value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the max values for them.
colNames - (undocumented)
public DataFrame avg(java.lang.String... colNames)
Compute the mean value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the mean values for them.
colNames - (undocumented)
public DataFrame min(java.lang.String... colNames)
Compute the min value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the min values for them.
colNames - (undocumented)
public DataFrame sum(java.lang.String... colNames)
Compute the sum for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the sum for them.
colNames - (undocumented)
public DataFrame agg(scala.Tuple2<java.lang.String,java.lang.String> aggExpr, scala.collection.Seq<scala.Tuple2<java.lang.String,java.lang.String>> aggExprs)
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.
The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(
"age" -> "max",
"expense" -> "sum"
)
aggExpr - (undocumented)
aggExprs - (undocumented)
public DataFrame agg(scala.collection.immutable.Map<java.lang.String,java.lang.String> exprs)
(Scala-specific) Compute aggregates by specifying a map from column name to aggregate methods.
The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
df.groupBy("department").agg(Map(
"age" -> "max",
"expense" -> "sum"
))
exprs - (undocumented)
public DataFrame agg(java.util.Map<java.lang.String,java.lang.String> exprs)
(Java-specific) Compute aggregates by specifying a map from column name to aggregate methods.
The resulting DataFrame will also contain the grouping columns.
The available aggregate methods are avg, max, min, sum, count.
// Selects the age of the oldest employee and the aggregate expense for each department
import com.google.common.collect.ImmutableMap;
df.groupBy("department").agg(ImmutableMap.of("age", "max", "expense", "sum"));
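exprs - (undocumented)
public DataFrame agg(Column expr, scala.collection.Seq<Column> exprs)
Compute aggregates by specifying a series of aggregate columns. By default, the grouping columns are retained in the output; to drop them, set spark.sql.retainGroupColumns to false.
The available aggregate methods are defined in functions.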
// Selects the age of the oldest employee and the aggregate expense for each department
// Scala:
import org.apache.spark.sql.functions._
df.groupBy("department").agg(max("age"), sum("expense"))
// Java:
import static org.apache.spark.sql.functions.*;
df.groupBy("department").agg(max("age"), sum("expense"));
Note that before Spark 1.4, the default behavior was to NOT retain grouping columns. To restore that behavior, set the config variable spark.sql.retainGroupColumns to false.
// Scala, 1.3.x:
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// Java, 1.3.x:
df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
expr - (undocumented)
exprs - (undocumented)
public DataFrame count()
Count the number of rows for each group.
The resulting DataFrame will also contain the grouping columns.
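As a rough illustration of count(), the sketch below groups a small DataFrame and counts the rows per group. It assumes Spark 1.4 in local mode; the object name and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object, not part of the Spark API.
object GroupedCountSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode context; master and app name are illustrative choices.
    val sc = new SparkContext(
      new SparkConf().setAppName("grouped-count-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Made-up sample data: (department, age).
    val df = sqlContext.createDataFrame(Seq(
      ("eng", 34),
      ("eng", 45),
      ("ops", 29)
    )).toDF("department", "age")

    // One output row per distinct department, holding that group's row count.
    val counts = df.groupBy("department").count()
    counts.show()

    sc.stop()
  }
}
```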
public DataFrame mean(scala.collection.Seq<java.lang.String> colNames)
Compute the average value for each numeric column for each group. This is an alias for avg.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the average values for them.
colNames - (undocumented)
public DataFrame max(scala.collection.Seq<java.lang.String> colNames)
Compute the max value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the max values for them.
colNames - (undocumented)
public DataFrame avg(scala.collection.Seq<java.lang.String> colNames)
Compute the mean value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the mean values for them.
colNames - (undocumented)
public DataFrame min(scala.collection.Seq<java.lang.String> colNames)
Compute the min value for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the min values for them.
colNames - (undocumented)
public DataFrame sum(scala.collection.Seq<java.lang.String> colNames)
Compute the sum for each numeric column for each group.
The resulting DataFrame will also contain the grouping columns.
When specified columns are given, only compute the sum for them.
colNames - (undocumented)
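The column-name aggregates (avg, mean, max, min, sum) can be called with explicit column names or with none, in which case they apply to every numeric column. A minimal sketch follows, assuming Spark 1.4 in local mode; the object name and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object, not part of the Spark API.
object NamedColumnAggregatesSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode context; master and app name are illustrative choices.
    val sc = new SparkContext(
      new SparkConf().setAppName("named-column-aggregates-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Made-up sample data: (department, age, expense).
    val df = sqlContext.createDataFrame(Seq(
      ("eng", 34, 120.0),
      ("eng", 45, 80.0),
      ("ops", 29, 60.0)
    )).toDF("department", "age", "expense")

    val grouped = df.groupBy("department")

    // Restrict the aggregate to the named columns.
    grouped.avg("age").show()
    grouped.mean("age").show()            // mean is an alias for avg
    grouped.max("age", "expense").show()
    grouped.sum("expense").show()

    // No column names: aggregate every numeric column.
    grouped.min().show()

    sc.stop()
  }
}
```

A GroupedData value can be reused for several aggregations, as above; each call returns a new DataFrame.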