public final class DataFrameStatFunctions
extends java.lang.Object
DataFrame
s.
Modifier and Type | Method and Description |
---|---|
double |
corr(java.lang.String col1,
java.lang.String col2)
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
|
double |
corr(java.lang.String col1,
java.lang.String col2,
java.lang.String method)
Calculates the correlation of two columns of a DataFrame.
|
double |
cov(java.lang.String col1,
java.lang.String col2)
Calculate the sample covariance of two numerical columns of a DataFrame.
|
DataFrame |
crosstab(java.lang.String col1,
java.lang.String col2)
Computes a pair-wise frequency table of the given columns.
|
DataFrame |
freqItems(scala.collection.Seq<java.lang.String> cols)
(Scala-specific) Finding frequent items for columns, possibly with false positives.
|
DataFrame |
freqItems(scala.collection.Seq<java.lang.String> cols,
double support)
(Scala-specific) Finding frequent items for columns, possibly with false positives.
|
DataFrame |
freqItems(java.lang.String[] cols)
Finding frequent items for columns, possibly with false positives.
|
DataFrame |
freqItems(java.lang.String[] cols,
double support)
Finding frequent items for columns, possibly with false positives.
|
<T> DataFrame |
sampleBy(java.lang.String col,
java.util.Map<T,java.lang.Double> fractions,
long seed)
Returns a stratified sample without replacement based on the fraction given on each stratum.
|
<T> DataFrame |
sampleBy(java.lang.String col,
scala.collection.immutable.Map<T,java.lang.Object> fractions,
long seed)
Returns a stratified sample without replacement based on the fraction given on each stratum.
|
public double cov(java.lang.String col1, java.lang.String col2)
col1
- the name of the first columncol2
- the name of the second column
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
.withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
public double corr(java.lang.String col1, java.lang.String col2, java.lang.String method)
col1
- the name of the columncol2
- the name of the column to calculate the correlation againstmethod
- (undocumented)
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
.withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
public double corr(java.lang.String col1, java.lang.String col2)
col1
- the name of the columncol2
- the name of the column to calculate the correlation against
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
.withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
res1: Double = 0.613...
public DataFrame crosstab(java.lang.String col1, java.lang.String col2)
col1
and the column names will
be the distinct values of col2
. The name of the first column will be $col1_$col2
. Counts
will be returned as Long
s. Pairs that have no occurrences will have zero as their counts.
Null elements will be replaced by "null", and back ticks will be dropped from elements if they
exist.
col1
- The name of the first column. Distinct items will make the first item of
each row.col2
- The name of the second column. Distinct items will make the column names
of the DataFrame.
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
(3, 3))).toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()
+---------+---+---+---+
|key_value| 1| 2| 3|
+---------+---+---+---+
| 2| 2| 0| 1|
| 1| 1| 1| 0|
| 3| 0| 1| 1|
+---------+---+---+---+
public DataFrame freqItems(java.lang.String[] cols, double support)
http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou
.
The support
should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
.
cols
- the names of the columns to search frequent items in.support
- The minimum frequency for an item to be considered frequent
. Should be greater
than 1e-4.
val rows = Seq.tabulate(100) { i =>
if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = sqlContext.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems| b_freqItems|
+-----------+-------------+
| [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
| freq_ab|
+----------+
| [1,-1.0]|
| ... |
+----------+
public DataFrame freqItems(java.lang.String[] cols)
http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou
.
Uses a default
support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
.
cols
- the names of the columns to search frequent items in.public DataFrame freqItems(scala.collection.Seq<java.lang.String> cols, double support)
http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou
.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
.
cols
- the names of the columns to search frequent items in.support
- (undocumented)
val rows = Seq.tabulate(100) { i =>
if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
}
val df = sqlContext.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-------------+
|a_freqItems| b_freqItems|
+-----------+-------------+
| [1, 99]|[-1.0, -99.0]|
+-----------+-------------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("a-b"))
val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
+----------+
| freq_ab|
+----------+
| [1,-1.0]|
| ... |
+----------+
public DataFrame freqItems(scala.collection.Seq<java.lang.String> cols)
http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou
.
Uses a default
support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the
backward compatibility of the schema of the resulting DataFrame
.
cols
- the names of the columns to search frequent items in.public <T> DataFrame sampleBy(java.lang.String col, scala.collection.immutable.Map<T,java.lang.Object> fractions, long seed)
col
- column that defines stratafractions
- sampling fraction for each stratum. If a stratum is not specified, we treat
its fraction as zero.seed
- random seedDataFrame
that represents the stratified sample
val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
(3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
+---+-----+
|key|value|
+---+-----+
| 1| 1|
| 1| 2|
| 3| 2|
+---+-----+
public <T> DataFrame sampleBy(java.lang.String col, java.util.Map<T,java.lang.Double> fractions, long seed)
col
- column that defines stratafractions
- sampling fraction for each stratum. If a stratum is not specified, we treat
its fraction as zero.seed
- random seedDataFrame
that represents the stratified sample