DataFrameStatFunctions

java.lang.Object
- org.apache.spark.sql.DataFrameStatFunctions

public final class DataFrameStatFunctions
extends java.lang.Object

:: Experimental :: Statistic functions for DataFrames.

Since:: 1.4.0

Method Summary

Methods
Modifier and Type	Method and Description
`double`	`corr(java.lang.String col1, java.lang.String col2)` Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
`double`	`corr(java.lang.String col1, java.lang.String col2, java.lang.String method)` Calculates the correlation of two columns of a DataFrame.
`double`	`cov(java.lang.String col1, java.lang.String col2)` Calculate the sample covariance of two numerical columns of a DataFrame.
`DataFrame`	`crosstab(java.lang.String col1, java.lang.String col2)` Computes a pair-wise frequency table of the given columns.
`DataFrame`	`freqItems(scala.collection.Seq<java.lang.String> cols)` (Scala-specific) Finding frequent items for columns, possibly with false positives.
`DataFrame`	`freqItems(scala.collection.Seq<java.lang.String> cols, double support)` (Scala-specific) Finding frequent items for columns, possibly with false positives.
`DataFrame`	`freqItems(java.lang.String[] cols)` Finding frequent items for columns, possibly with false positives.
`DataFrame`	`freqItems(java.lang.String[] cols, double support)` Finding frequent items for columns, possibly with false positives.
`<T> DataFrame`	`sampleBy(java.lang.String col, java.util.Map<T,java.lang.Double> fractions, long seed)` Returns a stratified sample without replacement based on the fraction given on each stratum.
`<T> DataFrame`	`sampleBy(java.lang.String col, scala.collection.immutable.Map<T,java.lang.Object> fractions, long seed)` Returns a stratified sample without replacement based on the fraction given on each stratum.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Method Detail
  - cov
```
public double cov(java.lang.String col1,
         java.lang.String col2)
```
    Calculate the sample covariance of two numerical columns of a DataFrame.
    Parameters:
    col1 - the name of the first column
    col2 - the name of the second column
    
    Returns:
    the covariance of the two columns.
    
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10)) .withColumn("rand2", rand(seed=27)) df.stat.cov("rand1", "rand2") res1: Double = 0.065...
    Since:
    
    1.4.0
  - corr
```
public double corr(java.lang.String col1,
          java.lang.String col2,
          java.lang.String method)
```
    Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
    Parameters:
    col1 - the name of the column
    col2 - the name of the column to calculate the correlation against
    method - (undocumented)
    
    Returns:
    The Pearson Correlation Coefficient as a Double.
    
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10)) .withColumn("rand2", rand(seed=27)) df.stat.corr("rand1", "rand2") res1: Double = 0.613...
    Since:
    
    1.4.0
  - corr
```
public double corr(java.lang.String col1,
          java.lang.String col2)
```
    Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
    Parameters:
    col1 - the name of the column
    col2 - the name of the column to calculate the correlation against
    
    Returns:
    The Pearson Correlation Coefficient as a Double.
    
    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10)) .withColumn("rand2", rand(seed=27)) df.stat.corr("rand1", "rand2", "pearson") res1: Double = 0.613...
    Since:
    
    1.4.0
  - crosstab
```
public DataFrame crosstab(java.lang.String col1,
                 java.lang.String col2)
```
    Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be $col1_$col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
    Parameters:
    col1 - The name of the first column. Distinct items will make the first item of each row.
    col2 - The name of the second column. Distinct items will make the column names of the DataFrame.
    
    Returns:
    A DataFrame containing for the contingency table.
    
    val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value") val ct = df.stat.crosstab("key", "value") ct.show() +---------+---+---+---+ |key_value| 1| 2| 3| +---------+---+---+---+ | 2| 2| 0| 1| | 1| 1| 1| 0| | 3| 0| 1| 1| +---------+---+---+---+
    Since:
    
    1.4.0
  - freqItems
```
public DataFrame freqItems(java.lang.String[] cols,
                  double support)
```
    Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
    Parameters:
    cols - the names of the columns to search frequent items in.
    support - The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
    
    Returns:
    A Local DataFrame with the Array of frequent items for each column.
    
    val rows = Seq.tabulate(100) { i => if (i % 2 == 0) (1, -1.0) else (i, i * -1.0) } val df = sqlContext.createDataFrame(rows).toDF("a", "b") // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns // "a" and "b" val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4) freqSingles.show() +-----------+-------------+ |a_freqItems| b_freqItems| +-----------+-------------+ | [1, 99]|[-1.0, -99.0]| +-----------+-------------+ // find the pair of items with a frequency greater than 0.1 in columns "a" and "b" val pairDf = df.select(struct("a", "b").as("a-b")) val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1) freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show() +----------+ | freq_ab| +----------+ | [1,-1.0]| | ... | +----------+
    Since:
    
    1.4.0
  - freqItems
```
public DataFrame freqItems(java.lang.String[] cols)
```
    Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
    
    Parameters:
    cols - the names of the columns to search frequent items in.
    
    Returns:
    A Local DataFrame with the Array of frequent items for each column.
    Since:
    
    1.4.0
  - freqItems
```
public DataFrame freqItems(scala.collection.Seq<java.lang.String> cols,
                  double support)
```
    (Scala-specific) Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou.
    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
    Parameters:
    cols - the names of the columns to search frequent items in.
    support - (undocumented)
    
    Returns:
    A Local DataFrame with the Array of frequent items for each column.
    
    val rows = Seq.tabulate(100) { i => if (i % 2 == 0) (1, -1.0) else (i, i * -1.0) } val df = sqlContext.createDataFrame(rows).toDF("a", "b") // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns // "a" and "b" val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4) freqSingles.show() +-----------+-------------+ |a_freqItems| b_freqItems| +-----------+-------------+ | [1, 99]|[-1.0, -99.0]| +-----------+-------------+ // find the pair of items with a frequency greater than 0.1 in columns "a" and "b" val pairDf = df.select(struct("a", "b").as("a-b")) val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1) freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show() +----------+ | freq_ab| +----------+ | [1,-1.0]| | ... | +----------+
    Since:
    
    1.4.0
  - freqItems
```
public DataFrame freqItems(scala.collection.Seq<java.lang.String> cols)
```
    (Scala-specific) Finding frequent items for columns, possibly with false positives. Using the frequent element count algorithm described in http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou. Uses a default support of 1%.
    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
    
    Parameters:
    cols - the names of the columns to search frequent items in.
    
    Returns:
    A Local DataFrame with the Array of frequent items for each column.
    Since:
    
    1.4.0
  - sampleBy
```
public <T> DataFrame sampleBy(java.lang.String col,
                     scala.collection.immutable.Map<T,java.lang.Object> fractions,
                     long seed)
```
    Returns a stratified sample without replacement based on the fraction given on each stratum.
    Parameters:
    col - column that defines strata
    fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
    seed - random seed
    
    Returns:
    a new DataFrame that represents the stratified sample
    
    val df = sqlContext.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value") val fractions = Map(1 -> 1.0, 3 -> 0.5) df.stat.sampleBy("key", fractions, 36L).show() +---+-----+ |key|value| +---+-----+ | 1| 1| | 1| 2| | 3| 2| +---+-----+
    Since:
    
    1.5.0
  - sampleBy
```
public <T> DataFrame sampleBy(java.lang.String col,
                     java.util.Map<T,java.lang.Double> fractions,
                     long seed)
```
    Returns a stratified sample without replacement based on the fraction given on each stratum.
    
    Parameters:
    col - column that defines strata
    fractions - sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
    seed - random seed
    
    Returns:
    a new DataFrame that represents the stratified sample
    Since:
    
    1.5.0

Class DataFrameStatFunctions

Method Summary

Methods inherited from class java.lang.Object

Method Detail

cov

corr

corr

crosstab

freqItems

freqItems

freqItems

freqItems

sampleBy

sampleBy