Compute the Pearson correlation matrix for the input Dataset of Vectors.
:: Experimental :: Compute the correlation matrix for the input Dataset of Vectors using the specified method.
Methods currently supported: pearson (default), spearman.
dataset: A Dataset or a DataFrame.
column: The name of the column of vectors for which the correlation coefficient needs to be computed. This must be a column of the dataset, and it must contain Vector objects.
method: String specifying the method to use for computing correlation. Supported: pearson (default), spearman.
returns: A DataFrame that contains the correlation matrix of the column of vectors. This DataFrame contains a single row and a single column named '$METHODNAME($COLUMN)'.
Throws an exception if the column is not a valid column in the dataset, or if the content of this column is not of type Vector.
Here is how to access the correlation coefficient:
    val data: Dataset[Vector] = ...
    val Row(coeff: Matrix) = Correlation.corr(data, "value").head
    // coeff now contains the Pearson correlation matrix.
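The snippet above leaves the input elided. A fuller sketch, assuming a local SparkSession and a small made-up set of dense vectors, might look like the following. The object name CorrelationExample, the helper correlations, the column name "features", and the sample values are all illustrative and not part of the API.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

object CorrelationExample {

  // Hypothetical helper: returns (Pearson, Spearman) matrices for toy data.
  def correlations(spark: SparkSession): (Matrix, Matrix) = {
    import spark.implicits._

    // Illustrative data: four observations of three variables.
    val data = Seq(
      Vectors.dense(1.0, 10.0, 0.5),
      Vectors.dense(2.0, 8.0, 1.5),
      Vectors.dense(3.0, 7.0, 1.0),
      Vectors.dense(4.0, 3.0, 2.5)
    )
    // Wrap each vector in a Tuple1 so it becomes a one-column DataFrame.
    val df = data.map(Tuple1.apply).toDF("features")

    // Default method is Pearson; the result is a single row with one Matrix.
    val Row(pearson: Matrix) = Correlation.corr(df, "features").head

    // Cache before Spearman so the input is not recomputed while ranking.
    df.cache()
    val Row(spearman: Matrix) =
      Correlation.corr(df, "features", "spearman").head
    df.unpersist()

    (pearson, spearman)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CorrelationExample")
      .master("local[*]") // assumption: run locally for demonstration
      .getOrCreate()
    val (pearson, spearman) = correlations(spark)
    println(s"Pearson correlation matrix:\n$pearson")
    println(s"Spearman correlation matrix:\n$spearman")
    spark.stop()
  }
}
```

With three variables per vector, both results are 3 x 3 matrices. The pattern match on Row relies on the returned DataFrame having exactly one row and one column.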
For Spearman, a rank correlation, we need to create an RDD[Double] for each column
and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
which is fairly costly. Cache the input Dataset before calling corr with method = "spearman"
to avoid recomputing the common lineage.
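To illustrate why Spearman needs a sort per column, here is a minimal single-machine sketch in plain Scala, with no Spark involved. It ranks each column and then applies the Pearson formula to the ranks. The object SpearmanSketch and its methods are hypothetical names, and giving tied values the average of their positions is an assumption of this sketch, not a statement about the library's internals.

```scala
object SpearmanSketch {

  // 1-based ranks; tied values share the mean of the positions they occupy.
  def ranks(xs: Seq[Double]): Seq[Double] = {
    // Sorting is the costly step: each column must be ordered to find ranks.
    val sorted = xs.toArray.zipWithIndex.sortBy(_._1)
    val out = Array.ofDim[Double](xs.length)
    var i = 0
    while (i < sorted.length) {
      // Extend j to the end of the current run of equal values.
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j) / 2.0 + 1.0
      (i to j).foreach(k => out(sorted(k)._2) = avg)
      i = j + 1
    }
    out.toSeq
  }

  // Pearson correlation coefficient of two equally long, non-constant series.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty, "series must match in length")
    val n = xs.length
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // Spearman is Pearson applied to the ranks instead of the raw values.
  def spearman(xs: Seq[Double], ys: Seq[Double]): Double =
    pearson(ranks(xs), ranks(ys))
}
```

In a distributed setting every column has to be sorted independently and the ranks joined back by row, which is why the input is read several times and caching it first helps.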
API for correlation functions in MLlib, compatible with DataFrames and Datasets.
The functions in this package generalize the functions in org.apache.spark.sql.Dataset#stat to spark.ml's Vector types.