pyspark.pandas.DataFrame.corr

DataFrame.corr(method='pearson', min_periods=None)

Compute pairwise correlation of columns, excluding NA/null values.

New in version 3.3.0.

Parameters
method : {‘pearson’, ‘spearman’, ‘kendall’}
  • pearson : standard correlation coefficient

  • spearman : Spearman rank correlation

  • kendall : Kendall Tau correlation coefficient

Changed in version 3.4.0: Added support for ‘kendall’ as a value of the method parameter.

min_periods : int, optional

Minimum number of observations required per pair of columns to have a valid result.

New in version 3.4.0.
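To illustrate how min_periods relates to pairwise complete observations, here is a minimal plain-Python sketch. It is not the pandas-on-Spark implementation: the helper name pairwise_pearson is hypothetical, missing values are represented as None, and returning NaN when too few observations remain is an assumption modeled on pandas behavior.

```python
import math


def pairwise_pearson(x, y, min_periods=1):
    """Pearson correlation over rows where both values are present.

    Hypothetical helper for illustration only; None marks a missing value.
    """
    # Keep only pairwise complete observations.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    # Assumption: too few usable rows yields NaN rather than an error.
    if n < min_periods or n < 2:
        return math.nan
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


dogs = [0.2, 0.0, 0.6, 0.2]
cats = [0.3, None, 0.0, 0.1]  # one missing value leaves 3 complete pairs

print(pairwise_pearson(dogs, cats, min_periods=3))  # enough observations
print(pairwise_pearson(dogs, cats, min_periods=4))  # nan: only 3 complete pairs
```

In this sketch the threshold is checked per pair of columns, so one pair can have a valid result while another pair in the same frame does not.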

Returns
DataFrame

Notes

  1. Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

  2. The complexity of Kendall correlation is O(#row * #row). If the dataset is too large, sampling before computing the correlation is recommended.
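The quadratic cost in note 2 comes from comparing every pair of rows. The following plain-Python sketch shows a naive pair-counting Kendall computation. It is an illustration under assumptions, not the pandas-on-Spark implementation: the function name kendall_tau_b is hypothetical, and the tau-b tie correction is assumed because it reproduces the 'kendall' value shown in the Examples section.

```python
import math


def kendall_tau_b(x, y):
    """Naive Kendall tau-b: compares all n * (n - 1) / 2 row pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both columns: counted nowhere
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # Pairs not tied in x, and pairs not tied in y.
    not_tied_x = concordant + discordant + tied_y_only
    not_tied_y = concordant + discordant + tied_x_only
    if not_tied_x == 0 or not_tied_y == 0:
        return math.nan  # undefined when a column is constant
    return (concordant - discordant) / math.sqrt(not_tied_x * not_tied_y)


dogs = [0.2, 0.0, 0.6, 0.2]
cats = [0.3, 0.6, 0.0, 0.1]
print(round(kendall_tau_b(dogs, cats), 6))  # -0.912871
```

Because the nested loop grows with the square of the row count, reducing the rows first, for example with something like df.sample(frac=0.1, random_state=0).corr('kendall'), trades some accuracy for a much smaller number of comparisons. The fraction and seed here are arbitrary illustrative values.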

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr('pearson')
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000
>>> df.corr('spearman')
          dogs      cats
dogs  1.000000 -0.948683
cats -0.948683  1.000000
>>> df.corr('kendall')
          dogs      cats
dogs  1.000000 -0.912871
cats -0.912871  1.000000