pyspark.pandas.Series.corr
Series.corr(other: pyspark.pandas.series.Series, method: str = 'pearson', min_periods: Optional[int] = None) → float
Compute correlation with other Series, excluding missing values.
New in version 3.3.0.
- Parameters
- other : Series
- method : {‘pearson’, ‘spearman’, ‘kendall’}
pearson : standard correlation coefficient
spearman : Spearman rank correlation
kendall : Kendall Tau correlation coefficient
Changed in version 3.4.0: support ‘kendall’ for method parameter
- min_periods : int, optional
Minimum number of observations needed to have a valid result.
New in version 3.4.0.
- Returns
- correlation : float
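As a rough illustration of how the `pearson` and `spearman` options relate, the sketch below computes both in plain Python: Spearman is taken as the Pearson coefficient of the average ranks. This is a sketch, not the pandas-on-Spark implementation; the helper names `average_ranks`, `pearson`, and `spearman` are hypothetical, and the code assumes equal-length inputs with no missing values and no constant column.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Standard (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither input is constant


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


s1 = [.2, .0, .6, .2]
s2 = [.3, .6, .0, .1]
print(round(pearson(s1, s2), 5), round(spearman(s1, s2), 5))
# prints: -0.85106 -0.94868
```

On this small input the two values agree with the `pearson` and `spearman` results listed under Examples.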
Notes
The complexity of the Kendall correlation is O(#row * #row). If the dataset is too large, sampling before computing the correlation is recommended.
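To see where the quadratic cost comes from, here is a minimal pure-Python sketch of a pairwise Kendall computation (the tau-b variant, which handles ties). It is an illustration only and not the pandas-on-Spark implementation; the helper name `kendall_tau_b` is hypothetical. Every pair of rows is compared once, so the work grows with the square of the row count.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Naive Kendall tau-b: compares every pair of rows, hence O(n^2)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            ties_x += 1      # tied in x only
        elif dy == 0:
            ties_y += 1      # tied in y only
        elif dx * dy > 0:
            concordant += 1  # both differences have the same sign
        else:
            discordant += 1  # the differences have opposite signs
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


s1 = [.2, .0, .6, .2]
s2 = [.3, .6, .0, .1]
tau = kendall_tau_b(s1, s2)
print(round(tau, 5))
# prints: -0.91287
```

With n rows the loop visits n * (n - 1) / 2 pairs. For large data, one option is to downsample the parent DataFrame first (for example with `DataFrame.sample(frac=...)`) so that both columns keep the same rows, and then call `corr(method='kendall')` on the sampled columns.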
Examples
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'s1': [.2, .0, .6, .2],
...                    's2': [.3, .6, .0, .1]})
>>> s1 = df.s1
>>> s2 = df.s2
>>> s1.corr(s2, method='pearson')
-0.85106...
>>> s1.corr(s2, method='spearman')
-0.94868...
>>> s1.corr(s2, method='kendall')
-0.91287...
>>> s1 = ps.Series([1, np.nan, 2, 1, 1, 2, 3])
>>> s2 = ps.Series([3, 4, 1, 1, 5])
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="pearson")
-0.52223...
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="spearman")
-0.54433...
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="kendall")
-0.51639...
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     s1.corr(s2, method="kendall", min_periods=5)
nan
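The last examples combine alignment of two Series of different lengths, missing-value exclusion, and `min_periods`. The pure-Python sketch below mimics that behavior for the Pearson case under stated assumptions: rows are paired by position as a stand-in for index alignment, any pair containing a missing value is dropped, and NaN is returned when fewer than `min_periods` valid pairs remain. The helper `pearson_corr` is hypothetical and is not the library's implementation.

```python
import math


def pearson_corr(xs, ys, min_periods=None):
    """Pearson correlation over positionally paired values, skipping NaN pairs."""
    # zip() stops at the shorter input; this stands in for rows that exist
    # in only one Series after alignment (their partner would be missing).
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    if n < 2 or (min_periods is not None and n < min_periods):
        return float("nan")  # not enough valid observations
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)


s1 = [1, float("nan"), 2, 1, 1, 2, 3]
s2 = [3, 4, 1, 1, 5]
r = pearson_corr(s1, s2)                        # 4 valid pairs remain
r_strict = pearson_corr(s1, s2, min_periods=5)  # fewer than 5 valid pairs
print(round(r, 5), r_strict)
# prints: -0.52223 nan
```

Only four pairs survive here (positions 0, 2, 3 and 4), which is why requiring `min_periods=5` yields NaN in this sketch, in line with the final example above.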