Posted to commits@spark.apache.org by ru...@apache.org on 2022/11/10 07:42:48 UTC
[spark] branch master updated: [SPARK-40877][DOC][FOLLOW-UP] Update the doc of `DataFrame.stat.crosstab `
This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 40a9a6ef5b8 [SPARK-40877][DOC][FOLLOW-UP] Update the doc of `DataFrame.stat.crosstab `
40a9a6ef5b8 is described below
commit 40a9a6ef5b89f0c3d19db4a43b8a73decaa173c3
Author: Ruifeng Zheng <ru...@apache.org>
AuthorDate: Thu Nov 10 15:42:19 2022 +0800
[SPARK-40877][DOC][FOLLOW-UP] Update the doc of `DataFrame.stat.crosstab `
### What changes were proposed in this pull request?
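Remove the outdated comments about the 1e4 distinct-value limit and the 1e6 non-zero pair frequency cap from the `crosstab` docs.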
### Why are the changes needed?
The limitations no longer apply after the [reimplementation](https://github.com/apache/spark/pull/38340).
### Does this PR introduce _any_ user-facing change?
Yes, the user-facing API documentation changes.
### How was this patch tested?
Doc-only change.
Closes #38579 from zhengruifeng/doc_crosstab.
Lead-authored-by: Ruifeng Zheng <ru...@apache.org>
Co-authored-by: Ruifeng Zheng <ru...@foxmail.com>
Signed-off-by: Ruifeng Zheng <ru...@apache.org>
---
python/pyspark/sql/dataframe.py | 3 +--
.../src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala | 2 --
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 3c787f8900f..6d5014918bf 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -4217,8 +4217,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
def crosstab(self, col1: str, col2: str) -> "DataFrame":
"""
Computes a pair-wise frequency table of the given columns. Also known as a contingency
- table. The number of distinct values for each column should be less than 1e4. At most 1e6
- non-zero pair frequencies will be returned.
+ table.
The first column of each row will be the distinct values of `col1` and the column names
will be the distinct values of `col2`. The name of the first column will be `$col1_$col2`.
Pairs that have no occurrences will have zero as their counts.
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
index efd430633d7..7511c21fa76 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
@@ -181,8 +181,6 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
/**
* Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
- * The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero
- * pair frequencies will be returned.
* The first column of each row will be the distinct values of `col1` and the column names will
* be the distinct values of `col2`. The name of the first column will be `col1_col2`. Counts
* will be returned as `Long`s. Pairs that have no occurrences will have zero as their counts.
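
For readers unfamiliar with the table shape the docstrings above describe, here is a minimal pure-Python sketch. It is not Spark's implementation: `crosstab_sketch` is a hypothetical helper, the input is assumed to be a list of dicts, and the sorted ordering of rows and columns is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def crosstab_sketch(rows, col1, col2):
    """Illustrate the contingency-table shape (hypothetical helper, not Spark code)."""
    # Count how often each (col1 value, col2 value) pair occurs.
    pair_counts = Counter((row[col1], row[col2]) for row in rows)

    # Distinct values of each column; sorting is an assumption for stable output.
    row_keys = sorted({a for a, _ in pair_counts}, key=str)
    col_keys = sorted({b for _, b in pair_counts}, key=str)

    # First column is named "<col1>_<col2>"; the rest are the distinct col2 values.
    header = [f"{col1}_{col2}"] + [str(b) for b in col_keys]

    # Pairs that never occur get a count of zero (Counter returns 0 for missing keys).
    table = [
        [str(a)] + [pair_counts[(a, b)] for b in col_keys]
        for a in row_keys
    ]
    return header, table


rows = [
    {"c1": 1, "c2": "a"},
    {"c1": 1, "c2": "b"},
    {"c1": 2, "c2": "a"},
    {"c1": 1, "c2": "a"},
]
header, table = crosstab_sketch(rows, "c1", "c2")
print(header)
for line in table:
    print(line)
```

The pair `(2, "b")` never appears in the sample rows, so its cell is filled with zero, matching the "zero as their counts" sentence in both docstrings.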