Posted to reviews@spark.apache.org by "HyukjinKwon (via GitHub)" <gi...@apache.org> on 2023/11/06 20:28:25 UTC

[PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

HyukjinKwon opened a new pull request, #43686:
URL: https://github.com/apache/spark/pull/43686

   ### What changes were proposed in this pull request?
   
   This PR proposes to improve the docstring of `count_distinct`.
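
   For reference, `count_distinct(col, *cols)` returns a `Column` that counts distinct combinations of the given columns. A minimal sketch of the documented usage, assuming an active `SparkSession` bound to `spark` (mirroring the refined examples in the diff below):

   ```python
   from pyspark.sql import functions as sf

   # Count distinct values of a single column; duplicate rows collapse to one.
   df = spark.createDataFrame([(1,), (1,), (3,)], ["value"])
   df.select(sf.count_distinct("value")).show()
   # +---------------------+
   # |count(DISTINCT value)|
   # +---------------------+
   # |                    2|
   # +---------------------+
   ```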
   
   ### Why are the changes needed?
   
   For end users, and to improve the usability of PySpark.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it fixes the user-facing documentation.
   
   ### How was this patch tested?
   
   Manually tested.
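
   One way to exercise the refined examples locally (a sketch, not necessarily how this patch was verified, nor the project's official doctest entry point) is to run the docstring through Python's `doctest` with a `SparkSession` in scope:

   ```python
   import doctest
   from pyspark.sql import SparkSession, functions as sf

   spark = SparkSession.builder.master("local[1]").getOrCreate()
   # The docstring examples import `sf` themselves, so only `spark`
   # needs to be supplied in the doctest globals.
   doctest.run_docstring_examples(
       sf.count_distinct, {"spark": spark}, verbose=True
   )
   spark.stop()
   ```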
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43686: [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct`
URL: https://github.com/apache/spark/pull/43686




Re: [PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43686:
URL: https://github.com/apache/spark/pull/43686#issuecomment-1799343621

   Merged to master.




Re: [PR] [SPARK-45260][PYTHON][DOCS] Refine docstring of `count_distinct` [spark]

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #43686:
URL: https://github.com/apache/spark/pull/43686#discussion_r1384260727


##########
python/pyspark/sql/functions.py:
##########
@@ -4626,26 +4626,38 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
 
     Examples
     --------
-    >>> from pyspark.sql import types
-    >>> df1 = spark.createDataFrame([1, 1, 3], types.IntegerType())
-    >>> df2 = spark.createDataFrame([1, 2], types.IntegerType())
-    >>> df1.join(df2).show()
-    +-----+-----+
-    |value|value|
-    +-----+-----+
-    |    1|    1|
-    |    1|    2|
-    |    1|    1|
-    |    1|    2|
-    |    3|    1|
-    |    3|    2|
-    +-----+-----+
-    >>> df1.join(df2).select(count_distinct(df1.value, df2.value)).show()
-    +----------------------------+
-    |count(DISTINCT value, value)|
-    +----------------------------+
-    |                           4|
-    +----------------------------+
+    Example 1: Counting distinct values of a single column
+
+    >>> from pyspark.sql import functions as sf
+    >>> df = spark.createDataFrame([(1,), (1,), (3,)], ["value"])
+    >>> df.select(sf.count_distinct(df.value)).show()
+    +---------------------+
+    |count(DISTINCT value)|
+    +---------------------+
+    |                    2|
+    +---------------------+
+
+    Example 2: Counting distinct values of multiple columns
+
+    >>> from pyspark.sql import functions as sf
+    >>> df1 = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
+    >>> df1.select(sf.count_distinct(df1.value1, df1.value2)).show()

Review Comment:
   ```suggestion
       >>> df = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
       >>> df.select(sf.count_distinct(df.value1, df.value2)).show()
   ```



##########
python/pyspark/sql/functions.py:
##########
@@ -4626,26 +4626,38 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
 
     Examples
     --------
-    >>> from pyspark.sql import types
-    >>> df1 = spark.createDataFrame([1, 1, 3], types.IntegerType())
-    >>> df2 = spark.createDataFrame([1, 2], types.IntegerType())
-    >>> df1.join(df2).show()
-    +-----+-----+
-    |value|value|
-    +-----+-----+
-    |    1|    1|
-    |    1|    2|
-    |    1|    1|
-    |    1|    2|
-    |    3|    1|
-    |    3|    2|
-    +-----+-----+
-    >>> df1.join(df2).select(count_distinct(df1.value, df2.value)).show()
-    +----------------------------+
-    |count(DISTINCT value, value)|
-    +----------------------------+
-    |                           4|
-    +----------------------------+
+    Example 1: Counting distinct values of a single column
+
+    >>> from pyspark.sql import functions as sf
+    >>> df = spark.createDataFrame([(1,), (1,), (3,)], ["value"])
+    >>> df.select(sf.count_distinct(df.value)).show()
+    +---------------------+
+    |count(DISTINCT value)|
+    +---------------------+
+    |                    2|
+    +---------------------+
+
+    Example 2: Counting distinct values of multiple columns
+
+    >>> from pyspark.sql import functions as sf
+    >>> df1 = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
+    >>> df1.select(sf.count_distinct(df1.value1, df1.value2)).show()
+    +------------------------------+
+    |count(DISTINCT value1, value2)|
+    +------------------------------+
+    |                             2|
+    +------------------------------+
+
+    Example 3: Counting distinct values with column names as strings
+
+    >>> from pyspark.sql import functions as sf
+    >>> df3 = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
+    >>> df3.select(sf.count_distinct("value1", "value2")).show()

Review Comment:
   ```suggestion
       >>> df = spark.createDataFrame([(1, 1), (1, 2)], ["value1", "value2"])
       >>> df.select(sf.count_distinct("value1", "value2")).show()
   ```


