You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/09/01 11:01:19 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #42770: [SPARK-45049][CONNECT][TESTS] Enable doctests for `coalesce/repartition/repartitionByRange`

zhengruifeng opened a new pull request, #42770:
URL: https://github.com/apache/spark/pull/42770

   ### What changes were proposed in this pull request?
   Enable doctests for `coalesce/repartition/repartitionByRange`, by using `explain` instead of `rdd.getNumPartitions`
   
   
   ### Why are the changes needed?
   test coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   
   ### How was this patch tested?
   updated doctests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   NO


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #42770: [SPARK-45049][CONNECT][TESTS] Enable doctests for `coalesce/repartition/repartitionByRange`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on PR #42770:
URL: https://github.com/apache/spark/pull/42770#issuecomment-1703776479

   cc @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314345886


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   @allisonwang-db I found there was a discussion on `getNumPartitions` [here](https://github.com/apache/spark/pull/16708), it seems the arguments still stand today.
   
   @HyukjinKwon good idea, let me have a try



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on PR #42770:
URL: https://github.com/apache/spark/pull/42770#issuecomment-1705847362

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314343527


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   Can we do sth like:
   
   ```python
   len(set(spark.range(10).repartition(2).select(sf.spark_partition_id()).collect()))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314400083


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   One way for such cases is to use `foreachPartition` in that case with printing each out. The problem would be the output wouldn't be caught in the Python test (as they are printed out from a separate thread from JVM IIRC).
   
   So, maybe we could do sth like:
   
   ```python
   """
   >>> df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
   >>> df.repartition(2, "age").foreachPartition(lambda it: print(list(it)))  # doctest: +SKIP
   [Row(age=14, name='Tom')]
   [Row(age=23, name='Alice'), Row(age=16, name='Bob')]
   
   ...
   """
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314388163


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   I found that current changes can reflect the explainations like
   
   ```
   Repartition the data into 7 partitions by 'age' and 'name columns.
   ```
   
   ```
           Repartition the data into 2 partitions by range in 'age' column.
           For example, the first partition can have ``(14, "Tom")``, and the second
           partition would have ``(16, "Bob")`` and ``(23, "Alice")``.
   ```
   
   while if we use the numPartitions (not matter `rdd.getNumPartitions` or `spark_partition_id()`) in the examples, it can not provide such information.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.

zhengruifeng commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314388163


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   I found that current changes can reflect the explainations like
   
   ```
           Repartition the data into 7 partitions by 'age' and 'name columns.
   ```
   
   ```
           Repartition the data into 2 partitions by range in 'age' column.
           For example, the first partition can have ``(14, "Tom")``, and the second
           partition would have ``(16, "Bob")`` and ``(23, "Alice")``.
   ```
   
   while if we use the numPartitions (not matter `rdd.getNumPartitions` or `spark_partition_id()`) in the examples, it can not provide such information.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] allisonwang-db commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1313926440


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   Hmm, I don't think displaying the physical plan is very user-friendly; it can be daunting for new Spark users who might not understand what `RoundRobinPartitioning(10)` means. Personally I prefer the original examples.
   
   Just curious, is it feasible to support `getNumPartitions()` in Spark Connect? This API is actually pretty useful.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon closed pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`
URL: https://github.com/apache/spark/pull/42770


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #42770: [SPARK-45049][CONNECT][DOCS][TESTS] Refine docstrings of `coalesce/repartition/repartitionByRange`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.

HyukjinKwon commented on code in PR #42770:
URL: https://github.com/apache/spark/pull/42770#discussion_r1314400083


##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   One way for such cases is to use `foreachPartition` in that case with printing each out. The problem would be the output wouldn't be caught in the Python test (as they are printed out from a separate thread from JVM IIRC).
   
   So, maybe we could do sth like:
   
   ```python
   df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
   df.repartition(2, "age").foreachPartition(lambda it: print(list(it)))  # doctest: +SKIP
   [Row(age=14, name='Tom')]
   [Row(age=23, name='Alice'), Row(age=16, name='Bob')]
   ```
   



##########
python/pyspark/sql/dataframe.py:
##########
@@ -1809,18 +1810,27 @@ def repartition(  # type: ignore[misc]
 
         Repartition the data into 10 partitions.
 
-        >>> df.repartition(10).rdd.getNumPartitions()
-        10
+        >>> df.repartition(10).explain()
+        == Physical Plan ==

Review Comment:
   One way for such cases is to use `foreachPartition` in that case with printing each out. The problem would be the output wouldn't be caught in the Python test (as they are printed out from a separate thread from JVM IIRC).
   
   So, maybe we could do sth like:
   
   ```python
   ... df = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
   ... df.repartition(2, "age").foreachPartition(lambda it: print(list(it)))  # doctest: +SKIP
   [Row(age=14, name='Tom')]
   [Row(age=23, name='Alice'), Row(age=16, name='Bob')]
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org