Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/05 00:59:12 UTC

[GitHub] [spark] dongjoon-hyun opened a new pull request #35737: [SPARK-38418][PYSPARK] Add PySpark cleanShuffleDependencies developer API

dongjoon-hyun opened a new pull request #35737:
URL: https://github.com/apache/spark/pull/35737


   ### What changes were proposed in this pull request?
   
   This PR adds the `cleanShuffleDependencies` developer API to the PySpark RDD, mirroring the existing Scala API.
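
   As a rough illustration of how the new method might be used (a minimal sketch, not code from this PR): the docstring added here advises checkpointing and materializing an RDD before cleaning its shuffles if the RDD will be used afterwards. The helper name `checkpoint_then_clean`, the `run_demo` function, the local master, and the temporary checkpoint directory are all assumptions made for this example.

```python
import tempfile


def checkpoint_then_clean(rdd, blocking=True):
    """Hypothetical helper: checkpoint, materialize, then clean shuffles.

    Follows the docstring's advice for RDDs that are used again after
    cleanShuffleDependencies() is called.
    """
    rdd.checkpoint()  # mark for checkpointing; must happen before any job runs on rdd
    rdd.count()       # run a job so the checkpoint is actually written
    rdd.cleanShuffleDependencies(blocking)  # the API added by this PR
    return rdd


def run_demo():
    """Assumed demo setup; requires PySpark 3.3.0 or later and a JVM."""
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "clean-shuffle-demo")
    try:
        sc.setCheckpointDir(tempfile.mkdtemp())
        # reduceByKey introduces a shuffle dependency.
        rdd = (
            sc.parallelize(range(100), 4)
            .map(lambda x: (x % 10, x))
            .reduceByKey(lambda a, b: a + b)
        )
        checkpoint_then_clean(rdd)
        # The RDD is still usable because it was checkpointed first.
        print(sorted(rdd.collect()))
    finally:
        sc.stop()
```

   Calling `run_demo()` is left to the reader, since it needs a working Spark installation. Passing `blocking=True` is a choice made for this sketch; the method's default is `False`.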
   
   ### Why are the changes needed?
   
   This is required for feature parity between PySpark and Scala.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, but this is a new API addition.
   
   ### How was this patch tested?
   
   Pass the CIs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35737:
URL: https://github.com/apache/spark/pull/35737#issuecomment-1059672778


   Thank you, @HyukjinKwon . Merged to master for Apache Spark 3.3.0.




[GitHub] [spark] dongjoon-hyun commented on pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35737:
URL: https://github.com/apache/spark/pull/35737#issuecomment-1059842594


   Hi, @thecassion .
   - First of all, this question is unrelated to this PR.
   - Although Apache Spark 3.3 is not released yet, we are going to enter the `Feature Freeze` stage on March 15th by creating `branch-3.3`, and RC1 will be available in April. Then you can install RC1 via `pip`.
   - If you want, you can also build a snapshot binary distribution manually before RC1.




[GitHub] [spark] dongjoon-hyun commented on pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35737:
URL: https://github.com/apache/spark/pull/35737#issuecomment-1059632763


   Could you review this, @HyukjinKwon ?




[GitHub] [spark] HyukjinKwon commented on a change in pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35737:
URL: https://github.com/apache/spark/pull/35737#discussion_r820009110



##########
File path: python/pyspark/rdd.py
##########
@@ -465,6 +465,21 @@ def getCheckpointFile(self) -> Optional[str]:
 
         return checkpointFile.get() if checkpointFile.isDefined() else None
 
+    def cleanShuffleDependencies(self, blocking: bool = False) -> None:
+        """
+        Removes an RDD's shuffles and its non-persisted ancestors.
+
+        When running without a shuffle service, cleaning up shuffle files enables downscaling.
+        If you use the RDD after this call, you should checkpoint and materialize it first.
+
+        .. versionadded:: 3.3.0
+
+        Notes

Review comment:
       Maybe we should also add a Parameters section (I know many of the methods here don't have one, but it would be best to have it).
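
       For illustration, a numpydoc-style Parameters section for this method might look like the sketch below. This is a standalone stub, not the code that was merged; the parameter description is an assumed wording, and the body is replaced by a placeholder so that only the docstring layout is shown.

```python
def cleanShuffleDependencies(self, blocking: bool = False) -> None:
    """
    Removes an RDD's shuffles and its non-persisted ancestors.

    When running without a shuffle service, cleaning up shuffle files enables downscaling.
    If you use the RDD after this call, you should checkpoint and materialize it first.

    .. versionadded:: 3.3.0

    Parameters
    ----------
    blocking : bool, optional
        Whether to block on shuffle cleanup tasks. Defaults to False.
        (Assumed wording, for illustration only.)
    """
    # Placeholder body: this stub exists only to show the docstring layout.
    raise NotImplementedError("illustrative stub")
```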






[GitHub] [spark] dongjoon-hyun closed pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #35737:
URL: https://github.com/apache/spark/pull/35737


   




[GitHub] [spark] HyukjinKwon commented on a change in pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #35737:
URL: https://github.com/apache/spark/pull/35737#discussion_r820009353



##########
File path: python/pyspark/rdd.py
##########
@@ -465,6 +465,21 @@ def getCheckpointFile(self) -> Optional[str]:
 
         return checkpointFile.get() if checkpointFile.isDefined() else None
 
+    def cleanShuffleDependencies(self, blocking: bool = False) -> None:
+        """
+        Removes an RDD's shuffles and its non-persisted ancestors.
+
+        When running without a shuffle service, cleaning up shuffle files enables downscaling.
+        If you use the RDD after this call, you should checkpoint and materialize it first.
+
+        .. versionadded:: 3.3.0
+
+        Notes

Review comment:
       Oh, we should also add it to the list at https://github.com/apache/spark/blob/master/python/docs/source/reference/pyspark.rst






[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #35737:
URL: https://github.com/apache/spark/pull/35737#discussion_r820009742



##########
File path: python/pyspark/rdd.py
##########
@@ -465,6 +465,21 @@ def getCheckpointFile(self) -> Optional[str]:
 
         return checkpointFile.get() if checkpointFile.isDefined() else None
 
+    def cleanShuffleDependencies(self, blocking: bool = False) -> None:
+        """
+        Removes an RDD's shuffles and its non-persisted ancestors.
+
+        When running without a shuffle service, cleaning up shuffle files enables downscaling.
+        If you use the RDD after this call, you should checkpoint and materialize it first.
+
+        .. versionadded:: 3.3.0
+
+        Notes

Review comment:
       Oh, thank you. I'll take a look at them.






[GitHub] [spark] thecassion commented on pull request #35737: [SPARK-38418][PYSPARK] Add PySpark `cleanShuffleDependencies` developer API

Posted by GitBox <gi...@apache.org>.
thecassion commented on pull request #35737:
URL: https://github.com/apache/spark/pull/35737#issuecomment-1059791176


   Hello all, how can I install this version (3.3) on my local machine using pip?

