Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/27 09:13:20 UTC

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37995: [SPARK-40556][PS][SQL] Unpersist the intermediate datasets cached in `AttachDistributedSequenceExec`

HyukjinKwon commented on code in PR #37995:
URL: https://github.com/apache/spark/pull/37995#discussion_r980983915


##########
python/pyspark/pandas/series.py:
##########
@@ -6442,6 +6445,8 @@ def argmin(self, axis: Axis = None, skipna: bool = True) -> int:
             raise ValueError("axis can only be 0 or 'index'")
         sdf = self._internal.spark_frame.select(self.spark.column, NATURAL_ORDER_COLUMN_NAME)
         seq_col_name = verify_temp_column_name(sdf, "__distributed_sequence_column__")
+
+        cached = sdf.cache()

Review Comment:
   I think we should add an internal util with a context manager, and update the documentation of `attach_distributed_sequence_column` to say that the two should always be used together.
   
   The problem is that it's very easy to forget to cache/uncache when you call `attach_distributed_sequence_column`, and it's unclear where to cache/uncache, especially for new developers.
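
A rough sketch of what such a util could look like. The name `cached` and its placement are hypothetical, not an existing pyspark.pandas API; it only assumes the frame exposes `cache()` and `unpersist()`, as Spark DataFrames do:

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def cached(sdf: Any) -> Iterator[Any]:
    """Hypothetical helper: cache `sdf` for the duration of the block.

    Assumes `sdf` has `cache()` and `unpersist()` methods, like a Spark
    DataFrame. The frame is unpersisted on exit, even if the block raises.
    """
    cached_sdf = sdf.cache()
    try:
        yield cached_sdf
    finally:
        cached_sdf.unpersist()


# Intended usage (illustrative only):
#
# with cached(sdf) as cached_sdf:
#     # attach the distributed sequence column to cached_sdf and
#     # collect the result here, before the block exits
#     ...
```

Since Spark evaluates lazily, any action that depends on the cached data would have to run inside the `with` block; once the block exits, the frame is unpersisted.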



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

