Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/27 02:58:43 UTC

[GitHub] [spark] HyukjinKwon commented on a change in pull request #29242: [SPARK-31448] [PYTHON] Fix storage level used in cache() in dataframe.py

HyukjinKwon commented on a change in pull request #29242:
URL: https://github.com/apache/spark/pull/29242#discussion_r460621898



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -674,7 +674,7 @@ def cache(self):
         .. note:: The default storage level has changed to `MEMORY_AND_DISK` to match Scala in 2.0.
         """
         self.is_cached = True
-        self._jdf.cache()
+        self.persist(StorageLevel.MEMORY_AND_DISK)

Review comment:
       A DataFrame itself is not serialized via Python unless you collect, call createDataFrame, or use Python UDFs, whereas Python RDDs are serialized via Python pickle. The DataFrame code paths should be the same as Scala's, so I don't think this change is necessary.
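
A minimal sketch (not part of the PR) illustrating the distinction the reviewer is drawing. It assumes a local Spark installation and a SparkSession; the exact StorageLevel flags printed vary across Spark versions, since Python's named StorageLevel constants have not always matched the JVM-side ones bit for bit.

```python
# Sketch (not from the PR): inspect the effective storage level a
# DataFrame gets from cache() versus an explicit persist().
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-level-demo").getOrCreate()
df = spark.range(10)

# cache() delegates to the JVM-side Dataset.cache(); the storage level
# is decided entirely on the JVM side, with no Python serialization.
df.cache()
print(df.storageLevel)
df.unpersist()

# persist() with a Python StorageLevel converts that level to its JVM
# counterpart before applying it, which is where the two paths can differ.
df.persist(StorageLevel.MEMORY_AND_DISK)
print(df.storageLevel)

spark.stop()
```

Because DataFrame data stays in the JVM, the Python-side pickling concerns that motivated the RDD storage-level mapping do not apply here, which is the reviewer's point.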




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org