Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/09/15 13:44:00 UTC
[jira] [Resolved] (SPARK-31448) Difference in Storage Levels used in cache() and persist() for pyspark dataframes
[ https://issues.apache.org/jira/browse/SPARK-31448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-31448.
----------------------------------
Fix Version/s: 3.1.0
Resolution: Fixed
Issue resolved by pull request 29242
[https://github.com/apache/spark/pull/29242]
> Difference in Storage Levels used in cache() and persist() for pyspark dataframes
> ---------------------------------------------------------------------------------
>
> Key: SPARK-31448
> URL: https://issues.apache.org/jira/browse/SPARK-31448
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3
> Reporter: Abhishek Dixit
> Assignee: Abhishek Dixit
> Priority: Major
> Fix For: 3.1.0
>
>
> There is a difference in the default storage level *MEMORY_AND_DISK* between PySpark and Scala (the StorageLevel flags are useDisk, useMemory, useOffHeap, deserialized):
> *Scala:* StorageLevel(true, true, false, true)
> *PySpark:* StorageLevel(True, True, False, False)
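> A quick way to see the Python-side default (a minimal sketch; the commented output is what PySpark's own constant prints, with replication as the fifth field):
> {code:python}
> from pyspark import StorageLevel
>
> # Flags are (useDisk, useMemory, useOffHeap, deserialized, replication)
> print(StorageLevel.MEMORY_AND_DISK)
> # StorageLevel(True, True, False, False, 1)  <- deserialized is False, unlike Scala's default
> {code}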
>
> *Problem Description:*
> Calling *df.cache()* on a PySpark DataFrame directly invokes the Scala method cache(), so the storage level used is StorageLevel(true, true, false, true).
> But calling *df.persist()* on a PySpark DataFrame first sets newStorageLevel=StorageLevel(True, True, False, False) on the Python side and then invokes the Scala method persist(newStorageLevel).
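>
> The discrepancy is visible through DataFrame.storageLevel (a minimal sketch; the commented output reflects the behavior reported above on an affected version such as 2.4.3):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.master("local[1]").getOrCreate()
> df = spark.range(10)
>
> df.cache()              # goes straight to the Scala cache()
> print(df.storageLevel)  # StorageLevel(True, True, False, True, 1) -- deserialized
> df.unpersist()
>
> df.persist()            # uses the Python-side default storage level
> print(df.storageLevel)  # StorageLevel(True, True, False, False, 1) -- serialized
> {code}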
> *Possible Fix:*
> Invoke the PySpark persist() inside the PySpark cache() instead of calling the Scala function directly, as sketched below.
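> A minimal, hypothetical sketch of that change in pyspark/sql/dataframe.py (illustrating the suggestion only, not the merged patch):
> {code:python}
> def cache(self):
>     """Persist this DataFrame with the Python-side default storage level
>     by routing through persist() instead of calling the JVM cache() directly."""
>     return self.persist()  # persist() also sets is_cached on the Python side
> {code}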
> I can raise a PR for this fix if someone can confirm that this is a bug and that the proposed fix is the correct approach.