Posted to issues@spark.apache.org by "Francesco Cavrini (Jira)" <ji...@apache.org> on 2020/01/20 08:44:00 UTC

[jira] [Created] (SPARK-30580) Why can PySpark persist data only in serialised format?

Francesco Cavrini created SPARK-30580:
-----------------------------------------

             Summary: Why can PySpark persist data only in serialised format?
                 Key: SPARK-30580
                 URL: https://issues.apache.org/jira/browse/SPARK-30580
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0
            Reporter: Francesco Cavrini


The storage levels in PySpark allow persisting data only in serialised format. There is also [a comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28] explicitly stating that "Since the data is always serialized on the Python side, all the constants use the serialized formats." While that makes total sense for RDDs, it is not clear to me why it is not possible to persist data without serialisation when using the DataFrame/Dataset APIs. In theory, in such cases, persist would only be a directive and the data would never leave the JVM, thus allowing for unserialised persistence, correct? Many thanks for the feedback!
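
For reference, a minimal sketch of what I mean (tried against Spark 2.4; the session setup and variable names are only for illustration). All of the predefined PySpark constants have deserialized=False, yet a custom StorageLevel with deserialized=True can be constructed and passed to DataFrame.persist(), since the call only forwards the level to the JVM:

{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

# The constants are StorageLevel(useDisk, useMemory, useOffHeap,
# deserialized, replication); deserialized is always False in PySpark.
print(StorageLevel.MEMORY_ONLY)  # StorageLevel(False, True, False, False, 1)

# DataFrame contents live in the JVM, so persist() is only a directive.
# A custom level with deserialized=True is accepted by the API:
df = spark.range(1000)
deserialized_memory = StorageLevel(False, True, False, True, 1)
df.persist(deserialized_memory)
df.count()  # materialise the cache
{code}

Whether the executors would then actually keep the DataFrame cache unserialised is exactly the point I am unsure about.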


