Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/01/23 02:10:00 UTC

[jira] [Commented] (SPARK-30580) Why can PySpark persist data only in serialised format?

    [ https://issues.apache.org/jira/browse/SPARK-30580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021677#comment-17021677 ] 

Hyukjin Kwon commented on SPARK-30580:
--------------------------------------

Let's ask questions on the mailing list; you may get a better answer there. See https://spark.apache.org/community.html

> Why can PySpark persist data only in serialised format?
> -------------------------------------------------------
>
>                 Key: SPARK-30580
>                 URL: https://issues.apache.org/jira/browse/SPARK-30580
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Francesco Cavrini
>            Priority: Minor
>              Labels: performance
>
> The storage levels in PySpark allow persisting data only in serialised format. There is also [a comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28] explicitly stating that "Since the data is always serialized on the Python side, all the constants use the serialized formats." While that makes total sense for RDDs, it is not clear to me why it is not possible to persist data without serialisation when using the dataframe/dataset APIs. In theory, in such cases, persist would only be a directive and the data would never leave the JVM, thus allowing for un-serialised persistence, correct? Many thanks for the feedback!
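
To illustrate the behaviour the question describes, here is a minimal self-contained sketch (not the actual pyspark source) of how the Python-side `StorageLevel` constants are shaped: every constant passes `deserialized=False`, because Python objects are pickled before they reach the JVM. The class layout below mirrors the public `StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)` constructor; the exact set of constants shown is an assumption for illustration.

```python
# Hedged sketch of PySpark's StorageLevel constants (assumption: same field
# order as the public constructor). Not the real pyspark module.

class StorageLevel:
    def __init__(self, useDisk, useMemory, useOffHeap, deserialized, replication=1):
        self.useDisk = useDisk
        self.useMemory = useMemory
        self.useOffHeap = useOffHeap
        self.deserialized = deserialized
        self.replication = replication

    def __repr__(self):
        return "StorageLevel(%s, %s, %s, %s, %s)" % (
            self.useDisk, self.useMemory, self.useOffHeap,
            self.deserialized, self.replication)

# All Python-side constants set deserialized=False: data arriving from
# Python is already serialized (pickled) bytes, so a "deserialized" JVM
# cache of it would be meaningless for RDDs.
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)

print(StorageLevel.MEMORY_ONLY.deserialized)      # → False
print(StorageLevel.MEMORY_AND_DISK.deserialized)  # → False
```

For DataFrames the situation differs, as the reporter suspects: `df.persist(...)` caches the JVM-side plan's data, and Python never holds the rows, which is why the restriction reads as an RDD-era constraint rather than a DataFrame one.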



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org