Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/05/22 22:40:17 UTC

[jira] [Resolved] (SPARK-677) PySpark should not collect results through local filesystem

     [ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen resolved SPARK-677.
------------------------------
          Resolution: Fixed
       Fix Version/s: 1.2.2
                      1.3.1
                      1.4.0
    Target Version/s: 1.3.1, 1.2.2, 1.4.0  (was: 1.2.2, 1.3.1, 1.4.0)

This was fixed for 1.3.1, 1.2.2, and 1.4.0.  I don't think that we'll do a 1.1.x backport, so I'm going to mark this as resolved.

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0, 1.4.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>             Fix For: 1.4.0, 1.3.1, 1.2.2
>
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data to the disk and reads it back in order to collect() RDDs.  On large enough datasets, this data will spill from the buffer cache and write to the physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or a FIFO.
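The socket-based collection described in the issue can be sketched in plain Python. This is a simplified illustration, not Spark's actual implementation (which streams from the JVM via Py4J-negotiated ports): `serve_results` stands in for the JVM side serving serialized partitions, and `collect_from_socket` for the Python side reading them, using length-prefixed framing so no temp file is needed.

```python
import pickle
import socket
import struct
import threading

def serve_results(results, host="127.0.0.1"):
    """Hypothetical stand-in for the JVM side: serve pickled records
    over an ephemeral localhost socket instead of a temp file."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 -> OS picks an ephemeral port
    server.listen(1)
    port = server.getsockname()[1]

    def _serve():
        conn, _ = server.accept()
        with conn:
            for item in results:
                payload = pickle.dumps(item)
                # Length-prefix each record so the reader knows its boundary.
                conn.sendall(struct.pack(">i", len(payload)))
                conn.sendall(payload)
            conn.sendall(struct.pack(">i", -1))  # end-of-stream sentinel
        server.close()

    threading.Thread(target=_serve, daemon=True).start()
    return port

def collect_from_socket(port, host="127.0.0.1"):
    """Hypothetical client side: read length-prefixed records until -1."""
    out = []
    with socket.create_connection((host, port)) as conn:
        buf = conn.makefile("rb")
        while True:
            length = struct.unpack(">i", buf.read(4))[0]
            if length < 0:
                break
            out.append(pickle.loads(buf.read(length)))
    return out

port = serve_results([1, 2, 3])
print(collect_from_socket(port))  # [1, 2, 3]
```

Because the records flow through a kernel socket buffer rather than the filesystem, large collect() results never touch the disk or spill from the page cache.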



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org