Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2014/11/21 04:46:33 UTC

[jira] [Updated] (SPARK-4517) Improve memory efficiency for python broadcast

     [ https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-4517:
------------------------------
    Target Version/s: 1.3.0  (was: 1.2.0)

> Improve memory efficiency for python broadcast
> ----------------------------------------------
>
>                 Key: SPARK-4517
>                 URL: https://issues.apache.org/jira/browse/SPARK-4517
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Davies Liu
>
> Currently, a Python broadcast (backed by TorrentBroadcast) ends up with multiple copies:
> 1) one copy in the Python driver
> 2) one copy on the driver's disk (serialized and compressed)
> 3) two copies in the driver JVM (one deserialized, one serialized and compressed)
> 4) two copies in the executor JVM (one deserialized, one serialized and compressed)
> 5) one copy in each Python worker.
> With HTTPBroadcast, items 3) and 4) differ:
> 3) one copy in driver memory, one copy on the driver's disk (serialized and compressed)
> 4) one copy in executor memory
> If the Python broadcast is 4 GB, it needs 12 GB in the driver and 8 + 4x GB in each executor (x is the number of Python workers, which is usually the number of CPUs).
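> A rough back-of-the-envelope check of the 4 GB example above (a sketch only; x is the assumed number of Python workers per executor):
> {code}
> broadcast_size = 4  # GB, already pickled and compressed on the Python side
>
> # driver: 1 copy in the Python driver + 2 copies in the driver JVM = 12 GB
> driver_memory = 3 * broadcast_size
>
> # executor: 2 copies in the executor JVM + 1 copy per Python worker = 8 + 4x GB
> x = 4  # assumed: one Python worker per CPU core
> executor_memory = (2 + x) * broadcast_size
> {code}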
> The Python broadcast is already serialized and compressed in Python, so it should not be serialized and compressed again in the JVM. Also, the JVM does not need to know its content, so the data could be kept out of the JVM entirely.
> So we should have a specialized broadcast implementation for Python: it stores the serialized and compressed data on disk, transfers it to executors in a p2p way (similar to TorrentBroadcast), and then sends it to the Python workers.
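> A minimal sketch of this idea (hypothetical code, not Spark's actual implementation; the class and method names are made up for illustration): keep the bytes that Python already produced on disk, move only the file between machines, and never deserialize it in the JVM.
> {code}
> import os
> import tempfile
> import zlib
> import pickle
>
>
> class FileBackedBroadcast(object):
>     """Hypothetical broadcast that stores the pickled, compressed payload
>     in a local file. The JVM (and the p2p transfer layer) would only ship
>     the raw bytes of this file; it never parses or re-serializes them."""
>
>     def __init__(self, value):
>         fd, self.path = tempfile.mkstemp(suffix=".broadcast")
>         with os.fdopen(fd, "wb") as f:
>             f.write(zlib.compress(pickle.dumps(value, 2)))
>
>     def load(self):
>         # Called inside a Python worker once the file has been fetched
>         # to the executor (e.g. via a TorrentBroadcast-like p2p transfer).
>         with open(self.path, "rb") as f:
>             return pickle.loads(zlib.decompress(f.read()))
> {code}
> With this layout, the only in-memory copies would be the one the user created in the Python driver and the one each Python worker loads; the driver, executors, and disk hold only the compressed bytes.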



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org