Posted to issues@spark.apache.org by "Xiao Ming Bao (JIRA)" <ji...@apache.org> on 2016/09/20 02:28:20 UTC

[jira] [Created] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

Xiao Ming Bao created SPARK-17602:
-------------------------------------

             Summary: PySpark - Performance Optimization Large Size of Broadcast Variable
                 Key: SPARK-17602
                 URL: https://issues.apache.org/jira/browse/SPARK-17602
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.5.1
         Environment: Linux
            Reporter: Xiao Ming Bao
             Fix For: 2.0.0


Problem: currently, at the executor side, the broadcast variable is written to local disk as a file, and each Python worker process reads that file and de-serializes it into a Python object before executing a task. When the broadcast variable is large, this read/de-serialization takes a long time; and when the Python worker is NOT reused and the number of tasks is large, performance becomes very poor because every task pays the read/de-serialization cost again.
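
A minimal sketch of the per-task cost described above (not Spark's actual worker.py code; the file path and dictionary payload are hypothetical stand-ins for an executor-written broadcast file):

    import os
    import pickle
    import time

    BROADCAST_PATH = "/tmp/broadcast_demo.pkl"   # hypothetical broadcast file

    # Stand-in for the executor writing the broadcast variable to local disk.
    with open(BROADCAST_PATH, "wb") as f:
        pickle.dump({i: i * i for i in range(1000000)}, f)

    def run_task(task_id):
        # Every task re-reads and re-deserializes the broadcast file from disk.
        start = time.time()
        with open(BROADCAST_PATH, "rb") as f:
            bd = pickle.load(f)
        print("task %d: loaded %d entries in %.2fs"
              % (task_id, len(bd), time.time() - start))

    # With non-reused workers and many tasks, this cost is paid for every task.
    for task_id in range(4):
        run_task(task_id)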

Brief of the solution:
 Transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and de-serialize it into an object inside the daemon process. After a worker Python process is forked from the daemon process, the worker automatically has the de-serialized object in memory and can use it directly, thanks to the copy-on-write memory semantics of fork on Linux.
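
A minimal sketch of the idea, not the proposed patch itself: a simplified daemon de-serializes the broadcast once and forks workers that inherit the in-memory object via copy-on-write, so no per-task read/de-serialize step is needed (daemon_main and broadcast_path are illustrative names, not Spark APIs):

    import os
    import pickle

    def daemon_main(broadcast_path, num_tasks):
        # De-serialize the broadcast variable once, in the daemon process.
        with open(broadcast_path, "rb") as f:
            broadcast_obj = pickle.load(f)

        children = []
        for task_id in range(num_tasks):
            pid = os.fork()   # Linux-only; mirrors how the PySpark daemon forks workers
            if pid == 0:
                # Worker process: broadcast_obj is already usable, shared with the
                # daemon through copy-on-write pages, so no disk read per task.
                print("worker %d sees %d entries" % (task_id, len(broadcast_obj)))
                os._exit(0)
            children.append(pid)

        for pid in children:
            os.waitpid(pid, 0)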



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
