Posted to issues@spark.apache.org by "Miao Wang (JIRA)" <ji...@apache.org> on 2016/09/20 22:59:20 UTC

[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

    [ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15508065#comment-15508065 ] 

Miao Wang commented on SPARK-17602:
-----------------------------------

Does this change also benefit or otherwise affect Windows?

> PySpark - Performance Optimization Large Size of Broadcast Variable
> -------------------------------------------------------------------
>
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>         Attachments: PySpark – Performance Optimization for Large Size of Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: Currently, on the executor side, the broadcast variable is written to disk as a file, and each Python worker process reads it from local disk and deserializes it into a Python object before executing a task. When the broadcast variable is large, this read and deserialization take a long time. When Python workers are NOT reused and the number of tasks is large, performance is very poor, because a Python worker must read and deserialize the variable for every task.
> Brief of the solution:
>  Transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and deserialize it into an object in the daemon. After the daemon forks a worker Python process, the worker automatically has the deserialized object and can use it directly, thanks to Linux's copy-on-write memory semantics.
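
The proposed approach can be illustrated with a minimal, standalone sketch. This is not Spark's code: the names load_broadcast, run_forked_workers, and demo are hypothetical, and a pickled list stands in for the broadcast file. It assumes a POSIX system where os.fork is available. The parent plays the role of the daemon and deserializes once; each forked child then reads the object without touching the file again.

```python
import os
import pickle
import tempfile


def load_broadcast(path):
    """Deserialize the (stand-in) broadcast file into a Python object."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run_forked_workers(path, num_workers):
    """Deserialize once in the parent, then fork workers that reuse the object."""
    value = load_broadcast(path)  # the only deserialization, done in the "daemon"
    results = []
    for i in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child ("worker"): `value` is already in memory via fork,
            # so no file read or unpickling of the broadcast happens here.
            os.close(read_fd)
            try:
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump((i, len(value), value[i]), out)
            finally:
                os._exit(0)  # leave the child without running parent cleanup
        # Parent ("daemon"): collect the small result from the worker.
        os.close(write_fd)
        with os.fdopen(read_fd, "rb") as inp:
            results.append(pickle.load(inp))
        os.waitpid(pid, 0)
    return results


def demo():
    """Write a stand-in broadcast file, run three workers, and clean up."""
    data = list(range(1000))
    with tempfile.NamedTemporaryFile(delete=False) as f:
        pickle.dump(data, f)
        path = f.name
    try:
        return run_forked_workers(path, 3)
    finally:
        os.remove(path)
```

Calling demo() returns one tuple per worker, each built from the object inherited from the parent. Two caveats apply to this sketch: os.fork does not exist on Windows, and CPython updates reference counts inside object headers, so merely reading an inherited object can cause some of its memory pages to be copied in the child.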



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
