Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/10/19 09:09:00 UTC
[jira] [Work logged] (BEAM-5775) Make the spark runner not serialize data unless spark is spilling to disk

     [ https://issues.apache.org/jira/browse/BEAM-5775?focusedWorklogId=330906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330906 ]

ASF GitHub Bot logged work on BEAM-5775:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Oct/19 09:08
            Start Date: 19/Oct/19 09:08
    Worklog Time Spent: 10m 
      Work Description: stale[bot] commented on issue #8371: [BEAM-5775] Move (most) of the batch spark pipelines' transformations to using lazy serialization.
URL: https://github.com/apache/beam/pull/8371#issuecomment-544118190
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 330906)
    Time Spent: 12h  (was: 11h 50m)

> Make the spark runner not serialize data unless spark is spilling to disk
> -------------------------------------------------------------------------
>
>                 Key: BEAM-5775
>                 URL: https://issues.apache.org/jira/browse/BEAM-5775
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Mike Kaplinskiy
>            Assignee: Mike Kaplinskiy
>            Priority: Minor
>             Fix For: 2.17.0
>
>          Time Spent: 12h
>  Remaining Estimate: 0h
>
> Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data (that is, it does not encode the elements to bytes with their coders). This lets Spark keep the data in memory, avoiding the serialization round trip. Unfortunately, the logic is fairly coarse: as soon as you switch to MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen to keep it in memory, so the serialization overhead is incurred anyway.
>  
> Ideally, Beam would serialize the data lazily, only when Spark chooses to spill to disk. This would be a change in behavior when using Beam, but luckily Spark has a solution for folks who want data serialized in memory: MEMORY_AND_DISK_SER will keep the data serialized.
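
The lazy serialization idea described above can be sketched in plain Java. This is a minimal illustration only, not the code from the pull request and not Beam's API: `SimpleCoder`, `CountingStringCoder`, `LazyValue` and `LazyValueDemo` are hypothetical names, and standard Java serialization stands in for whatever serializer Spark is configured to use. The wrapper keeps the live object and runs the coder only when the serializer actually asks for bytes, which is roughly the moment a spill to disk would happen.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a Beam coder; NOT the real org.apache.beam.sdk.coders.Coder API.
interface SimpleCoder<T> extends Serializable {
  byte[] encode(T value);
  T decode(byte[] bytes);
}

// Hypothetical string coder that counts how often it is asked to encode.
final class CountingStringCoder implements SimpleCoder<String> {
  private static final long serialVersionUID = 1L;
  int encodeCalls = 0;

  @Override
  public byte[] encode(String value) {
    encodeCalls++;
    return value.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}

// Hypothetical wrapper: holds the live object and encodes it only when
// Java serialization is actually triggered (for example on a spill).
final class LazyValue<T> implements Serializable {
  private static final long serialVersionUID = 1L;
  private transient T value;
  private transient SimpleCoder<T> coder;

  LazyValue(T value, SimpleCoder<T> coder) {
    this.value = value;
    this.coder = coder;
  }

  // In-memory access never touches the coder.
  T get() {
    return value;
  }

  // Called by ObjectOutputStream; this is the only place the value is encoded.
  private void writeObject(ObjectOutputStream out) throws IOException {
    out.writeObject(coder);
    byte[] bytes = coder.encode(value);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Called by ObjectInputStream; restores the coder, then decodes the value.
  @SuppressWarnings("unchecked")
  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
    coder = (SimpleCoder<T>) in.readObject();
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    value = coder.decode(bytes);
  }
}

final class LazyValueDemo {
  // Simulates a spill and reload by serializing to a byte array and back.
  @SuppressWarnings("unchecked")
  static <T> LazyValue<T> roundTrip(LazyValue<T> v) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
        oos.writeObject(v);
      }
      try (ObjectInputStream ois =
          new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
        return (LazyValue<T>) ois.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    CountingStringCoder coder = new CountingStringCoder();
    LazyValue<String> inMemory = new LazyValue<>("hello", coder);
    System.out.println("value=" + inMemory.get() + " encodeCalls=" + coder.encodeCalls);
    LazyValue<String> reloaded = roundTrip(inMemory);
    System.out.println("value=" + reloaded.get() + " encodeCalls=" + coder.encodeCalls);
  }
}
```

With a wrapper along these lines, data cached under MEMORY_AND_DISK would pay the encoding cost only for the elements that are actually written out, while anyone who wants the data held in serialized form in memory could still choose MEMORY_AND_DISK_SER.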



--
This message was sent by Atlassian Jira
(v8.3.4#803005)