You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Ismaël Mejía (Jira)" <ji...@apache.org> on 2019/10/30 22:48:00 UTC

[jira] [Updated] (BEAM-8470) Create a new Spark runner based on Spark Structured streaming framework

     [ https://issues.apache.org/jira/browse/BEAM-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismaël Mejía updated BEAM-8470:
-------------------------------
    Status: Open  (was: Triage Needed)

> Create a new Spark runner based on Spark Structured streaming framework
> -----------------------------------------------------------------------
>
>                 Key: BEAM-8470
>                 URL: https://issues.apache.org/jira/browse/BEAM-8470
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>            Priority: Major
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> h1. Why is it worth creating a new runner based on structured streaming:
> Because this new framework brings:
>  * Unified batch and streaming semantics:
>  * no more RDD/DStream distinction, as in Beam (only PCollection)
>  * Better state management:
>  * incremental state instead of saving all each time
>  * No more synchronous saving delaying computation: per batch and partition delta file saved asynchronously + in-memory hashmap synchronous put/get
>  * Schemas in datasets:
>  * The dataset knows the structure of the data (fields) and can optimize later on
>  * Schemas in PCollection in Beam
>  * New Source API
>  * Very close to Beam bounded source and unbounded sources
> h1. Why make a new runner from scratch?
>  * Structured streaming framework is very different from the RDD/Dstream framework
> h1. We hope to gain
>  * More up to date runner in terms of libraries: leverage new features
>  * Leverage learnt practices from the previous runners
>  * Better performance thanks to the DAG optimizer (catalyst) and by simplifying the code.
>  * Simplify the code and ease the maintenance
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)