You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Ismaël Mejía (Jira)" <ji...@apache.org> on 2020/01/24 08:29:00 UTC
[jira] [Updated] (BEAM-8470) Create a new Spark runner based on
Spark Structured streaming framework
[ https://issues.apache.org/jira/browse/BEAM-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated BEAM-8470:
-------------------------------
Issue Type: New Feature (was: Improvement)
> Create a new Spark runner based on Spark Structured streaming framework
> -----------------------------------------------------------------------
>
> Key: BEAM-8470
> URL: https://issues.apache.org/jira/browse/BEAM-8470
> Project: Beam
> Issue Type: New Feature
> Components: runner-spark
> Reporter: Etienne Chauchot
> Assignee: Etienne Chauchot
> Priority: Major
> Labels: structured-streaming
> Fix For: 2.18.0
>
> Time Spent: 16h
> Remaining Estimate: 0h
>
> h1. Why is it worth creating a new runner based on structured streaming:
> Because this new framework brings:
> * Unified batch and streaming semantics:
> * no more RDD/DStream distinction, as in Beam (only PCollection)
> * Better state management:
> * incremental state instead of saving all each time
> * No more synchronous saving delaying computation: per batch and partition delta file saved asynchronously + in-memory hashmap synchronous put/get
> * Schemas in datasets:
> * The dataset knows the structure of the data (fields) and can optimize later on
> * Schemas in PCollection in Beam
> * New Source API
> * Very close to Beam bounded source and unbounded sources
> h1. Why make a new runner from scratch?
> * Structured streaming framework is very different from the RDD/Dstream framework
> h1. We hope to gain
> * More up to date runner in terms of libraries: leverage new features
> * Leverage learnt practices from the previous runners
> * Better performance thanks to the DAG optimizer (catalyst) and by simplifying the code.
> * Simplify the code and ease the maintenance
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)