You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Etienne Chauchot (Jira)" <ji...@apache.org> on 2019/10/24 13:27:00 UTC
[jira] [Created] (BEAM-8470) Create a new Spark runner based on
Spark Structured streaming framework
Etienne Chauchot created BEAM-8470:
--------------------------------------
Summary: Create a new Spark runner based on Spark Structured streaming framework
Key: BEAM-8470
URL: https://issues.apache.org/jira/browse/BEAM-8470
Project: Beam
Issue Type: Improvement
Components: runner-spark
Reporter: Etienne Chauchot
Assignee: Etienne Chauchot
h1. Why is it worth creating a new runner based on structured streaming:
Because this new framework brings:
* Unified batch and streaming semantics:
* no more RDD/DStream distinction, as in Beam (only PCollection)
* Better state management:
* incremental state instead of saving all each time
* No more synchronous saving delaying computation: per batch and partition delta file saved asynchronously + in-memory hashmap synchronous put/get
* Schemas in datasets:
* The dataset knows the structure of the data (fields) and can optimize later on
* Schemas in PCollection in Beam
* New Source API
* Very close to Beam bounded source and unbounded sources
h1. Why make a new runner from scratch?
* Structured streaming framework is very different from the RDD/Dstream framework
h1. We hope to gain
* More up to date runner in terms of libraries: leverage new features
* Leverage learnt practices from the previous runners
* Better performance thanks to the DAG optimizer (catalyst) and by simplifying the code.
* Simplify the code and ease the maintenance
--
This message was sent by Atlassian Jira
(v8.3.4#803005)