Posted to dev@spark.apache.org by Ahmed Mahran <ah...@gmail.com> on 2016/07/10 00:27:51 UTC

Reactive Spark: generalizing streaming API from micro-batches to event-based

Hi,

I'd like to present an idea about generalizing the legacy streaming API a
bit. The streaming API assumes an equi-frequent micro-batch model: streaming
data are allocated to a batch, and jobs are submitted for it, every fixed
interval of time (the batchDuration). This model could be extended: instead
of generating a batch every batchDuration, batch generation could be
event-based, such that Spark listens to event sources and generates a batch
whenever an event fires. The equi-frequent micro-batch model then becomes a
special case: a timer event source that fires an event every batchDuration.
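
For concreteness, here is a minimal, self-contained Scala sketch of what
such an abstraction might look like. The EventSource trait and
TimerEventSource class are purely illustrative names of mine, not the
actual Spark or rich-spark API:

    import java.util.{Timer, TimerTask}

    // An event source fires events; each fired event triggers the
    // generation and submission of one batch.
    trait EventSource {
      def start(onEvent: Long => Unit): Unit
    }

    // Firing an event every batchDuration recovers the legacy
    // equi-frequent micro-batch behavior as a special case.
    class TimerEventSource(batchDurationMs: Long) extends EventSource {
      def start(onEvent: Long => Unit): Unit = {
        val timer = new Timer("timer-event-source", true)
        timer.scheduleAtFixedRate(new TimerTask {
          def run(): Unit = onEvent(System.currentTimeMillis())
        }, 0L, batchDurationMs)
      }
    }

    object Demo extends App {
      // Stand-in for batch generation and job submission.
      new TimerEventSource(1000L).start { t =>
        println(s"generate batch for event at $t")
      }
      Thread.sleep(3500L) // let a few timer events fire
    }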

This allows fine-grained scheduling of Spark jobs, and the same code could
run as either a streaming or a batch job. Under this model, a batch job
could easily be configured to run periodically, with no need to deploy or
configure an external scheduler (like Apache Oozie or Linux cron). The
model also naturally supports jobs with dependencies that span time, such
as daily log transformations concluded by a weekly aggregation; see the
sketch below.
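
As a hypothetical usage sketch of that last point (again with illustrative
names only, not an existing API): a daily timer drives the log
transformation, and a weekly timer drives the aggregation over the seven
most recent daily outputs:

    import java.util.{Timer, TimerTask}

    object PeriodicJobsDemo extends App {
      val timer = new Timer("job-scheduler", true)

      // Run `job` every `periodMs` milliseconds (fires once immediately).
      def every(periodMs: Long)(job: Long => Unit): Unit =
        timer.scheduleAtFixedRate(new TimerTask {
          def run(): Unit = job(System.currentTimeMillis())
        }, 0L, periodMs)

      val day  = 24L * 60 * 60 * 1000
      val week = 7 * day

      // Daily log transformation; its outputs feed the weekly job below.
      every(day) { t => println(s"transform logs for day ending at $t") }

      // Weekly aggregation over the last seven daily outputs.
      every(week) { t => println(s"aggregate last 7 daily outputs at $t") }

      Thread.sleep(2000L) // keep the demo alive past the first firings
    }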

Please find more details at
https://github.com/mashin-io/rich-spark/blob/eventum-master/docs/reactive-spark-doc.md

I'd like to discuss how worthwhile this idea might be (especially given
the structured streaming API).

Looking forward to your feedback.

Regards,
Ahmed Mahran