Posted to user@flink.apache.org by Peter Zende <pe...@gmail.com> on 2018/05/02 11:36:01 UTC

Wiring batch and stream together

Hi,

We have a Flink streaming pipeline (1.4.2) which reads from Kafka, uses
mapWithState with RocksDB, and writes the updated states to Cassandra.
We would also like to reprocess the ingested records from HDFS. For this we
are considering computing the latest state of the records over the whole
dataset in a batch manner instead of reading them record by record.
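
For reference, here is a minimal sketch of our current pipeline (topic
name, record schema, and paths are made up for illustration; the real state
logic is more involved):

    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

    object StreamingJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))

        val props = new Properties()
        props.setProperty("bootstrap.servers", "kafka:9092")

        env
          .addSource(new FlinkKafkaConsumer011[String]("records", new SimpleStringSchema(), props))
          .map { line => val f = line.split(','); (f(0), f(1).toLong) }  // (key, value)
          .keyBy(_._1)
          .mapWithState[(String, Long), Long] { (in, state) =>
            // fold the new record into the per-key state and emit the update
            val updated = state.getOrElse(0L) + in._2
            ((in._1, updated), Some(updated))
          }
          .print()  // stand-in for the Cassandra sink (flink-connector-cassandra)

        env.execute("kafka-to-cassandra")
      }
    }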

What are the options (best practices) for bringing batch and streaming
together (FLINK-2320 is still open at the moment)? Is it possible to build
the RocksDB state "offline" and share it with the streaming job?

Ideally we would have a single job that switches from batch to streaming
once all records have been read from HDFS.


Thanks,
Peter

Re: Wiring batch and stream together

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Peter,

Building the state for a DataStream job in a DataSet (batch) job is
currently not possible.

You can, however, implement a DataStream job that reads the batch data and
builds the state. Once all data has been processed, you'd save the state as
a savepoint and resume the streaming job from it.
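
To make that concrete, here is a rough sketch of such a bootstrap job
(paths, record schema, and state logic are placeholders; the important part
is that the keyBy, the state function, and the operator uid match the
streaming job, which therefore also needs an explicit uid on that operator):

    import org.apache.flink.api.java.io.TextInputFormat
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.sink.DiscardingSink
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode
    import org.apache.flink.streaming.api.scala._

    object BootstrapJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Use the same state backend as the streaming job.
        // PROCESS_CONTINUOUSLY keeps the source alive after the historic
        // files have been read; a finished job can no longer be savepointed.
        env
          .readFile(
            new TextInputFormat(new Path("hdfs:///data/records")),
            "hdfs:///data/records",
            FileProcessingMode.PROCESS_CONTINUOUSLY,
            60000L)
          .map { line => val f = line.split(','); (f(0), f(1).toLong) }
          .keyBy(_._1)
          .mapWithState[(String, Long), Long] { (in, state) =>
            val updated = state.getOrElse(0L) + in._2
            ((in._1, updated), Some(updated))
          }
          .uid("state-op")  // must match the uid in the streaming job
          .addSink(new DiscardingSink[(String, Long)])

        env.execute("bootstrap-state")
      }
    }

Once the input is exhausted (e.g., the source's numRecordsOut metric stops
increasing), take the savepoint and start the real job from it:

    bin/flink savepoint <jobId> hdfs:///flink/savepoints
    bin/flink cancel <jobId>
    bin/flink run -s <savepointPath> streaming-job.jar

(bin/flink cancel -s <targetDirectory> <jobId> does the first two steps in
one go.)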
There are a couple of challenges along the way, though.
If the logic depends on time, you might need to read the data in time
order, which is not easy. Alternatively, you can collect all data in state
and perform the computations at the end (see the sketch below).
Another problem is the seamless switch from historic to real-time data;
identifying the right moment to take the savepoint is also not easy.
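
For the "collect everything and compute at the end" variant, one pattern
that could work (an untested sketch with illustrative types) is to buffer
the records per key and rely on the final Long.MaxValue watermark that
Flink emits when a bounded input finishes; with event time enabled, that
watermark fires all pending event-time timers:

    import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.ProcessFunction
    import org.apache.flink.streaming.api.scala._  // createTypeInformation
    import org.apache.flink.util.Collector
    import scala.collection.JavaConverters._

    // Applied as: stream.keyBy(_._1).process(new CollectAndFold), with
    // env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime).
    class CollectAndFold extends ProcessFunction[(String, Long), (String, Long)] {

      @transient private var buffer: ListState[(String, Long)] = _

      override def open(parameters: Configuration): Unit = {
        buffer = getRuntimeContext.getListState(
          new ListStateDescriptor("buffer", createTypeInformation[(String, Long)]))
      }

      override def processElement(
          in: (String, Long),
          ctx: ProcessFunction[(String, Long), (String, Long)]#Context,
          out: Collector[(String, Long)]): Unit = {
        buffer.add(in)
        // Registering the same timer repeatedly is deduplicated; it fires
        // only when the final Long.MaxValue watermark arrives at end of input.
        ctx.timerService().registerEventTimeTimer(Long.MaxValue - 1)
      }

      override def onTimer(
          timestamp: Long,
          ctx: ProcessFunction[(String, Long), (String, Long)]#OnTimerContext,
          out: Collector[(String, Long)]): Unit = {
        val records = buffer.get().asScala.toList
        if (records.nonEmpty) {
          // Compute the final state for this key in a single pass.
          out.collect((records.head._1, records.map(_._2).sum))
        }
      }
    }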

There was a good talk at Flink Forward about exactly this topic that I'd
recommend watching [1].

Best,
Fabian

[1]
https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
