Posted to user@flink.apache.org by Chengzhi Zhao <w....@gmail.com> on 2018/02/26 23:07:27 UTC

Suggested way to backfill for datastream

Hey, flink community,

I have a question about backfilling data and want to get some ideas on how people
approach it.

I have a stream of data going through BucketingSink to S3 and then to Redshift.
If the logic in Flink changes, I need to backfill some dates; for example, we
are streaming data for today but also need to backfill the data for
02/01/2018 - 02/10/2018.

What's the suggested way to implement it? Should I have a separate process
to backfill the data and then shut it down?

Thanks,
Chengzhi

Re: Suggested way to backfill for datastream

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi Chengzhi,

Yes, generally speaking, you would launch a separate job to do the backfilling, and then shut it down after the backfilling is completed.
For this to work, you’ll also have to keep in mind that writes to the external sink must be idempotent.
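
For example, with BucketingSink, one way to keep re-runs idempotent at the partition level is to bucket records by their own event time rather than by processing time, so a backfill run writes into the same date folders as the original run. A sketch (the Event type and its getTimestamp() getter are hypothetical):

    import java.io.Serializable;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.flink.streaming.connectors.fs.Clock;
    import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
    import org.apache.hadoop.fs.Path;

    // Hypothetical record type carrying its own event-time field (epoch millis).
    class Event implements Serializable {
        long timestamp;
        long getTimestamp() { return timestamp; }
    }

    // Buckets each record by its embedded event time, so a record for
    // 02/01/2018 always lands under .../2018-02-01, no matter when the
    // (backfill) job actually runs.
    public class EventTimeBucketer implements Bucketer<Event> {
        private final SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");

        @Override
        public Path getBucketPath(Clock clock, Path basePath, Event element) {
            return new Path(basePath, fmt.format(new Date(element.getTimestamp())));
        }
    }

You would then call sink.setBucketer(new EventTimeBucketer()) and, before reloading a date into Redshift, replace that date partition wholesale (e.g. delete-and-COPY per day), so that re-running the backfill does not duplicate rows.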

Are you using Kafka as the data source?

If yes, then Flink 1.5.0 will support specifying the startup position of the Flink Kafka Consumer using a specific timestamp.
You will not, however, be able to set an ending timestamp for the consumption. Therefore, what you could do is monitor whether the backfilling job has reached the head of the stream, and then shut it down.
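
A rough sketch of what the backfill job's source could look like, assuming Flink 1.5.0+ with the Kafka 0.10 connector (the topic name, cut-off timestamps, and event-time extraction are hypothetical):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

    public class BackfillFromKafka {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker:9092");
            props.setProperty("group.id", "backfill-20180201-20180210");

            FlinkKafkaConsumer010<String> consumer =
                new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props);

            // Start from 02/01/2018 00:00:00 UTC (epoch millis).
            consumer.setStartFromTimestamp(1517443200000L);

            // There is no ending timestamp, so drop everything at or after
            // 02/11/2018 00:00:00 UTC inside the job; shut the job down once
            // it has caught up to the head of the topic (e.g. consumer lag ~ 0).
            final long endExclusive = 1518307200000L;
            DataStream<String> backfill = env
                .addSource(consumer)
                .filter(record -> extractEventTime(record) < endExclusive);

            backfill.print(); // replace with the real S3 sink
            env.execute("kafka-backfill");
        }

        // Hypothetical: pull the event timestamp out of the record payload.
        private static long extractEventTime(String record) {
            return Long.parseLong(record.split(",")[0]);
        }
    }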

If you are using Kinesis as the data source, then the Flink Kinesis connector already supports startup using timestamps (but again, you cannot specify an ending timestamp).
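
Roughly, with the Kinesis connector that looks like this (the stream name and region are hypothetical; the start timestamp is given as epoch seconds):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class BackfillFromKinesis {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties config = new Properties();
            config.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
            // Start reading at a point in time; no ending timestamp here either.
            config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "AT_TIMESTAMP");
            config.setProperty(ConsumerConfigConstants.STREAM_INITIAL_TIMESTAMP,
                "1517443200"); // 02/01/2018 00:00:00 UTC

            env.addSource(new FlinkKinesisConsumer<>(
                    "events", new SimpleStringSchema(), config))
               .print(); // replace with the real sink
            env.execute("kinesis-backfill");
        }
    }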

I have been thinking about allowing users to set ending partition offsets / an ending timestamp, so that it would be easier to consume only a static set of data with the Kinesis / Kafka consumers.
This might have been useful in your case, but as of now it isn’t on the roadmap yet.

Cheers,
Gordon
