Posted to user@flink.apache.org by Tao Xia <ta...@udacity.com> on 2018/04/30 21:15:08 UTC

merge/union fast and slow streams based on event timestamp

I am running into a problem when processing the past 7 days of data from
multiple streams.  I am trying to union the streams based on event
timestamp.

The problem is that some streams are significantly bigger than others. For
example, one stream has 1,000 events/sec and another has 1,000,000
events/sec.

I am using a PriorityQueue to sort the events by event timestamp. Since the
watermarks of the fast (smaller) streams move much faster than those of the
slow (bigger) streams, lots of events from the faster streams end up in the
queue waiting for the slower streams to catch up, and the job eventually
runs out of memory.
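
For readers following along, here is a minimal sketch of the kind of
buffering described above, written in plain Java rather than against the
Flink API. The class and method names (EventTimeMerger, onEvent,
onWatermark) are hypothetical and not taken from the original job. Events
are held in a PriorityQueue and only released up to the lowest watermark
seen across all inputs, which is why a lagging stream makes the buffer grow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper for illustration; plain Java, not a Flink API.
final class EventTimeMerger {

    // One buffered record: its event timestamp plus an opaque payload.
    static final class Event {
        final long timestamp;
        final String payload;

        Event(long timestamp, String payload) {
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Min-heap ordered by event timestamp.
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>((a, b) -> Long.compare(a.timestamp, b.timestamp));

    // Latest watermark seen per input stream.
    private final Map<String, Long> watermarks = new HashMap<>();

    EventTimeMerger(List<String> streamIds) {
        if (streamIds.isEmpty()) {
            throw new IllegalArgumentException("at least one stream is required");
        }
        // Streams that have not sent a watermark yet hold everything back.
        for (String id : streamIds) {
            watermarks.put(id, Long.MIN_VALUE);
        }
    }

    // Buffers an event from any stream.
    void onEvent(Event event) {
        buffer.add(event);
    }

    // Records a watermark for one stream and returns, in timestamp order,
    // the events at or below the lowest watermark across all streams.
    List<Event> onWatermark(String streamId, long watermark) {
        watermarks.merge(streamId, watermark, Long::max);
        long low = lowWatermark();
        List<Event> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= low) {
            ready.add(buffer.poll());
        }
        return ready;
    }

    long lowWatermark() {
        return Collections.min(watermarks.values());
    }

    // Number of events still waiting for the slowest stream.
    int buffered() {
        return buffer.size();
    }
}
```

Nothing is released while any stream's watermark lags, so buffered() keeps
growing at the rate of the streams that are ahead.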

Is there any way to apply back pressure to the fast streams so they slow
down, or to somehow coordinate the watermarks across all the streams?

I am planning to use external storage to track the low watermark across all
the streams, so that we don't read events we cannot yet handle into the
PriorityQueue.

Any better suggestions?

Thanks,
Tao

Re: merge/union fast and slow streams based on event timestamp

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Tao,

There's no built-in mechanism for this in Flink.
Throttling streams and creating back pressure is not a good idea in general
because it prevents checkpoint barriers (which must be aligned with the
events) from arriving at the operators.
This might cause checkpoints to time out.

The only way to prevent this is to not emit records from a source in the
first place.
Once a record is emitted, it should be processed if possible.
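
As a rough illustration of holding records back at the source, the sketch
below shows a gate that a source's read loop could consult before emitting
each record. It is plain Java and not a Flink API. ReadAheadGate, mayEmit
and finish are hypothetical names, and the in-memory map is only a stand-in
for whatever external storage shares progress between sources in a real
distributed job. It assumes each source reads its records in roughly
ascending timestamp order.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate for illustration; plain Java, not a Flink API.
// The map stands in for an external store shared by all sources.
final class ReadAheadGate {

    // Timestamp of the next record each source wants to emit.
    private final Map<String, Long> positions = new ConcurrentHashMap<>();

    // How far (in event time) a source may run ahead of the slowest one.
    private final long maxAheadMillis;

    ReadAheadGate(Iterable<String> sourceIds, long maxAheadMillis) {
        if (maxAheadMillis < 0) {
            throw new IllegalArgumentException("maxAheadMillis must be >= 0");
        }
        this.maxAheadMillis = maxAheadMillis;
        // Sources that have not reported yet hold everyone else back.
        for (String id : sourceIds) {
            positions.put(id, Long.MIN_VALUE);
        }
    }

    // Publishes the caller's next timestamp, then reports whether that
    // record is within maxAheadMillis of the slowest source's position.
    // The slowest source always passes, so the group keeps moving.
    boolean mayEmit(String sourceId, long nextTimestamp) {
        positions.merge(sourceId, nextTimestamp, Long::max);
        long low = lowest();
        // Saturate instead of overflowing for very large positions.
        long limit = low > Long.MAX_VALUE - maxAheadMillis
                ? Long.MAX_VALUE
                : low + maxAheadMillis;
        return nextTimestamp <= limit;
    }

    // Removes a source that has no more records so it stops holding
    // the others back.
    void finish(String sourceId) {
        positions.remove(sourceId);
    }

    long lowest() {
        long min = Long.MAX_VALUE;
        for (long p : positions.values()) {
            min = Math.min(min, p);
        }
        return min;
    }
}
```

A source would call mayEmit before emitting a record and, when it returns
false, wait briefly and ask again instead of emitting. Because nothing is
emitted while the source waits, no records pile up downstream in the
meantime.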

Best, Fabian
