You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Kaustubh Rudrawar <ka...@box.com> on 2019/02/08 02:19:56 UTC

Flink Job and Watermarking

Hi,

I'm writing a job that wants to make an HTTP request once a watermark has
reached all tasks of an operator. It would be great if this could be
determined from outside the Flink job, but I don't think it's possible to
access watermark information for the job as a whole. Below is a workaround
I've come up with:

   1. Read messages from Kafka using the provided KafkaSource. Event time
   will be defined as a timestamp within the message.
   2. Key the stream based on an id from the message.
   3. DedupOperator that dedupes messages. This operator will run with a
   parallelism of N.
   4. An operator that persists the messages to S3. It doesn't need to
   output anything - it should ideally be a Sink (if it were a sink we could
   use the StreamingFileSink).
   5. Implement an operator that will make an HTTP request once
   processWatermark is called for time T. A parallelism of 1 will be used for
   this operator as it will do very little work. Because it has a parallelism
   of 1, the operator in step 4 cannot send anything to it as it could become
   a throughput bottleneck.

Does this implementation seem like a valid workaround? Any other
alternatives I should consider?

Thanks for your help,
Kaustubh

Re: Flink Job and Watermarking

Posted by Chesnay Schepler <ch...@apache.org>.

Have you considered using the metric system to access the current 
watermarks for each operator? (see 
https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#io)

On 08.02.2019 03:19, Kaustubh Rudrawar wrote:
> Hi,
>
> I'm writing a job that wants to make an HTTP request once a watermark 
> has reached all tasks of an operator. It would be great if this could 
> be determined from outside the Flink job, but I don't think it's 
> possible to access watermark information for the job as a whole. Below 
> is a workaround I've come up with:
>
>  1. Read messages from Kafka using the provided KafkaSource. Event
>     time will be defined as a timestamp within the message.
>  2. Key the stream based on an id from the message.
>  3. DedupOperator that dedupes messages. This operator will run with a
>     parallelism of N.
>  4. An operator that persists the messages to S3. It doesn't need to
>     output anything - it should ideally be a Sink (if it were a sink
>     we could use the StreamingFileSink).
>  5. Implement an operator that will make an HTTP request once
>     processWatermark is called for time T. A parallelism of 1 will be
>     used for this operator as it will do very little work. Because it
>     has a parallelism of 1, the operator in step 4 cannot send
>     anything to it as it could become a throughput bottleneck.
>
> Does this implementation seem like a valid workaround? Any other 
> alternatives I should consider?
>
> Thanks for your help,
> Kaustubh
>
>