Posted to user@flink.apache.org by Konstantin Knauf <ko...@ververica.com> on 2019/06/03 08:01:20 UTC

Re: How to build dependencies and connections between stream jobs?

Hi Henry,

Apache Kafka or other message queues like Apache Pulsar or AWS Kinesis are
in general the most common way to connect multiple streaming jobs. The
dependencies between streaming jobs are, in my experience, of a different
nature though. For batch jobs, it makes sense to schedule one after the
other or to have more complicated relationships. Streaming jobs all process
data continuously, so the "coordination" happens on a different level.

To avoid duplication, you can use the Kafka exactly-once sink, but this
comes with a latency penalty (transactions are only committed on checkpoint
completion).
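
For illustration, here is a minimal sketch of such a setup with the universal Kafka
connector (Flink 1.8-era API). The topic name, properties and checkpoint interval are
assumptions, and constructor variants differ between connector versions:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Transactions are only committed when a checkpoint completes, so checkpointing must
    // be enabled; the checkpoint interval becomes the minimum end-to-end latency.
    env.enableCheckpointing(60_000);

    Properties producerProps = new Properties();
    producerProps.setProperty("bootstrap.servers", "kafka:9092");  // assumption
    // Must not exceed transaction.max.timeout.ms on the brokers (15 minutes by default).
    producerProps.setProperty("transaction.timeout.ms", "900000");

    FlinkKafkaProducer<String> exactlyOnceSink = new FlinkKafkaProducer<>(
            "output_topic",                                        // assumption
            new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
            producerProps,
            FlinkKafkaProducer.Semantic.EXACTLY_ONCE);

Note that the downstream job then needs to read with isolation.level=read_committed,
otherwise it will still see records from aborted transactions.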

Generally, I would advise always attaching meaningful timestamps to your
records, so that you can use watermarking [1] to trade off latency against
completeness. These could also be used to identify late records (resulting
from catch-up after recovery), which should be ignored by downstream jobs.
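
For example, with the 1.8 DataStream API this could look roughly as follows (the Event
type, its timestamp field and the 30 second bound are assumptions):

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    // Allow events to arrive up to 30 seconds out of order: a larger bound favours
    // completeness, a smaller bound favours latency.
    DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(30)) {
                @Override
                public long extractTimestamp(Event event) {
                    return event.eventTimeMillis; // hypothetical event-time field
                }
            });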

There are other users who assign a unique ID to every message going
through their system and only use idempotent operations (set operations)
within Flink, because messages are sometimes already duplicated before
reaching the stream processor. For downstream jobs whose upstream job
might duplicate records, this could be a viable, yet limiting, approach as
well.
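
As a rough sketch of that pattern (the Event type and its uniqueId field are
hypothetical): a flag in keyed state makes the "accept this message" decision
idempotent, so replayed duplicates are simply dropped.

    import org.apache.flink.api.common.functions.RichFilterFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;

    DataStream<Event> deduplicated = events
            .keyBy(new KeySelector<Event, String>() {
                @Override
                public String getKey(Event event) {
                    return event.uniqueId; // hypothetical unique message ID
                }
            })
            .filter(new RichFilterFunction<Event>() {
                private transient ValueState<Boolean> seen;

                @Override
                public void open(Configuration parameters) {
                    // Keyed "seen" flag; this state grows with the number of distinct IDs
                    // and needs a TTL / cleanup strategy in practice.
                    seen = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("seen", Boolean.class));
                }

                @Override
                public boolean filter(Event event) throws Exception {
                    if (seen.value() != null) {
                        return false; // duplicate ID, drop it
                    }
                    seen.update(true);
                    return true;
                }
            });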

Hope this helps, and let me know what you think.

Cheers,

Konstantin







[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html#event-time-and-watermarks





On Thu, May 30, 2019 at 11:39 AM 徐涛 <ha...@gmail.com> wrote:

> Hi Experts,
>         In batch computing, there are products like Azkaban or Airflow to
> manage batch job dependencies. By using such a dependency management tool, we
> can build a large-scale system consisting of small jobs.
>         In stream processing, it is not practical to put all dependencies
> in one job, because it would make the job too complicated and the state too
> large. I want to build a large-scale realtime system consisting of many
> Kafka sources and many streaming jobs, but the first thing I have to figure
> out is how to build the dependencies and connections between streaming jobs.
>         The only method I can think of is using a self-implemented retract
> Kafka sink, with each streaming job connected by a Kafka topic. But because
> each job may fail and retry, the messages in the Kafka topic may look like
> this:
>         { "retract": "false", "id": "1", "amount": 100 }
>         { "retract": "false", "id": "2", "amount": 200 }
>         { "retract": "true",  "id": "1", "amount": 100 }
>         { "retract": "true",  "id": "2", "amount": 200 }
>         { "retract": "false", "id": "1", "amount": 100 }
>         { "retract": "false", "id": "2", "amount": 200 }
>         If the topic is "topic_1", the SQL in the downstream job may look
> like this:
>                 select
>                         id, latest(amount)
>                 from topic_1
>                 where retract = 'false'
>                 group by id
>
>         But it will also build up big state, because every id is being grouped.
>         I wonder whether using Kafka to connect streaming jobs is a viable
> approach, and how to build a large-scale realtime system consisting of many
> streaming jobs?
>
>         Thanks a lot.
>
> Best
> Henry



-- 

Konstantin Knauf | Solutions Architect

+49 160 91394525


Planned Absences: 7.6.2019, 20. - 21.06.2019


<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

Re: How to build dependencies and connections between stream jobs?

Posted by 徐涛 <ha...@gmail.com>.
Hi Knauf,
        The solution I can think of to coordinate different stream jobs is:
        For example, there are two streaming jobs, Job_1 and Job_2:
        Job_1: receives data from the original Kafka topic (TOPIC_ORIG, for example) and sinks the data to another Kafka topic (TOPIC_JOB_1_SINK, for example). It should be mentioned that: ① I implement a retract Kafka sink; ② I do not use the Kafka exactly-once sink; ③ every record in TOPIC_JOB_1_SINK has one unique key; ④ each record with the same key is sent to the same Kafka partition (see the sketch below).
        Job_2: receives data from TOPIC_JOB_1_SINK, first groups by the unique key and takes the latest value, then goes on with the logic of Job_2, and finally sinks the data to the final sink (ES, HBase or MySQL, for example).
               Here I group by the unique key first because Job_1 may fail and retry, so some dirty data may be included in TOPIC_JOB_1_SINK.
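
        For ④, a minimal sketch of what I mean, based on the 1.8-era KeyedSerializationSchema interface (the SinkRecord type and its fields are made up): the unique key is attached as the Kafka record key, so that, if no custom Flink partitioner is configured, Kafka's default key-hash partitioning sends records with the same key to the same partition.

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;

    public class UniqueKeySchema implements KeyedSerializationSchema<SinkRecord> {

        @Override
        public byte[] serializeKey(SinkRecord record) {
            // Kafka partitions by the hash of this key when no custom partitioner is set.
            return record.uniqueKey.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public byte[] serializeValue(SinkRecord record) {
            return record.toJson().getBytes(StandardCharsets.UTF_8); // hypothetical toJson()
        }

        @Override
        public String getTargetTopic(SinkRecord record) {
            return null; // fall back to the producer's default topic, e.g. TOPIC_JOB_1_SINK
        }
    }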

        So from the overview:

        Job_1:  TOPIC_ORIG -> Logic_Job_1 -> TOPIC_JOB_1_SINK
        Job_2:  TOPIC_JOB_1_SINK -> GROUP_BY_UNIQUE_KEY_GET_LATEST -> Logic_Job_2 -> FINAL_JOB_2_SINK
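
        A minimal sketch of the GROUP_BY_UNIQUE_KEY_GET_LATEST step in Job_2 (the Record type, its fields and the change check are assumptions): keep the latest accepted value per unique key in keyed state and only forward records that actually change it, so records replayed by Job_1 are not processed twice by Logic_Job_2.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.util.Collector;

    DataStream<Record> latestPerKey = fromTopicJob1Sink        // DataStream<Record> read from TOPIC_JOB_1_SINK
            .filter(record -> !record.retract)                 // drop retraction markers
            .keyBy(new KeySelector<Record, String>() {
                @Override
                public String getKey(Record record) {
                    return record.id;
                }
            })
            .flatMap(new RichFlatMapFunction<Record, Record>() {
                private transient ValueState<Record> latest;

                @Override
                public void open(Configuration parameters) {
                    // As with the SQL version, this state grows with the number of keys
                    // and needs a TTL / cleanup strategy.
                    latest = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("latest", Record.class));
                }

                @Override
                public void flatMap(Record record, Collector<Record> out) throws Exception {
                    Record previous = latest.value();
                    if (previous == null || previous.amount != record.amount) {
                        latest.update(record);
                        out.collect(record); // only forward real changes to Logic_Job_2
                    }
                }
            });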


        Would you please help review this solution? If there is a better approach, kindly let me know. Thank you.


Best 
Henry
