You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Rahul Raj <ra...@gmail.com> on 2017/09/22 08:46:51 UTC

Clarifications on FLINK-KAFKA consumer

Hi,

I have just started working with FLINK and I am working on a project which
involves reading KAFKA data and processing it. Following questions came to
my mind:

1. Will FLink-Kafka consumer 0.8x run on multiple task slots or a single
task slot? Basically I mean if its going to be a parallel operation or a
non parallel operation?

2. If its a parallel operation, then do multiple task slots read data from
single kafka partition or multiple kafka partition?

3. If data is read from multiple Kafka partition, then how duplication is
avoided? Is it done from KAFKA or by FLINK?

Rahul Raj

Re: Clarifications on FLINK-KAFKA consumer

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.
Hi Rahul!

1. Will FLink-Kafka consumer 0.8x run on multiple task slots or a single task slot? Basically I mean if its going to be a parallel operation or a non parallel operation?
Yes, the FlinkKafkaConsumer is a parallel consumer.

2. If its a parallel operation, then do multiple task slots read data from single kafka partition or multiple kafka partition?
Each single parallel instance of a FlinkKafkaConsumer source can subscribe to multiple Kafka partitions. Each Kafka partition is handled by exactly one FlinkKafkaConsumer parallel instance.

3. If data is read from multiple Kafka partition, then how duplication is avoided? Is it done from KAFKA or by FLINK?
Yes, the FlinkKafkaConsumer is a parallel consumer.I’m not sure exactly what you are referring to by “duplication” here. Do you mean duplication in the data itself in the Kafka topics, or duplicated consumption by Flink?
If it is the former: prior to Kafka 0.11, Kafka writes did not support transactions and therefore can only have at-least-once writes.
If you mean the latter: the FlinkKafkaConsumer achieves exactly-once guarantees when consuming from Kafka topics using Flink’s checkpointing mechanism. You can read about that here [1][2].
Hope the pointers help!

- Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance

On 22 September 2017 at 10:46:55 AM, Rahul Raj (rahulrajmsrit@gmail.com) wrote:

Hi,

I have just started working with FLINK and I am working on a project which involves reading KAFKA data and processing it. Following questions came to my mind:

1. Will FLink-Kafka consumer 0.8x run on multiple task slots or a single task slot? Basically I mean if its going to be a parallel operation or a non parallel operation?

2. If its a parallel operation, then do multiple task slots read data from single kafka partition or multiple kafka partition?

3. If data is read from multiple Kafka partition, then how duplication is avoided? Is it done from KAFKA or by FLINK?

Rahul Raj