Posted to user@beam.apache.org by Demin Alexey <di...@gmail.com> on 2016/11/15 12:19:54 UTC

KafkaIO for HA jobs

Hi

I have some questions about the Kafka connector:
1) In the code I can see *consumer.assign(source.assignedPartitions);*. As
a result, two clients with the same group name can't use Kafka's built-in
load balancing.
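
For reference, a minimal sketch of the two consumer APIs in plain
kafka-clients (broker address, topic, partition and group names are made
up):

  import java.util.Arrays;
  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.common.TopicPartition;

  Properties props = new Properties();
  props.put("bootstrap.servers", "localhost:9092");
  props.put("group.id", "my-group");
  props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer");
  props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer");
  KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

  // assign(): static assignment that bypasses the group coordinator, so
  // two clients with the same group.id are NOT balanced against each other
  consumer.assign(Arrays.asList(new TopicPartition("my-topic", 0)));

  // subscribe(): partitions are balanced across all live consumers sharing
  // the same group.id, with automatic rebalancing (a consumer uses one
  // mode or the other, never both)
  // consumer.subscribe(Collections.singletonList("my-topic"));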

2) The other problem is how to resume reading from Kafka after a restart.
In the past we relied on Kafka's built-in mechanism for storing offsets in
Kafka/ZooKeeper, but with KafkaIO I couldn't find how to get this behavior.
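
With the plain consumer, "resume after restart" comes from offsets
committed under the group.id; roughly, continuing the properties from the
sketch above (values illustrative):

  props.put("enable.auto.commit", "true");      // commit offsets periodically
  props.put("auto.commit.interval.ms", "5000");
  props.put("auto.offset.reset", "earliest");   // only used when no
                                                // committed offset exists
  // On restart, a subscribe()-based consumer with the same group.id
  // resumes from the last committed offset.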


Additional clarification about our processing:
sometimes we need two separate jobs with the same group.id for HA; if one
job is killed, the second job can start handling messages after Kafka
rebalances.

Could you help me work out how to meet this requirement with Beam?

Thanks
Alexey Diomin

Re: KafkaIO for HA jobs

Posted by Raghu Angadi <ra...@google.com>.
I see. HA in Beam, as in generic distributed processing platforms like
Hadoop, is typically provided within a job; users shouldn't have to run two
jobs.

A simple Kafka consumer app, a Kafka Streams app, and a generic distributed
streaming app (Spark, Flink, etc.) all provide different abstractions, and
each comes with different trade-offs.

E.g. a Beam app running on Dataflow would claim to provide this level of
HA without requiring you to run multiple instances of it: it handles
machine failures, upscaling and downscaling as the load changes, and so on.

Raghu.

On Tue, Nov 15, 2016 at 2:46 PM, Demin Alexey <di...@gmail.com> wrote:

> I have a Kafka topic with 10 partitions (for example) and 2 jobs with the
> same group.id; Kafka balances 5 partitions to each reader.
>
> If one job is killed, Kafka rebalances all 10 partitions to the second
> reader and processing keeps working without downtime. After the first job
> restarts, each reader goes back to reading 5 partitions.
>
> A very simple but stable way to get HA in production.
>
> P.S. Kafka handles rebalancing, heartbeats, and connection management
> internally, without any magic in the client application.
>
> 2016-11-16 1:56 GMT+04:00 Raghu Angadi <ra...@google.com>:
>
>>
>> On Tue, Nov 15, 2016 at 1:50 PM, Demin Alexey <di...@gmail.com> wrote:
>>
>>> Two separate jobs with the same code and the same group.id; if one job
>>> is killed, the second job can start handling messages after Kafka
>>> rebalances.
>>>
>>
>> Do you want to use the same group.id for both? So you don't want the
>> second job to consume until the first one fails?
>>
>> Can you explain how this works generally (outside the Beam context)? If
>> you have two running processes using the same group.id, I would think
>> both of them read part of the stream, right? I mean, what stops the
>> second job until the first one is killed?
>>
>> Raghu.
>>
>
>

Re: KafkaIO for HA jobs

Posted by Demin Alexey <di...@gmail.com>.
I have a Kafka topic with 10 partitions (for example) and 2 jobs with the
same group.id; Kafka balances 5 partitions to each reader.

If one job is killed, Kafka rebalances all 10 partitions to the second
reader and processing keeps working without downtime. After the first job
restarts, each reader goes back to reading 5 partitions.

A very simple but stable way to get HA in production.

P.S. Kafka handles rebalancing, heartbeats, and connection management
internally, without any magic in the client application.
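
A sketch of the loop each job runs (plain kafka-clients; the process()
helper is made up). Start two copies with the same group.id to get the 5/5
split; kill one and the group coordinator moves all 10 partitions to the
survivor:

  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  Properties props = new Properties();
  props.put("bootstrap.servers", "localhost:9092");
  props.put("group.id", "my-group");  // identical in both jobs
  props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer");
  props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer");

  KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
  consumer.subscribe(Collections.singletonList("my-topic"));
  while (true) {
    // poll() also drives the group protocol: heartbeats and rebalances
    ConsumerRecords<String, String> records = consumer.poll(1000);
    for (ConsumerRecord<String, String> record : records) {
      process(record);  // application logic (hypothetical)
    }
  }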

2016-11-16 1:56 GMT+04:00 Raghu Angadi <ra...@google.com>:

>
> On Tue, Nov 15, 2016 at 1:50 PM, Demin Alexey <di...@gmail.com> wrote:
>
>> Two separate jobs with the same code and the same group.id; if one job
>> is killed, the second job can start handling messages after Kafka
>> rebalances.
>>
>
> Do you want to use the same group.id for both? So you don't want the
> second job to consume until the first one fails?
>
> Can you explain how this works generally (outside the Beam context)? If
> you have two running processes using the same group.id, I would think
> both of them read part of the stream, right? I mean, what stops the
> second job until the first one is killed?
>
> Raghu.
>

Re: KafkaIO for HA jobs

Posted by Raghu Angadi <ra...@google.com>.
On Tue, Nov 15, 2016 at 1:50 PM, Demin Alexey <di...@gmail.com> wrote:

> Two separate jobs with the same code and the same group.id; if one job is
> killed, the second job can start handling messages after Kafka rebalances.
>

Do you want to use the same group.id for both? So you don't want the
second job to consume until the first one fails?

Can you explain how this works generally (outside the Beam context)? If
you have two running processes using the same group.id, I would think both
of them read part of the stream, right? I mean, what stops the second job
until the first one is killed?

Raghu.

Re: KafkaIO for HA jobs

Posted by Demin Alexey <di...@gmail.com>.
Hi, Raghu

Thanks for the clarification.

But how can I implement a solution for HA?
Two separate jobs with the same code and the same group.id; if one job is
killed, the second job can start handling messages after Kafka rebalances.

Thanks
Alexey Diomin




2016-11-15 21:31 GMT+04:00 Raghu Angadi <ra...@google.com>:

> Demin,
>
> 1) KafkaIO does not depend on the consumer group.id at all, so you can
> have multiple parallel pipelines reading from the same topic. Note
> that consumer.assign(partitions) does not need a group id; only
> consumer.subscribe() does.
>
> Sometimes users might want to set a group.id for various reasons, e.g. if
> you want to monitor consumption for a specific group.id externally. You
> can set a group.id if you want.
>
> 2) Beam applications store the consumption state internally in their
> checkpoint (sort of like Kafka storing offsets for a consumer group). That
> said, 'restart' support from different runners (Flink, Spark, etc.) might
> be a bit different. Please specify your setup and we can confirm.
>
> Raghu.
>
>
>
> On Tue, Nov 15, 2016 at 4:19 AM, Demin Alexey <di...@gmail.com> wrote:
>
>> Hi
>>
>> I have some questions about the Kafka connector:
>> 1) In the code I can see *consumer.assign(source.assignedPartitions);*.
>> As a result, two clients with the same group name can't use Kafka's
>> built-in load balancing.
>>
>> 2) The other problem is how to resume reading from Kafka after a
>> restart. In the past we relied on Kafka's built-in mechanism for storing
>> offsets in Kafka/ZooKeeper, but with KafkaIO I couldn't find how to get
>> this behavior.
>>
>>
>> Additional clarification about our processing:
>> sometimes we need two separate jobs with the same group.id for HA; if
>> one job is killed, the second job can start handling messages after
>> Kafka rebalances.
>>
>> Could you help me work out how to meet this requirement with Beam?
>>
>> Thanks
>> Alexey Diomin
>>
>>
>

Re: KafkaIO for HA jobs

Posted by Raghu Angadi <ra...@google.com>.
Demin,

1) KafkaIO does not depend on the consumer group.id at all, so you can have
multiple parallel pipelines reading from the same topic. Note
that consumer.assign(partitions) does not need a group id; only
consumer.subscribe() does.

Sometimes users might want to set a group.id for various reasons, e.g. if
you want to monitor consumption for a specific group.id externally. You can
set a group.id if you want.
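
For example, a rough sketch with KafkaIO (builder method names vary across
Beam versions, and the broker/topic/group names are made up):

  import com.google.common.collect.ImmutableMap;
  import org.apache.beam.sdk.io.kafka.KafkaIO;
  import org.apache.kafka.clients.consumer.ConsumerConfig;
  import org.apache.kafka.common.serialization.StringDeserializer;

  pipeline.apply(KafkaIO.<String, String>read()
      .withBootstrapServers("broker1:9092")
      .withTopic("my-topic")
      .withKeyDeserializer(StringDeserializer.class)
      .withValueDeserializer(StringDeserializer.class)
      // Passed through to the underlying consumer config; KafkaIO itself
      // ignores group.id, but external lag monitoring can key off it.
      .updateConsumerProperties(ImmutableMap.<String, Object>of(
          ConsumerConfig.GROUP_ID_CONFIG, "my-group")));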

2) Beam applications store the consumption state internally in their
checkpoint (sort of like Kafka storing offsets for a consumer group). That
said, 'restart' support from different runners (Flink, Spark, etc.) might be
a bit different. Please specify your setup and we can confirm.
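
For instance, with the Flink runner a sketch could look like this (the
checkpointingInterval option comes from the Flink runner's pipeline
options; exact names may differ by version):

  import org.apache.beam.runners.flink.FlinkPipelineOptions;
  import org.apache.beam.runners.flink.FlinkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  FlinkPipelineOptions options =
      PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
  options.setRunner(FlinkRunner.class);
  // Checkpoint every 60s; after a restart from a checkpoint/savepoint the
  // pipeline resumes from the Kafka offsets recorded in that checkpoint.
  options.setCheckpointingInterval(60000L);
  Pipeline pipeline = Pipeline.create(options);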

Raghu.



On Tue, Nov 15, 2016 at 4:19 AM, Demin Alexey <di...@gmail.com> wrote:

> Hi
>
> I have some questions about the Kafka connector:
> 1) In the code I can see *consumer.assign(source.assignedPartitions);*.
> As a result, two clients with the same group name can't use Kafka's
> built-in load balancing.
>
> 2) The other problem is how to resume reading from Kafka after a restart.
> In the past we relied on Kafka's built-in mechanism for storing offsets
> in Kafka/ZooKeeper, but with KafkaIO I couldn't find how to get this
> behavior.
>
>
> Additional clarification about our processing:
> sometimes we need two separate jobs with the same group.id for HA; if one
> job is killed, the second job can start handling messages after Kafka
> rebalances.
>
> Could you help me work out how to meet this requirement with Beam?
>
> Thanks
> Alexey Diomin
>
>