Posted to users@kafka.apache.org by "Bae, Jae Hyeon" <me...@gmail.com> on 2014/08/17 20:41:00 UTC

ZkClient bug can bring down broker/consumer on zookeeper push in EC2 environment

We recently found a serious ZkClient bug (in fact a bug in the underlying
Apache ZooKeeper client) that can bring down brokers and consumers during
a ZooKeeper push.

We're running Kafka and ZooKeeper in an AWS EC2 environment. Each ZooKeeper
instance is bound to an EIP to give it a static hostname, which means that
even if the EC2 instance is terminated and replaced with a new one, it
keeps the same hostname, but the private IP that the hostname resolves to
can change.

The scenario is this: if we do a rolling push of all ZooKeeper server
instances, terminating each one and waiting until its replacement joins
the quorum before moving on to the next, ZkClient will eventually try to
connect to the old IP addresses, which no longer exist, because of DNS
caching on the Apache ZooKeeper client side. Please refer to
https://issues.apache.org/jira/browse/ZOOKEEPER-338
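
To illustrate the failure mode described above, here is a minimal Python
sketch. It is not ZooKeeper's or ZkClient's actual code. It contrasts a
client that resolves the hostname once at startup with one that resolves it
again on every connection attempt. The class names, the `lookup` function,
the hostname, and the fake DNS table are all hypothetical stand-ins for a
real DNS lookup.

```python
from typing import Callable, Dict


class CachingClient:
    """Resolves the hostname once, at construction time, and reuses the IP."""

    def __init__(self, hostname: str, resolve: Callable[[str], str]) -> None:
        self.hostname = hostname
        self._ip = resolve(hostname)  # looked up once, never refreshed

    def connect_target(self) -> str:
        return self._ip


class ReResolvingClient:
    """Resolves the hostname again on every connection attempt."""

    def __init__(self, hostname: str, resolve: Callable[[str], str]) -> None:
        self.hostname = hostname
        self._resolve = resolve

    def connect_target(self) -> str:
        return self._resolve(self.hostname)


# Hypothetical DNS table standing in for a real resolver.
fake_dns: Dict[str, str] = {"zk1.example.com": "10.0.0.1"}


def lookup(hostname: str) -> str:
    return fake_dns[hostname]


caching = CachingClient("zk1.example.com", lookup)
fresh = ReResolvingClient("zk1.example.com", lookup)

# The instance is replaced: same hostname, new private IP.
fake_dns["zk1.example.com"] = "10.0.0.2"

print(caching.connect_target())  # 10.0.0.1, the address that no longer exists
print(fresh.connect_target())    # 10.0.0.2, the replacement instance
```

The first client keeps trying the stale address until the process is
restarted, which matches the behaviour reported here.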

So we have to restart the Kafka brokers and consumers to refresh the DNS
cache. To solve this problem, I sent the following pull request to ZkClient:
https://github.com/sgroschupf/zkclient/pull/26

Please review the above PR. If a new version of ZkClient with this fix is
not released in time for the Kafka 0.8.2 release, I'd like Kafka to ship an
internally built ZkClient with the fix. I would really appreciate it.

Thank you
Best, Jae

Re: ZkClient bug can bring down broker/consumer on zookeeper push in EC2 environment

Posted by "Bae, Jae Hyeon" <me...@gmail.com>.
I am not sure how many companies run ZooKeeper the way we do, and I haven't
seen any other company terminate all of its EC2 instances to update a
ZooKeeper cluster. But in the worst case, ZooKeeper EC2 instances can be
replaced before the Kafka brokers and consumers are restarted, so sharing
the workaround might be useful for other companies.

The solution is to put an entry for ZooKeeper's EIP into the /etc/hosts
file on the Kafka broker or consumer instances. For example, if ZooKeeper's
EIP is 33.44.55.66 and its public hostname is
ec2-33-44-55-66.compute-1.amazonaws.com, adding the following entry
prevents the ZooKeeper client from caching a private IP address that will
later become invalid.

33.44.55.66  ec2-33-44-55-66.compute-1.amazonaws.com

and the ZooKeeper connection string should use the public hostname:
ec2-33-44-55-66.compute-1.amazonaws.com:2181
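
As a rough sketch of how the entry and connection string above could be
generated, the following Python helpers derive both from an EIP. The
function names are hypothetical, and the default `compute-1.amazonaws.com`
suffix is simply taken from the example in this message; other regions use
a different suffix, so treat it as an assumption.

```python
from typing import Iterable

# Assumption: the suffix used in the example above. It differs per region.
DEFAULT_SUFFIX = "compute-1.amazonaws.com"


def public_hostname(eip: str, suffix: str = DEFAULT_SUFFIX) -> str:
    """Build the EC2-style public hostname for an Elastic IP."""
    return "ec2-" + eip.replace(".", "-") + "." + suffix


def hosts_entry(eip: str, suffix: str = DEFAULT_SUFFIX) -> str:
    """Build the /etc/hosts line that pins the public hostname to the EIP."""
    return f"{eip}  {public_hostname(eip, suffix)}"


def connection_string(eips: Iterable[str], port: int = 2181,
                      suffix: str = DEFAULT_SUFFIX) -> str:
    """Build a comma-separated ZooKeeper connection string of host:port pairs."""
    return ",".join(f"{public_hostname(eip, suffix)}:{port}" for eip in eips)


print(hosts_entry("33.44.55.66"))
print(connection_string(["33.44.55.66"]))
```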

If Kafka runs in a different security group from ZooKeeper, each Kafka
instance's public IP address must be added to the ingress permission list
of the ZooKeeper security group. This can be a bit of a burden, because
automating the process requires additional packaging.
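
One possible way to automate the ingress step, sketched in Python under the
assumption that boto3 is available and credentials are already configured.
This is not what we actually run. The function names, the security group ID,
and the IP address are hypothetical placeholders; port 2181 comes from the
connection string above.

```python
from typing import Any, Dict

ZK_CLIENT_PORT = 2181  # port used in the connection string above


def ingress_request(group_id: str, kafka_public_ip: str,
                    port: int = ZK_CLIENT_PORT) -> Dict[str, Any]:
    """Build the parameters that allow one Kafka host to reach ZooKeeper."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                # /32 restricts the rule to this single public address.
                "IpRanges": [{"CidrIp": f"{kafka_public_ip}/32"}],
            }
        ],
    }


def authorize(group_id: str, kafka_public_ip: str) -> None:
    """Apply the rule through the EC2 API (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the helper above runs without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        **ingress_request(group_id, kafka_public_ip)
    )


# Hypothetical values, for illustration only.
request = ingress_request("sg-0123456789abcdef0", "203.0.113.10")
print(request)
```

Each Kafka instance would have to call `authorize` with its own public IP at
startup, which is the extra packaging mentioned above.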

On Tue, Sep 16, 2014 at 9:10 AM, Neha Narkhede <ne...@gmail.com>
wrote:

> Would you mind sharing your workaround with the community?

Re: ZkClient bug can bring down broker/consumer on zookeeper push in EC2 environment

Posted by Neha Narkhede <ne...@gmail.com>.
Would you mind sharing your workaround with the community?

On Mon, Sep 15, 2014 at 10:17 PM, Bae, Jae Hyeon <me...@gmail.com> wrote:

> The above pull request didn't work perfectly. After a lot of testing and
> experimentation, we decided that fixing ZkClient itself isn't easy, so we
> decided to go with another workaround.
>
> We're hoping that ZooKeeper 3.5.0, which includes the feature to refresh
> connections, will be stabilized soon and that a future version of Kafka
> will ship with that ZooKeeper version.

Re: ZkClient bug can bring down broker/consumer on zookeeper push in EC2 environment

Posted by "Bae, Jae Hyeon" <me...@gmail.com>.
The above pull request didn't work perfectly. After a lot of testing and
experimentation, we decided that fixing ZkClient itself isn't easy, so we
decided to go with another workaround.

We're hoping that ZooKeeper 3.5.0, which includes the feature to refresh
connections, will be stabilized soon and that a future version of Kafka
will ship with that ZooKeeper version.

On Mon, Sep 15, 2014 at 8:42 PM, Neha Narkhede <ne...@gmail.com>
wrote:

> Thanks for reporting this issue. Agreed that this is a problem for Kafka
> users on AWS. Could you please open a JIRA so we can keep track of this?

Re: ZkClient bug can bring down broker/consumer on zookeeper push in EC2 environment

Posted by Neha Narkhede <ne...@gmail.com>.
Thanks for reporting this issue. Agreed that this is a problem for Kafka
users on AWS. Could you please open a JIRA so we can keep track of this?

On Sun, Aug 17, 2014 at 11:41 AM, Bae, Jae Hyeon <me...@gmail.com> wrote:

> We recently found a serious ZkClient bug (in fact a bug in the underlying
> Apache ZooKeeper client) that can bring down brokers and consumers during
> a ZooKeeper push.
>
> We're running Kafka and ZooKeeper in an AWS EC2 environment. Each ZooKeeper
> instance is bound to an EIP to give it a static hostname, which means that
> even if the EC2 instance is terminated and replaced with a new one, it
> keeps the same hostname, but the private IP that the hostname resolves to
> can change.
>
> The scenario is this: if we do a rolling push of all ZooKeeper server
> instances, terminating each one and waiting until its replacement joins
> the quorum before moving on to the next, ZkClient will eventually try to
> connect to the old IP addresses, which no longer exist, because of DNS
> caching on the Apache ZooKeeper client side. Please refer to
> https://issues.apache.org/jira/browse/ZOOKEEPER-338
>
> So we have to restart the Kafka brokers and consumers to refresh the DNS
> cache. To solve this problem, I sent the following pull request to ZkClient:
> https://github.com/sgroschupf/zkclient/pull/26
>
> Please review the above PR. If a new version of ZkClient with this fix is
> not released in time for the Kafka 0.8.2 release, I'd like Kafka to ship an
> internally built ZkClient with the fix. I would really appreciate it.
>
> Thank you
> Best, Jae
>