Posted to dev@kafka.apache.org by Jun Rao <ju...@confluent.io> on 2017/10/27 22:42:18 UTC

[DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Hi, Everyone,

We created "KIP-217: Expose a timeout to allow an expired ZK session to be
re-created".

https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created

Please take a look and provide your feedback.

Thanks,

Jun

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Stephane Maarek <st...@simplemachines.com.au>.

Thanks Jun for the clarification.

It sounds like this KIP is complementary to ZOOKEEPER-2184 and can move
forward without it. We should still push hard for ZOOKEEPER-2184 to go
through (I saw you commented on it earlier).

LGTM!

On 2 Nov. 2017 12:34 pm, "Jun Rao" <ju...@confluent.io> wrote:

> Hi, Stephane,
>
> 3) The difference is that currently, there is no retry when re-creating the
> Zookeeper object when a ZK session expires. So, if the re-creation of
> Zookeeper fails, the broker just logs the error and the Zookeeper object
> will never be created again. With this KIP, we will keep retrying the
> creation of Zookeeper until success.
>
> Thanks,
>
> Jun
>
> On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
>
> > Hi Jun,
> >
> > Thanks for the reply.
> >
> > 1) The reason I'm asking about it is I wonder if it's not worth focusing
> > the development efforts on taking ownership of the existing PR (
> > https://github.com/apache/zookeeper/pull/150)  to fix ZOOKEEPER-2184,
> > rebase it and have it merged into the ZK codebase shortly.  I feel this
> > KIP might introduce a setting that could be deprecated shortly and
> > confuse the end user a bit further with one more knob to turn.
> >
> > 3) I'm not sure if I fully understand, sorry for the beginner's question:
> > if the default timeout is infinite, then it won't change anything to how
> > Kafka works from today, does it? (unless I'm missing something sorry). If
> > not set to infinite, then we introduce the risk of a whole cluster
> > shutting down at once?
> >
> > Thanks,
> > Stephane
> >
> > On 31/10/17, 1:00 pm, "Jun Rao" <ju...@confluent.io> wrote:
> >
> >     Hi, Stephane,
> >
> >     Thanks for the reply.
> >
> >     1) Fixing the issue in ZK will be ideal. Not sure when it will happen
> >     though. Once it's fixed, we can probably deprecate this config.
> >
> >     2) That could be useful. Is there a java api to do that at runtime?
> >     Also, invalidating DNS cache doesn't always fix the issue of
> >     unresolved host. In some of the cases, human intervention is needed.
> >
> >     3) The default timeout is infinite though.
> >
> >     Jun
> >
> >
> >     On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
> >     stephane@simplemachines.com.au> wrote:
> >
> >     > Hi Jun,
> >     >
> >     > I think this is very helpful. Restarting Kafka brokers in case of
> >     > zookeeper host change is not a well known operation.
> >     >
> >     > Few questions:
> >     > 1) would it not be worth fixing the problem at the source ? This
> >     > has been stuck for a while though, maybe a little push would help :
> >     > https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
> >     >
> >     > 2) upon recreating the zookeeper object , is it not possible to
> >     > invalidate the DNS cache so that it resolves the new hostname?
> >     >
> >     > 3) could the cluster be down in this situation: one migrates an
> >     > entire zookeeper cluster to new machines (one by one). The quorum is
> >     > still alive without downtime, but now every broker in a cluster
> >     > can't resolve zookeeper at the same time. They all shut down at the
> >     > same time after the new time-out setting.
> >     >
> >     > Thanks !
> >     > Stéphane
> >     >
> >     > On 28 Oct. 2017 9:42 am, "Jun Rao" <ju...@confluent.io> wrote:
> >     >
> >     > > Hi, Everyone,
> >     > >
> >     > > We created "KIP-217: Expose a timeout to allow an expired ZK
> >     > > session to be re-created".
> >     > >
> >     > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
> >     > >
> >     > > Please take a look and provide your feedback.
> >     > >
> >     > > Thanks,
> >     > >
> >     > > Jun
> >     > >
> >     >
> >
> >
> >
> >
>
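
On question 2 in the quoted exchange above (invalidating the DNS cache at runtime): I'm not aware of a public JDK call that flushes the resolver cache, but the JVM does expose cache lifetimes through the `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties. The sketch below only shows how those two properties could be set. `DnsCacheTtlSketch` and `configureDnsCaching` are hypothetical names and the TTL values are arbitrary. As Jun notes, a shorter TTL does not help when the host cannot be resolved at all.

```java
import java.security.Security;

// Hypothetical helper, not part of Kafka or ZooKeeper.
final class DnsCacheTtlSketch {
    private DnsCacheTtlSketch() {}

    // Sets how long the JVM caches successful and failed name lookups.
    // Assumption: this is called early in startup, before the first lookup,
    // since the JDK may read these properties only once.
    static void configureDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Lifetime of successful lookups; -1 means "cache forever".
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // Lifetime of failed lookups (unresolved hosts).
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

For example, `DnsCacheTtlSketch.configureDnsCaching(30, 5)` would ask the JVM to keep successful lookups for 30 seconds and failed ones for 5 seconds.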

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Ted Yu <yu...@gmail.com>.

The following JIRA provides some background on why upgrading immediately
following a new release may not be prudent (though I expect this to be rare):

ZOOKEEPER-2347

On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yu...@gmail.com> wrote:

> Stephane:
> bq. hasn't acted in over a year
>
> The above fact implies some reluctance from the zookeeper community to
> fully solve the issue (maybe due to technical issues).
> Anyway, we should plan on not relying on the fix to go through in the near
> future.
>
> As for Jun's latest suggestion, I think we should add periodic logging
> indicating the retry.
>
> A KIP is not needed if we go that route.
>
> Cheers
>
> On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
>
>> Hi Jun
>>
>> I think this is a better option. Would that change require a kip then as
>> it's not a change in public API ?
>>
>> @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems
>> that the owner of the pr hasn't acted in over a year and I think someone
>> needs to take ownership of that. Additionally, this would be a change in
>> Kafka zookeeper client dependency, so no need to update your zookeeper
>> quorum to benefit from the change
>>
>> Thanks
>> Stéphane
>>
>>
>> On 3 Nov. 2017 8:45 am, "Jun Rao" <ju...@confluent.io> wrote:
>>
>> Stephane, Jeff,
>>
>> Another option is to not expose the reconnect timeout config and just
>> retry
>> the creation of Zookeeper forever. This is an improvement from the current
>> situation and if zookeeper-2184 is fixed in the future, we don't need to
>> deprecate the config.
>>
>> Thanks,
>>
>> Jun
>>
>> On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
>> >
>> > I think adding the session recreation on Kafka side should benefit Kafka
>> > users, especially those who don't plan to move to 3.4.12+ in the near
>> > future.
>> >
>> > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > Hi, Stephane,
>> > >
>> > > 3) The difference is that currently, there is no retry when
>> re-creating
>> > the
>> > > Zookeeper object when a ZK session expires. So, if the re-creation of
>> > > Zookeeper fails, the broker just logs the error and the Zookeeper
>> object
>> > > will never be created again. With this KIP, we will keep retrying the
>> > > creation of Zookeeper until success.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
>> > > stephane@simplemachines.com.au> wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > Thanks for the reply.
>> > > >
>> > > > 1) The reason I'm asking about it is I wonder if it's not worth
>> > focusing
>> > > > the development efforts on taking ownership of the existing PR (
>> > > > https://github.com/apache/zookeeper/pull/150)  to fix
>> ZOOKEEPER-2184,
>> > > > rebase it and have it merged into the ZK codebase shortly.  I feel
>> this
>> > > KIP
>> > > > might introduce a setting that could be deprecated shortly and
>> confuse
>> > > the
>> > > > end user a bit further with one more knob to turn.
>> > > >
>> > > > 3) I'm not sure if I fully understand, sorry for the beginner's
>> > question:
>> > > > if the default timeout is infinite, then it won't change anything to
>> > how
>> > > > Kafka works from today, does it? (unless I'm missing something
>> sorry).
>> > If
>> > > > not set to infinite, then we introduce the risk of a whole cluster
>> > > shutting
>> > > > down at once?
>> > > >
>> > > > Thanks,
>> > > > Stephane
>> > > >
>> > > > On 31/10/17, 1:00 pm, "Jun Rao" <ju...@confluent.io> wrote:
>> > > >
>> > > >     Hi, Stephane,
>> > > >
>> > > >     Thanks for the reply.
>> > > >
>> > > >     1) Fixing the issue in ZK will be ideal. Not sure when it will
>> > happen
>> > > >     though. Once it's fixed, we can probably deprecate this config.
>> > > >
>> > > >     2) That could be useful. Is there a java api to do that at
>> runtime?
>> > > > Also,
>> > > >     invalidating DNS cache doesn't always fix the issue of
>> unresolved
>> > > > host. In
>> > > >     some of the cases, human intervention is needed.
>> > > >
>> > > >     3) The default timeout is infinite though.
>> > > >
>> > > >     Jun
>> > > >
>> > > >
>> > > >     On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
>> > > >     stephane@simplemachines.com.au> wrote:
>> > > >
>> > > >     > Hi Jun,
>> > > >     >
>> > > >     > I think this is very helpful. Restarting Kafka brokers in case
>> of
>> > > > zookeeper
>> > > >     > host change is not a well known operation.
>> > > >     >
>> > > >     > Few questions:
>> > > >     > 1) would it not be worth fixing the problem at the source ?
>> This
>> > > has
>> > > > been
>> > > >     > stuck for a while though, maybe a little push would help :
>> > > >     > https://issues.apache.org/jira/plugins/servlet/mobile#
>> > > > issue/ZOOKEEPER-2184
>> > > >     >
>> > > >     > 2) upon recreating the zookeeper object , is it not possible
>> to
>> > > > invalidate
>> > > >     > the DNS cache so that it resolves the new hostname?
>> > > >     >
>> > > >     > 3) could the cluster be down in this situation: one migrates
>> an
>> > > > entire
>> > > >     > zookeeper cluster to new machines (one by one). The quorum is
>> > still
>> > > > alive
>> > > >     > without downtime, but now every broker in a cluster can't
>> resolve
>> > > > zookeeper
>> > > >     > at the same time. They all shut down at the same time after
>> the
>> > new
>> > > >     > time-out setting.
>> > > >     >
>> > > >     > Thanks !
>> > > >     > Stéphane
>> > > >     >
>> > > >     > On 28 Oct. 2017 9:42 am, "Jun Rao" <ju...@confluent.io> wrote:
>> > > >     >
>> > > >     > > Hi, Everyone,
>> > > >     > >
>> > > >     > > We created "KIP-217: Expose a timeout to allow an expired ZK
>> > > > session to
>> > > >     > be
>> > > >     > > re-created".
>> > > >     > >
>> > > >     > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > >     > > 217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+
>> > > > to+be+re-created
>> > > >     > >
>> > > >     > > Please take a look and provide your feedback.
>> > > >     > >
>> > > >     > > Thanks,
>> > > >     > >
>> > > >     > > Jun
>> > > >     > >
>> > > >     >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jun Rao <ju...@confluent.io>.

Ok. Based on the discussion, it seems that doing infinite re-creation is
better. I will cancel the KIP.

Thanks,

Jun
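
A minimal sketch of the "retry forever" option settled on here, together with the periodic logging Ted suggested. This is not the actual Kafka change: `RetryForeverSketch`, `retryUntilSuccess`, the fixed back-off and the log cadence are all hypothetical, and the `Callable` stands in for whatever code re-creates the ZooKeeper client after a session expires.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not Kafka code: keeps calling `create` until it succeeds.
final class RetryForeverSketch {
    private RetryForeverSketch() {}

    static <T> T retryUntilSuccess(Callable<T> create, long backoffMs, int logEveryAttempts) {
        if (logEveryAttempts <= 0) {
            throw new IllegalArgumentException("logEveryAttempts must be positive");
        }
        long attempt = 0;
        while (true) {
            attempt++;
            try {
                return create.call(); // success ends the loop
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Do not swallow interruption: restore the flag and stop.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while creating client", e);
                }
                // Log the first failure, then only every N-th attempt,
                // so a long outage does not flood the log.
                if (attempt == 1 || attempt % logEveryAttempts == 0) {
                    System.err.println("Client creation failed (attempt " + attempt
                            + "), will keep retrying: " + e);
                }
            }
            try {
                Thread.sleep(backoffMs); // fixed back-off between attempts
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
        }
    }
}
```

There is no timeout parameter, which is the point of this option: nothing new is exposed as a config, so nothing needs deprecating if ZOOKEEPER-2184 is fixed later.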

On Thu, Nov 2, 2017 at 6:14 PM, Jeff Widman <je...@jeffwidman.com> wrote:

> +1 for permanent retry under the covers (without an exposed/later
> deprecated config).
>
> That said, I understand the reality that sometimes we have to workaround an
> unfixed issue in another project, so if you think best to expose a config,
> then I have no objections. Mainly I wanted to make sure you'd tried to get
> upstream to fix as that is almost always a cleaner solution.
>
> > The above fact implies some reluctance from the zookeeper community to
> fully
> solve the issue (maybe due to technical issues).
>
> @Ted - I spent some time a few months ago poking through issues on the ZK
> issue tracker, and it looked like there wasn't much activity on the project
> lately. So my guess is that it's less about problems with this particular
> solution, and more that the solution has just enough moving parts that no
> one with commit rights has had the time to review it. As a volunteer
> maintainer on a number of projects, I certainly empathize with them,
> although it would be nice to get some more committers onto the Zookeeper
> project who have the time to review some of these semi-abandoned PRs and
> either accept or reject them.
>
>
>
> On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Stephane:
> > bq. hasn't acted in over a year
> >
> > The above fact implies some reluctance from the zookeeper community to
> > fully solve the issue (maybe due to technical issues).
> > Anyway, we should plan on not relying on the fix to go through in the
> near
> > future.
> >
> > As for Jun's latest suggestion, I think we should add periodic logging
> > indicating the retry.
> >
> > A KIP is not needed if we go that route.
> >
> > Cheers
> >
> > On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
> > stephane@simplemachines.com.au> wrote:
> >
> > > Hi Jun
> > >
> > > I think this is a better option. Would that change require a kip then
> as
> > > it's not a change in public API ?
> > >
> > > @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems
> > > that the owner of the pr hasn't acted in over a year and I think
> someone
> > > needs to take ownership of that. Additionally, this would be a change
> in
> > > Kafka zookeeper client dependency, so no need to update your zookeeper
> > > quorum to benefit from the change
> > >
> > > Thanks
> > > Stéphane
> > >
> > >
> > > On 3 Nov. 2017 8:45 am, "Jun Rao" <ju...@confluent.io> wrote:
> > >
> > > Stephane, Jeff,
> > >
> > > Another option is to not expose the reconnect timeout config and just
> > retry
> > > the creation of Zookeeper forever. This is an improvement from the
> > current
> > > situation and if zookeeper-2184 is fixed in the future, we don't need
> to
> > > deprecate the config.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
> > > >
> > > > I think adding the session recreation on Kafka side should benefit
> > Kafka
> > > > users, especially those who don't plan to move to 3.4.12+ in the near
> > > > future.
> > > >
> > > > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Stephane,
> > > > >
> > > > > 3) The difference is that currently, there is no retry when
> > re-creating
> > > > the
> > > > > Zookeeper object when a ZK session expires. So, if the re-creation
> of
> > > > > Zookeeper fails, the broker just logs the error and the Zookeeper
> > > object
> > > > > will never be created again. With this KIP, we will keep retrying
> the
> > > > > creation of Zookeeper until success.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> > > > > stephane@simplemachines.com.au> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1) The reason I'm asking about it is I wonder if it's not worth
> > > > focusing
> > > > > > the development efforts on taking ownership of the existing PR (
> > > > > > https://github.com/apache/zookeeper/pull/150)  to fix
> > > ZOOKEEPER-2184,
> > > > > > rebase it and have it merged into the ZK codebase shortly.  I
> feel
> > > this
> > > > > KIP
> > > > > > might introduce a setting that could be deprecated shortly and
> > > confuse
> > > > > the
> > > > > > end user a bit further with one more knob to turn.
> > > > > >
> > > > > > 3) I'm not sure if I fully understand, sorry for the beginner's
> > > > question:
> > > > > > if the default timeout is infinite, then it won't change anything
> > to
> > > > how
> > > > > > Kafka works from today, does it? (unless I'm missing something
> > > sorry).
> > > > If
> > > > > > not set to infinite, then we introduce the risk of a whole
> cluster
> > > > > shutting
> > > > > > down at once?
> > > > > >
> > > > > > Thanks,
> > > > > > Stephane
> > > > > >
> > > > > > On 31/10/17, 1:00 pm, "Jun Rao" <ju...@confluent.io> wrote:
> > > > > >
> > > > > >     Hi, Stephane,
> > > > > >
> > > > > >     Thanks for the reply.
> > > > > >
> > > > > >     1) Fixing the issue in ZK will be ideal. Not sure when it
> will
> > > > happen
> > > > > >     though. Once it's fixed, we can probably deprecate this
> config.
> > > > > >
> > > > > >     2) That could be useful. Is there a java api to do that at
> > > runtime?
> > > > > > Also,
> > > > > >     invalidating DNS cache doesn't always fix the issue of
> > unresolved
> > > > > > host. In
> > > > > >     some of the cases, human intervention is needed.
> > > > > >
> > > > > >     3) The default timeout is infinite though.
> > > > > >
> > > > > >     Jun
> > > > > >
> > > > > >
> > > > > >     On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
> > > > > >     stephane@simplemachines.com.au> wrote:
> > > > > >
> > > > > >     > Hi Jun,
> > > > > >     >
> > > > > >     > I think this is very helpful. Restarting Kafka brokers in
> > case
> > > of
> > > > > > zookeeper
> > > > > >     > host change is not a well known operation.
> > > > > >     >
> > > > > >     > Few questions:
> > > > > >     > 1) would it not be worth fixing the problem at the source ?
> > > This
> > > > > has
> > > > > > been
> > > > > >     > stuck for a while though, maybe a little push would help :
> > > > > >     > https://issues.apache.org/jira/plugins/servlet/mobile#
> > > > > > issue/ZOOKEEPER-2184
> > > > > >     >
> > > > > >     > 2) upon recreating the zookeeper object , is it not
> possible
> > to
> > > > > > invalidate
> > > > > >     > the DNS cache so that it resolves the new hostname?
> > > > > >     >
> > > > > >     > 3) could the cluster be down in this situation: one
> migrates
> > an
> > > > > > entire
> > > > > >     > zookeeper cluster to new machines (one by one). The quorum
> is
> > > > still
> > > > > > alive
> > > > > >     > without downtime, but now every broker in a cluster can't
> > > resolve
> > > > > > zookeeper
> > > > > >     > at the same time. They all shut down at the same time after
> > the
> > > > new
> > > > > >     > time-out setting.
> > > > > >     >
> > > > > >     > Thanks !
> > > > > >     > Stéphane
> > > > > >     >
> > > > > >     > On 28 Oct. 2017 9:42 am, "Jun Rao" <ju...@confluent.io>
> wrote:
> > > > > >     >
> > > > > >     > > Hi, Everyone,
> > > > > >     > >
> > > > > >     > > We created "KIP-217: Expose a timeout to allow an expired
> > ZK
> > > > > > session to
> > > > > >     > be
> > > > > >     > > re-created".
> > > > > >     > >
> > > > > >     > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > >     > > 217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+
> > > > > > to+be+re-created
> > > > > >     > >
> > > > > >     > > Please take a look and provide your feedback.
> > > > > >     > >
> > > > > >     > > Thanks,
> > > > > >     > >
> > > > > >     > > Jun
> > > > > >     > >
> > > > > >     >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
>
> *Jeff Widman*
> jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
> <><
>

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jeff Widman <je...@jeffwidman.com>.

+1 for permanent retry under the covers (without an exposed/later
deprecated config).

That said, I understand the reality that sometimes we have to work around an
unfixed issue in another project, so if you think it best to expose a config,
then I have no objections. Mainly I wanted to make sure you'd tried to get
upstream to fix it, as that is almost always a cleaner solution.

> The above fact implies some reluctance from the zookeeper community to fully
> solve the issue (maybe due to technical issues).

@Ted - I spent some time a few months ago poking through issues on the ZK
issue tracker, and it looked like there wasn't much activity on the project
lately. So my guess is that it's less about problems with this particular
solution, and more that the solution has just enough moving parts that no
one with commit rights has had the time to review it. As a volunteer
maintainer on a number of projects, I certainly empathize with them,
although it would be nice to get some more committers onto the Zookeeper
project who have the time to review some of these semi-abandoned PRs and
either accept or reject them.



On Thu, Nov 2, 2017 at 3:00 PM, Ted Yu <yu...@gmail.com> wrote:

> Stephane:
> bq. hasn't acted in over a year
>
> The above fact implies some reluctance from the zookeeper community to
> fully solve the issue (maybe due to technical issues).
> Anyway, we should plan on not relying on the fix to go through in the near
> future.
>
> As for Jun's latest suggestion, I think we should add periodic logging
> indicating the retry.
>
> A KIP is not needed if we go that route.
>
> Cheers
>
> On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
> stephane@simplemachines.com.au> wrote:
>
> > Hi Jun
> >
> > I think this is a better option. Would that change require a kip then as
> > it's not a change in public API ?
> >
> > @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems
> > that the owner of the pr hasn't acted in over a year and I think someone
> > needs to take ownership of that. Additionally, this would be a change in
> > Kafka zookeeper client dependency, so no need to update your zookeeper
> > quorum to benefit from the change
> >
> > Thanks
> > Stéphane
> >
> >
> > On 3 Nov. 2017 8:45 am, "Jun Rao" <ju...@confluent.io> wrote:
> >
> > Stephane, Jeff,
> >
> > Another option is to not expose the reconnect timeout config and just
> retry
> > the creation of Zookeeper forever. This is an improvement from the
> current
> > situation and if zookeeper-2184 is fixed in the future, we don't need to
> > deprecate the config.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
> > >
> > > I think adding the session recreation on Kafka side should benefit
> Kafka
> > > users, especially those who don't plan to move to 3.4.12+ in the near
> > > future.
> > >
> > > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Stephane,
> > > >
> > > > 3) The difference is that currently, there is no retry when
> re-creating
> > > the
> > > > Zookeeper object when a ZK session expires. So, if the re-creation of
> > > > Zookeeper fails, the broker just logs the error and the Zookeeper
> > object
> > > > will never be created again. With this KIP, we will keep retrying the
> > > > creation of Zookeeper until success.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> > > > stephane@simplemachines.com.au> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 1) The reason I'm asking about it is I wonder if it's not worth
> > > focusing
> > > > > the development efforts on taking ownership of the existing PR (
> > > > > https://github.com/apache/zookeeper/pull/150)  to fix
> > ZOOKEEPER-2184,
> > > > > rebase it and have it merged into the ZK codebase shortly.  I feel
> > this
> > > > KIP
> > > > > might introduce a setting that could be deprecated shortly and
> > confuse
> > > > the
> > > > > end user a bit further with one more knob to turn.
> > > > >
> > > > > 3) I'm not sure if I fully understand, sorry for the beginner's
> > > question:
> > > > > if the default timeout is infinite, then it won't change anything
> to
> > > how
> > > > > Kafka works from today, does it? (unless I'm missing something
> > sorry).
> > > If
> > > > > not set to infinite, then we introduce the risk of a whole cluster
> > > > shutting
> > > > > down at once?
> > > > >
> > > > > Thanks,
> > > > > Stephane
> > > > >
> > > > > On 31/10/17, 1:00 pm, "Jun Rao" <ju...@confluent.io> wrote:
> > > > >
> > > > >     Hi, Stephane,
> > > > >
> > > > >     Thanks for the reply.
> > > > >
> > > > >     1) Fixing the issue in ZK will be ideal. Not sure when it will
> > > happen
> > > > >     though. Once it's fixed, we can probably deprecate this config.
> > > > >
> > > > >     2) That could be useful. Is there a java api to do that at
> > runtime?
> > > > > Also,
> > > > >     invalidating DNS cache doesn't always fix the issue of
> unresolved
> > > > > host. In
> > > > >     some of the cases, human intervention is needed.
> > > > >
> > > > >     3) The default timeout is infinite though.
> > > > >
> > > > >     Jun
> > > > >
> > > > >
> > > > >     On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
> > > > >     stephane@simplemachines.com.au> wrote:
> > > > >
> > > > >     > Hi Jun,
> > > > >     >
> > > > >     > I think this is very helpful. Restarting Kafka brokers in
> case
> > of
> > > > > zookeeper
> > > > >     > host change is not a well known operation.
> > > > >     >
> > > > >     > Few questions:
> > > > >     > 1) would it not be worth fixing the problem at the source ?
> > This
> > > > has
> > > > > been
> > > > >     > stuck for a while though, maybe a little push would help :
> > > > >     > https://issues.apache.org/jira/plugins/servlet/mobile#
> > > > > issue/ZOOKEEPER-2184
> > > > >     >
> > > > >     > 2) upon recreating the zookeeper object , is it not possible
> to
> > > > > invalidate
> > > > >     > the DNS cache so that it resolves the new hostname?
> > > > >     >
> > > > >     > 3) could the cluster be down in this situation: one migrates
> an
> > > > > entire
> > > > >     > zookeeper cluster to new machines (one by one). The quorum is
> > > still
> > > > > alive
> > > > >     > without downtime, but now every broker in a cluster can't
> > resolve
> > > > > zookeeper
> > > > >     > at the same time. They all shut down at the same time after
> the
> > > new
> > > > >     > time-out setting.
> > > > >     >
> > > > >     > Thanks !
> > > > >     > Stéphane
> > > > >     >
> > > > >     > On 28 Oct. 2017 9:42 am, "Jun Rao" <ju...@confluent.io> wrote:
> > > > >     >
> > > > >     > > Hi, Everyone,
> > > > >     > >
> > > > >     > > We created "KIP-217: Expose a timeout to allow an expired
> ZK
> > > > > session to
> > > > >     > be
> > > > >     > > re-created".
> > > > >     > >
> > > > >     > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >     > > 217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+
> > > > > to+be+re-created
> > > > >     > >
> > > > >     > > Please take a look and provide your feedback.
> > > > >     > >
> > > > >     > > Thanks,
> > > > >     > >
> > > > >     > > Jun
> > > > >     > >
> > > > >     >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>



-- 

*Jeff Widman*
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
<><

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Ted Yu <yu...@gmail.com>.

Stephane:
bq. hasn't acted in over a year

The above fact implies some reluctance from the zookeeper community to
fully solve the issue (maybe due to technical issues).
Anyway, we should plan on not relying on the fix to go through in the near
future.

As for Jun's latest suggestion, I think we should add periodic logging
indicating the retry.

A KIP is not needed if we go that route.

Cheers

On Thu, Nov 2, 2017 at 2:54 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Hi Jun
>
> I think this is a better option. Would that change require a kip then as
> it's not a change in public API ?
>
> @ted it was marked as a blocked for 3.4.11 but they pushed it. It seems
> that the owner of the pr hasn't acted in over a year and I think someone
> needs to take ownership of that. Additionally, this would be a change in
> Kafka zookeeper client dependency, so no need to update your zookeeper
> quorum to benefit from the change
>
> Thanks
> Stéphane
>
>
> On 3 Nov. 2017 8:45 am, "Jun Rao" <ju...@confluent.io> wrote:
>
> Stephane, Jeff,
>
> Another option is to not expose the reconnect timeout config and just retry
> the creation of Zookeeper forever. This is an improvement from the current
> situation and if zookeeper-2184 is fixed in the future, we don't need to
> deprecate the config.
>
> Thanks,
>
> Jun
>
> On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
> >
> > I think adding the session recreation on Kafka side should benefit Kafka
> > users, especially those who don't plan to move to 3.4.12+ in the near
> > future.
> >
> > On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Stephane,
> > >
> > > 3) The difference is that currently, there is no retry when re-creating
> > the
> > > Zookeeper object when a ZK session expires. So, if the re-creation of
> > > Zookeeper fails, the broker just logs the error and the Zookeeper
> object
> > > will never be created again. With this KIP, we will keep retrying the
> > > creation of Zookeeper until success.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
> > > stephane@simplemachines.com.au> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 1) The reason I'm asking about it is I wonder if it's not worth
> > focusing
> > > > the development efforts on taking ownership of the existing PR (
> > > > https://github.com/apache/zookeeper/pull/150)  to fix
> ZOOKEEPER-2184,
> > > > rebase it and have it merged into the ZK codebase shortly.  I feel
> this
> > > KIP
> > > > might introduce a setting that could be deprecated shortly and
> confuse
> > > the
> > > > end user a bit further with one more knob to turn.
> > > >
> > > > 3) I'm not sure if I fully understand, sorry for the beginner's
> > question:
> > > > if the default timeout is infinite, then it won't change anything to
> > how
> > > > Kafka works from today, does it? (unless I'm missing something
> sorry).
> > If
> > > > not set to infinite, then we introduce the risk of a whole cluster
> > > shutting
> > > > down at once?
> > > >
> > > > Thanks,
> > > > Stephane
> > >
> >
>

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Stephane Maarek <st...@simplemachines.com.au>.

Hi Jun

I think this is a better option. Would that change still require a KIP,
given that it's not a change to the public API?

@Ted, it was marked as a blocker for 3.4.11, but it got pushed back. It seems
that the owner of the PR hasn't acted on it in over a year, and I think someone
needs to take ownership of it. Additionally, this would be a change in
Kafka's ZooKeeper client dependency, so there is no need to upgrade your
ZooKeeper quorum to benefit from the change.

Thanks
Stéphane


On 3 Nov. 2017 8:45 am, "Jun Rao" <ju...@confluent.io> wrote:

> Stephane, Jeff,
>
> Another option is to not expose the reconnect timeout config and just retry
> the creation of Zookeeper forever. This is an improvement from the current
> situation and if zookeeper-2184 is fixed in the future, we don't need to
> deprecate the config.
>
> Thanks,
>
> Jun

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jun Rao <ju...@confluent.io>.

Stephane, Jeff,

Another option is to not expose the reconnect timeout config and just retry
the creation of the ZooKeeper object forever. This is an improvement over the
current situation, and if ZOOKEEPER-2184 is fixed in the future, there will be
no config to deprecate.

Thanks,

Jun

On Thu, Nov 2, 2017 at 9:02 AM, Ted Yu <yu...@gmail.com> wrote:

> ZOOKEEPER-2184 is scheduled for 3.4.12 whose release is unknown.
>
> I think adding the session recreation on Kafka side should benefit Kafka
> users, especially those who don't plan to move to 3.4.12+ in the near
> future.

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Ted Yu <yu...@gmail.com>.

ZOOKEEPER-2184 is scheduled for 3.4.12, whose release date is unknown.

I think adding the session recreation on Kafka side should benefit Kafka
users, especially those who don't plan to move to 3.4.12+ in the near
future.

On Wed, Nov 1, 2017 at 6:34 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Stephane,
>
> 3) The difference is that currently, there is no retry when re-creating the
> Zookeeper object when a ZK session expires. So, if the re-creation of
> Zookeeper fails, the broker just logs the error and the Zookeeper object
> will never be created again. With this KIP, we will keep retrying the
> creation of Zookeeper until success.
>
> Thanks,
>
> Jun

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jun Rao <ju...@confluent.io>.

Hi, Stephane,

3) The difference is that currently there is no retry when re-creating the
ZooKeeper object after a ZK session expires. So, if the re-creation of the
ZooKeeper object fails, the broker just logs the error and never tries to
create it again. With this KIP, we will keep retrying the creation of the
ZooKeeper object until it succeeds.
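
A minimal sketch of the difference described above, assuming a hypothetical
helper `createWithRetry` and a generic factory that stands in for constructing
the ZooKeeper client; neither name comes from Kafka or from the KIP. With the
timeout left at its infinite default the loop retries until creation succeeds;
with a finite timeout the last failure is rethrown so the caller can decide
what to do next (for example, shut the broker down).

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

class ZkSessionRetry {
    /** Sentinel for "retry forever", the default discussed in this thread. */
    static final long INFINITE = -1L;

    /**
     * Hypothetical helper: calls the factory until it succeeds, or rethrows the
     * last failure once another backoff would exceed timeoutMs.
     * The factory stands in for re-creating the ZooKeeper client object.
     */
    static <T> T createWithRetry(Callable<T> factory, long timeoutMs, long backoffMs)
            throws Exception {
        long start = System.currentTimeMillis();
        while (true) {
            try {
                return factory.call();
            } catch (Exception e) {
                long elapsed = System.currentTimeMillis() - start;
                if (timeoutMs != INFINITE && elapsed + backoffMs > timeoutMs) {
                    throw e; // finite timeout exhausted: give up
                }
                Thread.sleep(backoffMs); // wait before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulate two failed host lookups followed by a successful creation.
        String session = createWithRetry(() -> {
            if (++attempts[0] < 3) {
                throw new UnknownHostException("zk1.example.com");
            }
            return "session-" + attempts[0];
        }, INFINITE, 10L);
        System.out.println(session + " after " + attempts[0] + " attempts");
    }
}
```

Running `main` prints `session-3 after 3 attempts`. Passing a finite
`timeoutMs` instead of `INFINITE` corresponds to the configurable timeout
proposed in the KIP.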

Thanks,

Jun

On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Hi Jun,
>
> Thanks for the reply.
>
> 1) The reason I'm asking about it is I wonder if it's not worth focusing
> the development efforts on taking ownership of the existing PR (
> https://github.com/apache/zookeeper/pull/150)  to fix ZOOKEEPER-2184,
> rebase it and have it merged into the ZK codebase shortly.  I feel this KIP
> might introduce a setting that could be deprecated shortly and confuse the
> end user a bit further with one more knob to turn.
>
> 3) I'm not sure if I fully understand, sorry for the beginner's question:
> if the default timeout is infinite, then it won't change anything to how
> Kafka works from today, does it? (unless I'm missing something sorry). If
> not set to infinite, then we introduce the risk of a whole cluster shutting
> down at once?
>
> Thanks,
> Stephane

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Gwen Shapira <gw...@confluent.io>.

Fixing this in ZK won't be enough though. We'll need this included in a
stable release that we'll then bump Kafka's dependencies to include. I
doubt this KIP will be deprecated shortly even if the ZK bug is fixed
immediately.

On Tue, Oct 31, 2017 at 4:59 PM Jeff Widman <je...@jeffwidman.com> wrote:

> Agree with Stephane that it's worth at least taking a shot at trying to get
> ZOOKEEPER-2184 fixed rather than adding a config that will be deprecated in
> the not-too distant future.
>
> I know Zookeeper development feels more like the turtle than the hare these
> days, but Kafka is a high-visibility project, so there's a decent chance
> you'll be able to get the attention of the zookeeper maintainers to get a
> patch merged and possibly even a new release cut incorporating this fix.

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jeff Widman <je...@jeffwidman.com>.

Agree with Stephane that it's worth at least taking a shot at getting
ZOOKEEPER-2184 fixed rather than adding a config that will be deprecated in
the not-too-distant future.

I know Zookeeper development feels more like the turtle than the hare these
days, but Kafka is a high-visibility project, so there's a decent chance
you'll be able to get the attention of the zookeeper maintainers to get a
patch merged and possibly even a new release cut incorporating this fix.

On Tue, Oct 31, 2017 at 3:28 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Hi Jun,
>
> Thanks for the reply.
>
> 1) The reason I'm asking about it is I wonder if it's not worth focusing
> the development efforts on taking ownership of the existing PR (
> https://github.com/apache/zookeeper/pull/150)  to fix ZOOKEEPER-2184,
> rebase it and have it merged into the ZK codebase shortly.  I feel this KIP
> might introduce a setting that could be deprecated shortly and confuse the
> end user a bit further with one more knob to turn.
>
> 3) I'm not sure if I fully understand, sorry for the beginner's question:
> if the default timeout is infinite, then it won't change anything to how
> Kafka works from today, does it? (unless I'm missing something sorry). If
> not set to infinite, then we introduce the risk of a whole cluster shutting
> down at once?
>
> Thanks,
> Stephane


-- 

*Jeff Widman*
jeffwidman.com <http://www.jeffwidman.com/> | 740-WIDMAN-J (943-6265)
<><

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Stephane Maarek <st...@simplemachines.com.au>.

Hi Jun,

Thanks for the reply.

1) The reason I'm asking about it is I wonder if it's not worth focusing the development efforts on taking ownership of the existing PR (https://github.com/apache/zookeeper/pull/150)  to fix ZOOKEEPER-2184, rebase it and have it merged into the ZK codebase shortly.  I feel this KIP might introduce a setting that could be deprecated shortly and confuse the end user a bit further with one more knob to turn.

3) I'm not sure I fully understand, sorry for the beginner's question: if the default timeout is infinite, then it won't change anything about how Kafka works today, will it? (unless I'm missing something, sorry). If it is not set to infinite, do we then introduce the risk of a whole cluster shutting down at once?

Thanks,
Stephane

On 31/10/17, 1:00 pm, "Jun Rao" <ju...@confluent.io> wrote:

    Hi, Stephane,
    
    Thanks for the reply.
    
    1) Fixing the issue in ZK will be ideal. Not sure when it will happen
    though. Once it's fixed, we can probably deprecate this config.
    
    2) That could be useful. Is there a java api to do that at runtime? Also,
    invalidating DNS cache doesn't always fix the issue of unresolved host. In
    some of the cases, human intervention is needed.
    
    3) The default timeout is infinite though.
    
    Jun
    
    

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Jun Rao <ju...@confluent.io>.

Hi, Stephane,

Thanks for the reply.

1) Fixing the issue in ZK would be ideal. Not sure when that will happen,
though. Once it's fixed, we can probably deprecate this config.

2) That could be useful. Is there a Java API to do that at runtime? Also,
invalidating the DNS cache doesn't always fix the issue of an unresolved host.
In some cases, human intervention is needed.

3) The default timeout is infinite though.
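
For point 2, the JDK does not appear to offer a documented call that flushes
its resolver cache on demand, but the lifetime of cached lookups can be bounded
through the `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl`
security properties. The sketch below assumes a hypothetical helper
`limitDnsCaching`; the values 30 and 5 are arbitrary, and nothing here is taken
from Kafka or the KIP.

```java
import java.security.Security;

class DnsCacheTtl {
    /**
     * Hypothetical helper: bounds how long the JVM caches successful and failed
     * name lookups. Best called early at startup, since the JVM may read these
     * properties only once, before its first lookup.
     */
    static void limitDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }

    public static void main(String[] args) {
        limitDnsCaching(30, 5); // arbitrary example values, in seconds
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
    }
}
```

This only limits caching inside the JVM. It would not help if a client holds on
to addresses it resolved earlier, nor when the hostname genuinely cannot be
resolved, which is the case where human intervention is still needed.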

Jun


On Sat, Oct 28, 2017 at 11:48 PM, Stephane Maarek <
stephane@simplemachines.com.au> wrote:

> Hi Jun,
>
> I think this is very helpful. Restarting Kafka brokers in case of zookeeper
> host change is not a well known operation.
>
> Few questions:
> 1) would it not be worth fixing the problem at the source ? This has been
> stuck for a while though, maybe a little push would help :
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184
>
> 2) upon recreating the zookeeper object , is it not possible to invalidate
> the DNS cache so that it resolves the new hostname?
>
> 3) could the cluster be down in this situation: one migrates an entire
> zookeeper cluster to new machines (one by one). The quorum is still alive
> without downtime, but now every broker in a cluster can't resolve zookeeper
> at the same time. They all shut down at the same time after the new
> time-out setting.
>
> Thanks !
> Stéphane

Re: [DISCUSS] KIP-217: Expose a timeout to allow an expired ZK session to be re-created

Posted by Stephane Maarek <st...@simplemachines.com.au>.

Hi Jun,

I think this is very helpful. Restarting Kafka brokers after a ZooKeeper
host change is not a well-known operation.

A few questions:
1) Would it not be worth fixing the problem at the source? This has been
stuck for a while though; maybe a little push would help:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-2184

2) Upon recreating the ZooKeeper object, is it not possible to invalidate
the DNS cache so that it resolves the new hostname?

3) Could the cluster go down in this situation: one migrates an entire
ZooKeeper cluster to new machines (one by one). The quorum stays alive
without downtime, but now every broker in the cluster fails to resolve
ZooKeeper at the same time. They all shut down at the same time once the new
timeout setting elapses.

Thanks!
Stéphane

On 28 Oct. 2017 9:42 am, "Jun Rao" <ju...@confluent.io> wrote:

> Hi, Everyone,
>
> We created "KIP-217: Expose a timeout to allow an expired ZK session to be
> re-created".
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-217%3A+Expose+a+timeout+to+allow+an+expired+ZK+session+to+be+re-created
>
> Please take a look and provide your feedback.
>
> Thanks,
>
> Jun
>