Posted to users@kafka.apache.org by Manikumar Reddy <ku...@nmsworks.co.in> on 2015/04/24 16:35:18 UTC

New producer: metadata update problem on 2 Node cluster.

We are testing the new producer on a 2-node cluster.
Under some node-failure scenarios, the producer is not
able to update its metadata.

Steps to reproduce:
1. Form a 2-node cluster (K1, K2).
2. Create a topic with a single partition, replication factor = 2.
3. Start producing data (producer metadata: K1, K2).
4. Kill the leader node (say K1).
5. K2 becomes the leader (producer metadata: K2).
6. Bring back K1 and kill K2 before metadata.max.age.ms expires.
7. K1 becomes the leader (producer metadata still contains only K2).

After this point, the producer is not able to update its metadata
and continuously tries to connect to the dead node (K2).
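
For reference, the producer in this test is set up roughly as in the
sketch below (broker addresses k1:9092 / k2:9092 and the topic name
"test" are placeholders; everything else is left at the defaults):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ReproProducer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            // Both brokers go into the bootstrap list (placeholders for K1, K2).
            props.put("bootstrap.servers", "k1:9092,k2:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer =
                new KafkaProducer<String, String>(props);
            // Keep producing while the brokers are killed/restarted
            // as in steps 4-7 above.
            while (true) {
                producer.send(new ProducerRecord<String, String>("test", "msg"));
                Thread.sleep(1000);
            }
        }
    }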

This looks like a bug to me. Am I missing anything?

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Rahul Jain <ra...@gmail.com>.
Sorry, I meant creating a new producer, not consumer.

Here's the code.

Producer - http://pastebin.com/Kqq1ymCX
Consumer - http://pastebin.com/i2Z8PTYB
Callback - http://pastebin.com/x253z7bG

As you'll notice, I am creating a new producer for each message, so the
metadata should be re-fetched from the bootstrap nodes every time.

I have a single topic (receive.queue) replicated across 3 nodes. I add all
3 nodes to the bootstrap list. On bringing one of the nodes down, some
messages start failing (metadata update timeout error).

As I mentioned earlier, the problem goes away simply by setting the
reconnect.backoff.ms property to 1000ms.
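
For anyone trying the same thing: the workaround is this single extra
property on the producer config; everything else stays unchanged.

    // Workaround: back off 1000 ms between reconnect attempts to a dead
    // node. Why this helps is not fully clear; it appears related to how
    // the client's leastLoadedNode logic picks candidate nodes.
    props.put("reconnect.backoff.ms", "1000");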

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Rahul, the mailing list filters attachments, you'd have to post the code
somewhere else for people to be able to see it.

But I don't think anyone suggested that creating a new consumer would fix
anything. Creating a new producer *and discarding the old one* basically
just makes it start from scratch using the bootstrap nodes, which is why
that would allow recovery from that condition.
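
Roughly, that workaround looks like the sketch below (made-up variable
names; producerProps is whatever config you already use):

    // Discard the stuck producer and start from scratch: the new instance
    // re-bootstraps its metadata from bootstrap.servers.
    producer.close();
    producer = new KafkaProducer<String, String>(producerProps);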

But that's just a workaround. The real issue is that the producer only
maintains metadata for the nodes that are replicas for the partitions of
the topics the producer sends data to. In some cases, this is a small set
of servers and can get the producer stuck if a node goes offline and it
doesn't have any other nodes that it can try to communicate with to get
updated metadata (since the topic partitions should have a new leader).
Falling back on the original bootstrap servers is one solution to this
problem. Another would be to maintain metadata for additional servers so
you always have extra "bootstrap" nodes in your current metadata set, even
if they aren't replicas for any of the topics you're working with.

-Ewen

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Rahul Jain <ra...@gmail.com>.
Creating a new consumer instance *does not* solve this problem.

Attaching the producer/consumer code that I used for testing.

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
I'm not sure about the old producer's behavior in this same failure
scenario, but creating a new producer instance would resolve the issue
since it would start with the list of bootstrap nodes and, assuming at
least one of them was up, would be able to fetch up-to-date metadata.

-- 
Thanks,
Ewen

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Jason Rosenberg <jb...@squareup.com>.
Can you clarify: is this issue specific to the "new" producer?  With
the "old" producer, we routinely construct a new producer, which makes a
fresh metadata request (via a VIP connected to all nodes in the cluster).
Would this approach work with the new producer?
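
Concretely, our old-producer config just points at the VIP, something
like the line below (host name made up):

    // Old (Scala) producer config: the VIP resolves to all brokers, so a
    // freshly constructed producer can always reach a live node for
    // metadata.
    props.put("metadata.broker.list", "kafka-vip.example.com:9092");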

Jason

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Rahul Jain <ra...@gmail.com>.
Mayuresh,
I was testing this in a development environment and manually brought down a
node to simulate this. So the dead node never came back up.

My colleague and I were able to consistently see this behaviour several
times during the testing.

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Mayuresh Gharat <gh...@gmail.com>.
I agree that, to find the least-loaded node, the producer should fall
back to the bootstrap nodes if it's not able to connect to any node in
the current metadata. That should resolve this.

Rahul, I suspect the problem went away because the dead node in your case
might have come back up and allowed a metadata update. Can you confirm
this?

Thanks,

Mayuresh

-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Rahul Jain <ra...@gmail.com>.
We observed the exact same error. We are not very clear about the root
cause, although it appears to be related to the leastLoadedNode
implementation. Interestingly, the problem went away after increasing the
value of reconnect.backoff.ms to 1000ms.

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Ok, all of that makes sense. The only way to possibly recover from that
state is either for K2 to come back up allowing the metadata refresh to
eventually succeed or to eventually try some other node in the cluster.
Reusing the bootstrap nodes is one possibility. Another would be for the
client to get more metadata than is required for the topics it needs in
order to ensure it has more nodes to use as options when looking for a node
to fetch metadata from. I added your description to KAFKA-1843, although it
might also make sense as a separate bug since fixing it could be considered
incremental progress towards resolving 1843.

-- 
Thanks,
Ewen

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Manikumar Reddy <ku...@nmsworks.co.in>.
Hi Ewen,

 Thanks for the response. I agree with you; in some cases we should use
the bootstrap servers.


>
> If you have logs at debug level, are you seeing this message in between the
> connection attempts:
>
> Give up sending metadata request since no node is available
>

 Yes, this log appeared a couple of times.


>
> Also, if you let it continue running, does it recover after the
> metadata.max.age.ms timeout?
>

 It does not reconnect. It continuously tries to connect to the dead
node.
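
(A way to check this programmatically, if useful: partitionsFor() forces
a metadata fetch, so on the stuck producer we would expect a call like
the one below -- topic name made up -- to time out rather than return
the new leader K1.)

    // Forces a metadata fetch; on the stuck producer this should block
    // and then fail (after metadata.fetch.timeout.ms) instead of
    // returning the new leader.
    List<PartitionInfo> partitions = producer.partitionsFor("mytopic");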


-Manikumar

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.
Maybe add this to the description of
https://issues.apache.org/jira/browse/KAFKA-1843? I can't find it now, but
I think there was another bug where I described a similar problem -- in
some cases it makes sense to fall back to the list of bootstrap nodes
because you've gotten into a bad state and can't make any progress without
a metadata update but can't connect to any nodes. The leastLoadedNode
method only considers nodes in the current metadata, so in your example K1
is not considered an option after seeing the producer metadata update that
only includes K2. In KAFKA-1501 I also found another obscure edge case
where you can run into this problem if your broker hostnames/ports aren't
consistent across restarts. Yours is obviously much more likely to occur,
and may not even be that uncommon for producers that are only sending data
to one topic.

If you have logs at debug level, are you seeing this message in between the
connection attempts:

Give up sending metadata request since no node is available

Also, if you let it continue running, does it recover after the
metadata.max.age.ms timeout? If so, I think that would definitely confirm
the issue and might suggest a fix -- preserve the bootstrap metadata and
fall back to choosing a node from it when leastLoadedNode would otherwise
return null.
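
In pseudocode, the shape of that fix would be something like the sketch
below (illustrative only, not the actual NetworkClient code; randomNode
and bootstrapNodes are made-up names):

    // Illustrative sketch only -- not the real client internals. Preserve
    // the nodes parsed from bootstrap.servers at startup and fall back to
    // them when the current metadata yields no connectable node.
    private Node nodeForMetadataRequest(long now) {
        Node node = leastLoadedNode(now); // considers current metadata only
        if (node == null) {
            // e.g. metadata contains only K2 and K2 is down: pick one of
            // the preserved bootstrap nodes instead of giving up.
            node = randomNode(bootstrapNodes);
        }
        return node;
    }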

-Ewen

Re: New producer: metadata update problem on 2 Node cluster.

Posted by Manikumar Reddy <ma...@gmail.com>.
Any comments on this issue?