You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@zookeeper.apache.org by Marshall McMullen <ma...@gmail.com> on 2015/02/17 19:01:48 UTC

One ensemble node shows massive number of 'Outstanding' requests

Greetings,

We saw an issue recently that I've never seen before and am hoping I can
get some clarity on what may cause this and whether it's a known issue. We
had a 5 node ensemble and were unable to connect to one of the ZooKeeper
instances.  When trying to connect with zkCli it would timeout. When I
connected via telnet and issued the srvr four letter word, I was surprised
to see that this one server reported a massive number of 'Outstanding'
requests. I'd never seen that really be anything other than 0 before. On
the ZK dev guide it says:

"outstanding is the number of queued requests, this increases when the
server is under load and is receiving more sustained requests than it can
process, ie the request queue". I looked at all the ZK servers in my
ensemble:

for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
grep Outstanding; done
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 0
Outstanding: 18876

I eventually killed ZK on the affected server and everything corrected
itself and Outstanding went to zero and I was able to connect again.

Is this something anyone's familiar with? I have logs if it would be
helpful.

Thanks!

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Jared Cantwell <ja...@gmail.com>.

I've confirmed that no time changing was happening before things got stuck
2 days before we noticed it.

I also just noticed that the FollowerZookeeperServer, who should be calling
commit() on the CommitProcessor, has no elements in its pendingTxns list,
which indicates that it thinks it has already passed a COMMIT message to
the CommitProcessor for every request that is stuck in the queuedRequests
list and nextPending member of CommitProcessor.

I guess I should open a JIRA for this at this point?

~Jared


On Tue, Mar 24, 2015 at 8:49 AM, Jared Cantwell <ja...@gmail.com>
wrote:

> I do not see any evidence of a time jump or date change on this node
> during recently.  I will continue to investigate.
>
> ~Jared
>
> On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:
>
>> Not this, right?
>>
>> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
>> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
>>
>> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>>
>> Patrick
>>
>>
>> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
>> <ja...@gmail.com> wrote:
>> > Greetings,
>> >
>> > We just saw this problem again, and this time we were able to capture a
>> > core file of the jvm using gdb.  I've run it through jstack and jmap to
>> get
>> > a heap profile.  I can see that the FollowerZookeeperServer has
>> > a requestsInProcess member that is ~24K.  I can also see that the
>> > CommitProcessor's queuedRequest's list has the 24K items in it, so the
>> > FinalRequestProcessor's processRequest function isn't ever getting
>> called
>> > to complete the requests.
>> >
>> > The CommitProcessor's run() is doing this:
>> >
>> > Thread 23510: (state = BLOCKED)
>> >  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may
>> be
>> > imprecise)
>> >  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
>> > line=182 (Compiled frame)
>> >
>> > Based on the state, it made it to wait() because
>> isWaitingForCommit()==true
>> > && committedRequests.isEmpty()==true.
>> >
>> > Strangely, once we detached from the jvm, it must have woken up this
>> thread
>> > and the queue flushed out as expected, bringing everything back to
>> normal.
>> >
>> > I'll keep digging, but any help or direction would be appreciated as I'm
>> > not very familiar with this area of the codebase.
>> >
>> > Thanks!
>> > Jared
>> >
>> >
>> > On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
>> > fpjunqueira@yahoo.com.invalid> wrote:
>> >
>> >> It doesn't ring a bell, but it might be worth having a look at the
>> logs to
>> >> see if there is anything unusual.
>> >>
>> >> Just to clarify, was the number of outstanding requests growing,
>> constant?
>> >> I suppose the server was following/leading and operations were going
>> >> through, otherwise it'd have dropped the connection to the leader or
>> >> leadership.
>> >>
>> >> -Flavio
>> >>
>> >> > On 17 Feb 2015, at 18:01, Marshall McMullen <
>> marshall.mcmullen@gmail.com>
>> >> wrote:
>> >> >
>> >> > Greetings,
>> >> >
>> >> > We saw an issue recently that I've never seen before and am hoping I
>> can
>> >> > get some clarity on what may cause this and whether it's a known
>> issue.
>> >> We
>> >> > had a 5 node ensemble and were unable to connect to one of the
>> ZooKeeper
>> >> > instances.  When trying to connect with zkCli it would timeout. When
>> I
>> >> > connected via telnet and issued the srvr four letter word, I was
>> >> surprised
>> >> > to see that this one server reported a massive number of
>> 'Outstanding'
>> >> > requests. I'd never seen that really be anything other than 0
>> before. On
>> >> > the ZK dev guide it says:
>> >> >
>> >> > "outstanding is the number of queued requests, this increases when
>> the
>> >> > server is under load and is receiving more sustained requests than
>> it can
>> >> > process, ie the request queue". I looked at all the ZK servers in my
>> >> > ensemble:
>> >> >
>> >> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip}
>> 2181 |
>> >> > grep Outstanding; done
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 0
>> >> > Outstanding: 18876
>> >> >
>> >> > I eventually killed ZK on the affected server and everything
>> corrected
>> >> > itself and Outstanding went to zero and I was able to connect again.
>> >> >
>> >> > Is this something anyone's familiar with? I have logs if it would be
>> >> > helpful.
>> >> >
>> >> > Thanks!
>> >>
>> >>
>>
>
>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Jared Cantwell <ja...@gmail.com>.

I do not see any evidence of a time jump or date change on this node during
recently.  I will continue to investigate.

~Jared

On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:

> Not this, right?
>
> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
>
> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>
> Patrick
>
>
> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
> <ja...@gmail.com> wrote:
> > Greetings,
> >
> > We just saw this problem again, and this time we were able to capture a
> > core file of the jvm using gdb.  I've run it through jstack and jmap to
> get
> > a heap profile.  I can see that the FollowerZookeeperServer has
> > a requestsInProcess member that is ~24K.  I can also see that the
> > CommitProcessor's queuedRequest's list has the 24K items in it, so the
> > FinalRequestProcessor's processRequest function isn't ever getting called
> > to complete the requests.
> >
> > The CommitProcessor's run() is doing this:
> >
> > Thread 23510: (state = BLOCKED)
> >  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
> > imprecise)
> >  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
> > line=182 (Compiled frame)
> >
> > Based on the state, it made it to wait() because
> isWaitingForCommit()==true
> > && committedRequests.isEmpty()==true.
> >
> > Strangely, once we detached from the jvm, it must have woken up this
> thread
> > and the queue flushed out as expected, bringing everything back to
> normal.
> >
> > I'll keep digging, but any help or direction would be appreciated as I'm
> > not very familiar with this area of the codebase.
> >
> > Thanks!
> > Jared
> >
> >
> > On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
> > fpjunqueira@yahoo.com.invalid> wrote:
> >
> >> It doesn't ring a bell, but it might be worth having a look at the logs
> to
> >> see if there is anything unusual.
> >>
> >> Just to clarify, was the number of outstanding requests growing,
> constant?
> >> I suppose the server was following/leading and operations were going
> >> through, otherwise it'd have dropped the connection to the leader or
> >> leadership.
> >>
> >> -Flavio
> >>
> >> > On 17 Feb 2015, at 18:01, Marshall McMullen <
> marshall.mcmullen@gmail.com>
> >> wrote:
> >> >
> >> > Greetings,
> >> >
> >> > We saw an issue recently that I've never seen before and am hoping I
> can
> >> > get some clarity on what may cause this and whether it's a known
> issue.
> >> We
> >> > had a 5 node ensemble and were unable to connect to one of the
> ZooKeeper
> >> > instances.  When trying to connect with zkCli it would timeout. When I
> >> > connected via telnet and issued the srvr four letter word, I was
> >> surprised
> >> > to see that this one server reported a massive number of 'Outstanding'
> >> > requests. I'd never seen that really be anything other than 0 before.
> On
> >> > the ZK dev guide it says:
> >> >
> >> > "outstanding is the number of queued requests, this increases when the
> >> > server is under load and is receiving more sustained requests than it
> can
> >> > process, ie the request queue". I looked at all the ZK servers in my
> >> > ensemble:
> >> >
> >> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181
> |
> >> > grep Outstanding; done
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 18876
> >> >
> >> > I eventually killed ZK on the affected server and everything corrected
> >> > itself and Outstanding went to zero and I was able to connect again.
> >> >
> >> > Is this something anyone's familiar with? I have logs if it would be
> >> > helpful.
> >> >
> >> > Thanks!
> >>
> >>
>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Jared Cantwell <ja...@gmail.com>.

I do not see any evidence of a time jump or date change on this node during
recently.  I will continue to investigate.

~Jared

On Mon, Mar 23, 2015 at 6:42 PM, Patrick Hunt <ph...@apache.org> wrote:

> Not this, right?
>
> http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
> http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
>
> https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/
>
> Patrick
>
>
> On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
> <ja...@gmail.com> wrote:
> > Greetings,
> >
> > We just saw this problem again, and this time we were able to capture a
> > core file of the jvm using gdb.  I've run it through jstack and jmap to
> get
> > a heap profile.  I can see that the FollowerZookeeperServer has
> > a requestsInProcess member that is ~24K.  I can also see that the
> > CommitProcessor's queuedRequest's list has the 24K items in it, so the
> > FinalRequestProcessor's processRequest function isn't ever getting called
> > to complete the requests.
> >
> > The CommitProcessor's run() is doing this:
> >
> > Thread 23510: (state = BLOCKED)
> >  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
> > imprecise)
> >  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
> > line=182 (Compiled frame)
> >
> > Based on the state, it made it to wait() because
> isWaitingForCommit()==true
> > && committedRequests.isEmpty()==true.
> >
> > Strangely, once we detached from the jvm, it must have woken up this
> thread
> > and the queue flushed out as expected, bringing everything back to
> normal.
> >
> > I'll keep digging, but any help or direction would be appreciated as I'm
> > not very familiar with this area of the codebase.
> >
> > Thanks!
> > Jared
> >
> >
> > On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
> > fpjunqueira@yahoo.com.invalid> wrote:
> >
> >> It doesn't ring a bell, but it might be worth having a look at the logs
> to
> >> see if there is anything unusual.
> >>
> >> Just to clarify, was the number of outstanding requests growing,
> constant?
> >> I suppose the server was following/leading and operations were going
> >> through, otherwise it'd have dropped the connection to the leader or
> >> leadership.
> >>
> >> -Flavio
> >>
> >> > On 17 Feb 2015, at 18:01, Marshall McMullen <
> marshall.mcmullen@gmail.com>
> >> wrote:
> >> >
> >> > Greetings,
> >> >
> >> > We saw an issue recently that I've never seen before and am hoping I
> can
> >> > get some clarity on what may cause this and whether it's a known
> issue.
> >> We
> >> > had a 5 node ensemble and were unable to connect to one of the
> ZooKeeper
> >> > instances.  When trying to connect with zkCli it would timeout. When I
> >> > connected via telnet and issued the srvr four letter word, I was
> >> surprised
> >> > to see that this one server reported a massive number of 'Outstanding'
> >> > requests. I'd never seen that really be anything other than 0 before.
> On
> >> > the ZK dev guide it says:
> >> >
> >> > "outstanding is the number of queued requests, this increases when the
> >> > server is under load and is receiving more sustained requests than it
> can
> >> > process, ie the request queue". I looked at all the ZK servers in my
> >> > ensemble:
> >> >
> >> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181
> |
> >> > grep Outstanding; done
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 0
> >> > Outstanding: 18876
> >> >
> >> > I eventually killed ZK on the affected server and everything corrected
> >> > itself and Outstanding went to zero and I was able to connect again.
> >> >
> >> > Is this something anyone's familiar with? I have logs if it would be
> >> > helpful.
> >> >
> >> > Thanks!
> >>
> >>
>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Patrick Hunt <ph...@apache.org>.

Not this, right?

http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/

Patrick


On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
<ja...@gmail.com> wrote:
> Greetings,
>
> We just saw this problem again, and this time we were able to capture a
> core file of the jvm using gdb.  I've run it through jstack and jmap to get
> a heap profile.  I can see that the FollowerZookeeperServer has
> a requestsInProcess member that is ~24K.  I can also see that the
> CommitProcessor's queuedRequest's list has the 24K items in it, so the
> FinalRequestProcessor's processRequest function isn't ever getting called
> to complete the requests.
>
> The CommitProcessor's run() is doing this:
>
> Thread 23510: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
> imprecise)
>  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
> line=182 (Compiled frame)
>
> Based on the state, it made it to wait() because isWaitingForCommit()==true
> && committedRequests.isEmpty()==true.
>
> Strangely, once we detached from the jvm, it must have woken up this thread
> and the queue flushed out as expected, bringing everything back to normal.
>
> I'll keep digging, but any help or direction would be appreciated as I'm
> not very familiar with this area of the codebase.
>
> Thanks!
> Jared
>
>
> On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
> fpjunqueira@yahoo.com.invalid> wrote:
>
>> It doesn't ring a bell, but it might be worth having a look at the logs to
>> see if there is anything unusual.
>>
>> Just to clarify, was the number of outstanding requests growing, constant?
>> I suppose the server was following/leading and operations were going
>> through, otherwise it'd have dropped the connection to the leader or
>> leadership.
>>
>> -Flavio
>>
>> > On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com>
>> wrote:
>> >
>> > Greetings,
>> >
>> > We saw an issue recently that I've never seen before and am hoping I can
>> > get some clarity on what may cause this and whether it's a known issue.
>> We
>> > had a 5 node ensemble and were unable to connect to one of the ZooKeeper
>> > instances.  When trying to connect with zkCli it would timeout. When I
>> > connected via telnet and issued the srvr four letter word, I was
>> surprised
>> > to see that this one server reported a massive number of 'Outstanding'
>> > requests. I'd never seen that really be anything other than 0 before. On
>> > the ZK dev guide it says:
>> >
>> > "outstanding is the number of queued requests, this increases when the
>> > server is under load and is receiving more sustained requests than it can
>> > process, ie the request queue". I looked at all the ZK servers in my
>> > ensemble:
>> >
>> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
>> > grep Outstanding; done
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 18876
>> >
>> > I eventually killed ZK on the affected server and everything corrected
>> > itself and Outstanding went to zero and I was able to connect again.
>> >
>> > Is this something anyone's familiar with? I have logs if it would be
>> > helpful.
>> >
>> > Thanks!
>>
>>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Patrick Hunt <ph...@apache.org>.

Not this, right?

http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6900441
http://osdir.com/ml/hotspot-runtime-dev-java/2013-09/msg00006.html
https://bbossola.wordpress.com/2013/09/04/jvm-issue-concurrency-is-affected-by-changing-the-date-of-the-system/

Patrick


On Mon, Mar 23, 2015 at 5:00 PM, Jared Cantwell
<ja...@gmail.com> wrote:
> Greetings,
>
> We just saw this problem again, and this time we were able to capture a
> core file of the jvm using gdb.  I've run it through jstack and jmap to get
> a heap profile.  I can see that the FollowerZookeeperServer has
> a requestsInProcess member that is ~24K.  I can also see that the
> CommitProcessor's queuedRequest's list has the 24K items in it, so the
> FinalRequestProcessor's processRequest function isn't ever getting called
> to complete the requests.
>
> The CommitProcessor's run() is doing this:
>
> Thread 23510: (state = BLOCKED)
>  - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
> imprecise)
>  - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
> line=182 (Compiled frame)
>
> Based on the state, it made it to wait() because isWaitingForCommit()==true
> && committedRequests.isEmpty()==true.
>
> Strangely, once we detached from the jvm, it must have woken up this thread
> and the queue flushed out as expected, bringing everything back to normal.
>
> I'll keep digging, but any help or direction would be appreciated as I'm
> not very familiar with this area of the codebase.
>
> Thanks!
> Jared
>
>
> On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
> fpjunqueira@yahoo.com.invalid> wrote:
>
>> It doesn't ring a bell, but it might be worth having a look at the logs to
>> see if there is anything unusual.
>>
>> Just to clarify, was the number of outstanding requests growing, constant?
>> I suppose the server was following/leading and operations were going
>> through, otherwise it'd have dropped the connection to the leader or
>> leadership.
>>
>> -Flavio
>>
>> > On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com>
>> wrote:
>> >
>> > Greetings,
>> >
>> > We saw an issue recently that I've never seen before and am hoping I can
>> > get some clarity on what may cause this and whether it's a known issue.
>> We
>> > had a 5 node ensemble and were unable to connect to one of the ZooKeeper
>> > instances.  When trying to connect with zkCli it would timeout. When I
>> > connected via telnet and issued the srvr four letter word, I was
>> surprised
>> > to see that this one server reported a massive number of 'Outstanding'
>> > requests. I'd never seen that really be anything other than 0 before. On
>> > the ZK dev guide it says:
>> >
>> > "outstanding is the number of queued requests, this increases when the
>> > server is under load and is receiving more sustained requests than it can
>> > process, ie the request queue". I looked at all the ZK servers in my
>> > ensemble:
>> >
>> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
>> > grep Outstanding; done
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 0
>> > Outstanding: 18876
>> >
>> > I eventually killed ZK on the affected server and everything corrected
>> > itself and Outstanding went to zero and I was able to connect again.
>> >
>> > Is this something anyone's familiar with? I have logs if it would be
>> > helpful.
>> >
>> > Thanks!
>>
>>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Jared Cantwell <ja...@gmail.com>.

Greetings,

We just saw this problem again, and this time we were able to capture a
core file of the jvm using gdb.  I've run it through jstack and jmap to get
a heap profile.  I can see that the FollowerZookeeperServer has
a requestsInProcess member that is ~24K.  I can also see that the
CommitProcessor's queuedRequest's list has the 24K items in it, so the
FinalRequestProcessor's processRequest function isn't ever getting called
to complete the requests.

The CommitProcessor's run() is doing this:

Thread 23510: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
line=182 (Compiled frame)

Based on the state, it made it to wait() because isWaitingForCommit()==true
&& committedRequests.isEmpty()==true.

Strangely, once we detached from the jvm, it must have woken up this thread
and the queue flushed out as expected, bringing everything back to normal.

I'll keep digging, but any help or direction would be appreciated as I'm
not very familiar with this area of the codebase.

Thanks!
Jared


On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
fpjunqueira@yahoo.com.invalid> wrote:

> It doesn't ring a bell, but it might be worth having a look at the logs to
> see if there is anything unusual.
>
> Just to clarify, was the number of outstanding requests growing, constant?
> I suppose the server was following/leading and operations were going
> through, otherwise it'd have dropped the connection to the leader or
> leadership.
>
> -Flavio
>
> > On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com>
> wrote:
> >
> > Greetings,
> >
> > We saw an issue recently that I've never seen before and am hoping I can
> > get some clarity on what may cause this and whether it's a known issue.
> We
> > had a 5 node ensemble and were unable to connect to one of the ZooKeeper
> > instances.  When trying to connect with zkCli it would timeout. When I
> > connected via telnet and issued the srvr four letter word, I was
> surprised
> > to see that this one server reported a massive number of 'Outstanding'
> > requests. I'd never seen that really be anything other than 0 before. On
> > the ZK dev guide it says:
> >
> > "outstanding is the number of queued requests, this increases when the
> > server is under load and is receiving more sustained requests than it can
> > process, ie the request queue". I looked at all the ZK servers in my
> > ensemble:
> >
> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
> > grep Outstanding; done
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 18876
> >
> > I eventually killed ZK on the affected server and everything corrected
> > itself and Outstanding went to zero and I was able to connect again.
> >
> > Is this something anyone's familiar with? I have logs if it would be
> > helpful.
> >
> > Thanks!
>
>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Jared Cantwell <ja...@gmail.com>.

Greetings,

We just saw this problem again, and this time we were able to capture a
core file of the jvm using gdb.  I've run it through jstack and jmap to get
a heap profile.  I can see that the FollowerZookeeperServer has
a requestsInProcess member that is ~24K.  I can also see that the
CommitProcessor's queuedRequest's list has the 24K items in it, so the
FinalRequestProcessor's processRequest function isn't ever getting called
to complete the requests.

The CommitProcessor's run() is doing this:

Thread 23510: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be
imprecise)
 - org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165,
line=182 (Compiled frame)

Based on the state, it made it to wait() because isWaitingForCommit()==true
&& committedRequests.isEmpty()==true.

Strangely, once we detached from the jvm, it must have woken up this thread
and the queue flushed out as expected, bringing everything back to normal.

I'll keep digging, but any help or direction would be appreciated as I'm
not very familiar with this area of the codebase.

Thanks!
Jared


On Tue, Feb 17, 2015 at 2:38 PM, Flavio Junqueira <
fpjunqueira@yahoo.com.invalid> wrote:

> It doesn't ring a bell, but it might be worth having a look at the logs to
> see if there is anything unusual.
>
> Just to clarify, was the number of outstanding requests growing, constant?
> I suppose the server was following/leading and operations were going
> through, otherwise it'd have dropped the connection to the leader or
> leadership.
>
> -Flavio
>
> > On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com>
> wrote:
> >
> > Greetings,
> >
> > We saw an issue recently that I've never seen before and am hoping I can
> > get some clarity on what may cause this and whether it's a known issue.
> We
> > had a 5 node ensemble and were unable to connect to one of the ZooKeeper
> > instances.  When trying to connect with zkCli it would timeout. When I
> > connected via telnet and issued the srvr four letter word, I was
> surprised
> > to see that this one server reported a massive number of 'Outstanding'
> > requests. I'd never seen that really be anything other than 0 before. On
> > the ZK dev guide it says:
> >
> > "outstanding is the number of queued requests, this increases when the
> > server is under load and is receiving more sustained requests than it can
> > process, ie the request queue". I looked at all the ZK servers in my
> > ensemble:
> >
> > for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
> > grep Outstanding; done
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 0
> > Outstanding: 18876
> >
> > I eventually killed ZK on the affected server and everything corrected
> > itself and Outstanding went to zero and I was able to connect again.
> >
> > Is this something anyone's familiar with? I have logs if it would be
> > helpful.
> >
> > Thanks!
>
>

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.

It doesn't ring a bell, but it might be worth having a look at the logs to see if there is anything unusual. 

Just to clarify, was the number of outstanding requests growing, constant? I suppose the server was following/leading and operations were going through, otherwise it'd have dropped the connection to the leader or leadership.

-Flavio 

> On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com> wrote:
> 
> Greetings,
> 
> We saw an issue recently that I've never seen before and am hoping I can
> get some clarity on what may cause this and whether it's a known issue. We
> had a 5 node ensemble and were unable to connect to one of the ZooKeeper
> instances.  When trying to connect with zkCli it would timeout. When I
> connected via telnet and issued the srvr four letter word, I was surprised
> to see that this one server reported a massive number of 'Outstanding'
> requests. I'd never seen that really be anything other than 0 before. On
> the ZK dev guide it says:
> 
> "outstanding is the number of queued requests, this increases when the
> server is under load and is receiving more sustained requests than it can
> process, ie the request queue". I looked at all the ZK servers in my
> ensemble:
> 
> for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
> grep Outstanding; done
> Outstanding: 0
> Outstanding: 0
> Outstanding: 0
> Outstanding: 0
> Outstanding: 18876
> 
> I eventually killed ZK on the affected server and everything corrected
> itself and Outstanding went to zero and I was able to connect again.
> 
> Is this something anyone's familiar with? I have logs if it would be
> helpful.
> 
> Thanks!

Re: One ensemble node shows massive number of 'Outstanding' requests

Posted by Flavio Junqueira <fp...@yahoo.com.INVALID>.

It doesn't ring a bell, but it might be worth having a look at the logs to see if there is anything unusual. 

Just to clarify, was the number of outstanding requests growing, constant? I suppose the server was following/leading and operations were going through, otherwise it'd have dropped the connection to the leader or leadership.

-Flavio 

> On 17 Feb 2015, at 18:01, Marshall McMullen <ma...@gmail.com> wrote:
> 
> Greetings,
> 
> We saw an issue recently that I've never seen before and am hoping I can
> get some clarity on what may cause this and whether it's a known issue. We
> had a 5 node ensemble and were unable to connect to one of the ZooKeeper
> instances.  When trying to connect with zkCli it would timeout. When I
> connected via telnet and issued the srvr four letter word, I was surprised
> to see that this one server reported a massive number of 'Outstanding'
> requests. I'd never seen that really be anything other than 0 before. On
> the ZK dev guide it says:
> 
> "outstanding is the number of queued requests, this increases when the
> server is under load and is receiving more sustained requests than it can
> process, ie the request queue". I looked at all the ZK servers in my
> ensemble:
> 
> for ip in 101 102 103 104 105; do echo srvr | nc 172.21.20.${ip} 2181 |
> grep Outstanding; done
> Outstanding: 0
> Outstanding: 0
> Outstanding: 0
> Outstanding: 0
> Outstanding: 18876
> 
> I eventually killed ZK on the affected server and everything corrected
> itself and Outstanding went to zero and I was able to connect again.
> 
> Is this something anyone's familiar with? I have logs if it would be
> helpful.
> 
> Thanks!