Posted to user@geode.apache.org by Hovhannes Antonyan <ha...@vmware.com> on 2015/12/15 09:03:38 UTC

How to troubleshoot stuck distributed function calls

Hello experts,

I have a multi-node environment where one of the nodes made a broadcast call to all the other nodes and got stuck.
It is still waiting for responses, and from the heap dump I see that the ResultCollector has N-1 elements, where N is the total number of nodes. So it looks like one of the nodes didn't return a response, or it did return one but for some reason the caller has not received it.
How can I troubleshoot this issue? How can I find out exactly which node failed to return the response, and why?

Thanks in advance,
Hovhannes
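
One way to narrow down the silent node from the caller's side is a plain set difference between the members the call was sent to and the members whose results have arrived. The sketch below is only an illustration under assumptions: member ids are modelled as plain strings, and it assumes the caller records which member each result came from (for example in a custom ResultCollector). The class and method names are hypothetical and not part of Geode.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Geode class: member ids are plain strings here.
final class MissingResponders {

    /** Returns the expected members that have not yet delivered a result. */
    static Set<String> missing(Set<String> expected, Set<String> responded) {
        Set<String> result = new TreeSet<>(expected); // copy, so the inputs stay untouched
        result.removeAll(responded);                  // set difference: expected minus responded
        return result;
    }
}
```

With N expected members and N-1 recorded responders, the returned set holds the single member to investigate.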

Re: How to troubleshoot stuck distributed function calls

Posted by Barry Oglesby <bo...@pivotal.io>.

Ok. Is this reproducible? We'll probably need to see all the artifacts
(logs / stats / thread dumps) to see if we can figure out what is going on.

Barry Oglesby
GemFire Advanced Customer Engineering (ACE)
For immediate support please contact Pivotal Support at
http://support.pivotal.io/



Re: How to troubleshoot stuck distributed function calls

Posted by Hovhannes Antonyan <ha...@vmware.com>.

Hi Barry,

Yes, I am using the onMembers API, but as I already said, there is no Function Execution Processor thread running that function.


Re: How to troubleshoot stuck distributed function calls

Posted by Barry Oglesby <bo...@pivotal.io>.

I think it depends on how the function is being invoked. Below is an
example with two peers using the onMembers API. If you're invoking your
function differently (e.g. onRegion), let me know. Also, if you want to
send your thread dumps, I can take a look at them.

I have a test where I have one peer invoking a Function onMembers. If I put
a sleep in the execute method, I see these threads.

The thread in the caller (in this case the main thread) is waiting for a
reply in ReplyProcessor21.basicWait:

"main" prio=5 tid=0x00007fd04a008800 nid=0x1903 waiting on condition [0x0000000108567000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x000000010fff1ac0> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:282)
at com.gemstone.gemfire.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:55)
at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:743)
at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:819)
at com.gemstone.gemfire.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:796)
at com.gemstone.gemfire.internal.cache.execute.FunctionStreamingResultCollector.getResult(FunctionStreamingResultCollector.java:142)
at TestPeer.executeFunctionOnMembers(TestPeer.java:45)
at TestPeer.main(TestPeer.java:28)

The thread in the member processing the function (a Function Execution
Processor thread) is in the Function.execute method here:

"Function Execution Processor1" daemon prio=5 tid=0x00007fa694cb3000 nid=0xc403 waiting on condition [0x000000015f8c6000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at TestFunction.execute(TestFunction.java:13)
at com.gemstone.gemfire.internal.cache.MemberFunctionStreamingMessage.process(MemberFunctionStreamingMessage.java:185)
at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
at com.gemstone.gemfire.distributed.internal.DistributionManager$9$1.run(DistributionManager.java:1149)
at java.lang.Thread.run(Thread.java:745)
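
When a dump holds hundreds of threads, a small filter can pull out just the two kinds of thread shown above. The following is a rough sketch, not a Geode tool: it assumes the dump is plain text in which every thread block starts with a line beginning with a double quote (as in the traces above), and the class and method names are made up for illustration. The search strings "ReplyProcessor21.basicWait" and "Function Execution Processor" are taken from the traces in this message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for searching a textual thread dump.
final class ThreadDumpFilter {

    /** Splits a dump into per-thread blocks; a block starts at a line beginning with a double quote. */
    static List<String> splitThreads(String dump) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = null;
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                if (current != null) {
                    blocks.add(current.toString());
                }
                current = new StringBuilder();
            }
            // Lines before the first thread header (dump preamble) are skipped.
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    /** Returns the thread blocks whose header or frames contain the given text. */
    static List<String> matching(String dump, String needle) {
        List<String> hits = new ArrayList<>();
        for (String block : splitThreads(dump)) {
            if (block.contains(needle)) {
                hits.add(block);
            }
        }
        return hits;
    }
}
```

Calling matching(dump, "ReplyProcessor21.basicWait") on the caller's dump and matching(dump, "Function Execution Processor") on the target member's dump would list the candidate threads on each side.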


Barry Oglesby

Re: How to troubleshoot stuck distributed function calls

Posted by Hovhannes Antonyan <ha...@vmware.com>.

I have dumps of both nodes. Can you please point me to which threads I should look at? I do not see any function execution thread on the target node running that function.

But the caller node is still waiting for a response from that node. Should I look at P2P threads next? Something else?


Re: How to troubleshoot stuck distributed function calls

Posted by Barry Oglesby <bo...@pivotal.io>.

You'll want to take thread dumps (not heap dumps) on the members, especially
the one that initiated the function call and the one that didn't send a
response. Those will tell you whether the thread processing the function or
the thread processing the reply is stuck and, if so, where.
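
A thread dump of a running JVM is normally taken from outside the process, for example with the JDK's jstack tool (jstack <pid>). Where that is not convenient, a dump can also be produced from inside the JVM with the standard java.lang.management API. The sketch below is only an illustration of that approach; the class name is made up, and wiring it into a member (for instance behind some admin hook) is left as an assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: dumps the threads of the JVM it runs in.
final class InProcessThreadDump {

    /** Returns the name, state and full stack of every live thread as text. */
    static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // false/false: skip monitor and synchronizer details, which not every JVM supports.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            // Frames are printed by hand because ThreadInfo.toString() truncates deep stacks.
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

Taking a few dumps several seconds apart makes it easier to tell a thread that is truly stuck from one that is merely slow.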

Barry Oglesby

Re: How to troubleshoot stuck distributed function calls

Posted by Hovhannes Antonyan <ha...@vmware.com>.

I was looking at the heap dump and identified the node that didn't send the response.

But the question now is why it didn't send it: did it run the function, or has it not run it yet?


Re: How to troubleshoot stuck distributed function calls

Posted by Darrel Schneider <ds...@pivotal.io>.

Usually the member waiting for a response logs a warning that it has been
waiting longer than 15 seconds for a reply from a particular member. Use
that member id to identify the member that is not responding. Get a stack
dump on that member and look for a thread that is processing the message
that has not been answered. Sometimes this member also logs that it is
waiting for someone else to respond before it can respond to the first
member.

The log message to look for is: "seconds have elapsed while waiting for
replies:". It will be a warning and should be the last message logged by
that thread. Sometimes the member logs this warning and then gets the
response later, in which case it logs an info message that it did receive
the reply.
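
Searching a member's log for that phrase can be done with grep or any text tool; as a minimal sketch in Java, the scanner below prints every log line containing it. Only the quoted phrase comes from this message. The class name, the idea of passing the log file path as the first argument, and the output format are assumptions for illustration, and nothing is assumed about the rest of the log line layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds the reply-wait warning in a member's log file.
final class ReplyWaitWarningScanner {

    // Phrase quoted in the message above; the surrounding log format is not assumed.
    static final String PHRASE = "seconds have elapsed while waiting for replies:";

    /** Returns each line containing the phrase, prefixed with its 1-based line number. */
    static List<String> findWarnings(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            if (logLines.get(i).contains(PHRASE)) {
                hits.add((i + 1) + ": " + logLines.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReplyWaitWarningScanner <log-file>");
            return;
        }
        // Reads the whole file into memory; fine for a sketch, not for very large logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        findWarnings(lines).forEach(System.out::println);
    }
}
```

The matching lines should name the member being waited on; that member's log and thread dump are the next place to look.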
