Posted to dev@geode.apache.org by Gregory Vortman <Gr...@Amdocs.com> on 2018/02/20 09:04:31 UTC

[Proposal] Thread monitoring mechanism

Hello team,
One of the most severe issues affecting our real-time application is stuck threads. Threads get stuck for multiple reasons, such as long-held locks, deadlocks, and threads that wait forever for a reply after a packet drop.
Stuck threads of this kind fly under the radar of the existing system health check methods.
In mission-critical applications, this results in an immediate outage.

As a short-term measure, we are implementing a kind of internal watchdog mechanism to detect stuck threads:
               There is a registration object.
               The function executor has start/end hooks that register/unregister the thread via the registration object.
               A customized scheduled monitoring thread is spawned on startup. It wakes up every N seconds, scans the registration map, and detects threads that have remained registered (that is, not yet unregistered) for longer than a configurable time.
Once such threads have been detected, a process stack dump is taken and a thread stack statistic metric is provided.
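
The watchdog described above could be sketched in Java roughly as follows. This is only a minimal illustration under stated assumptions, not Geode API: the class name StuckThreadWatchdog, its method names, and the injectable clock are all hypothetical, and the "statistic" is reduced to printing the suspect thread's stack trace to stderr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical watchdog; all names are illustrative and not part of Geode. */
class StuckThreadWatchdog {
  // Registration map: thread -> clock reading (nanoseconds) taken when it registered.
  private final Map<Thread, Long> registered = new ConcurrentHashMap<>();
  private final long thresholdNanos;
  private final LongSupplier nanoClock; // injectable so the scan can be tested without sleeping
  private final ScheduledExecutorService scanner =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "stuck-thread-watchdog");
        t.setDaemon(true); // the watchdog must not keep the JVM alive
        return t;
      });

  StuckThreadWatchdog(long threshold, TimeUnit unit, LongSupplier nanoClock) {
    this.thresholdNanos = unit.toNanos(threshold);
    this.nanoClock = nanoClock;
  }

  /** Start hook: called by the worker thread before it begins its task. */
  void register() {
    registered.put(Thread.currentThread(), nanoClock.getAsLong());
  }

  /** End hook: called by the worker thread (ideally in a finally block) when done. */
  void unregister() {
    registered.remove(Thread.currentThread());
  }

  /** Returns the threads that have stayed registered longer than the threshold. */
  List<Thread> findStuckThreads() {
    long now = nanoClock.getAsLong();
    List<Thread> stuck = new ArrayList<>();
    for (Map.Entry<Thread, Long> entry : registered.entrySet()) {
      if (now - entry.getValue() > thresholdNanos) {
        stuck.add(entry.getKey());
      }
    }
    return stuck;
  }

  /** Starts the periodic scan; each suspect's stack trace is printed to stderr. */
  void start(long period, TimeUnit unit) {
    scanner.scheduleAtFixedRate(() -> {
      for (Thread t : findStuckThreads()) {
        System.err.println("Possibly stuck thread: " + t.getName());
        for (StackTraceElement frame : t.getStackTrace()) {
          System.err.println("    at " + frame);
        }
      }
    }, period, period, unit);
  }

  /** Stops the periodic scan. */
  void stop() {
    scanner.shutdownNow();
  }
}
```

A caller would create it with something like new StuckThreadWatchdog(30, TimeUnit.SECONDS, System::nanoTime), call start(5, TimeUnit.SECONDS) once at startup, and bracket each unit of work with register() and unregister().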

This helps us monitor, detect, and decide quickly which action to take. Usually the decision is to bounce the member (a consistency issue is possible, but in our case that is better than a denial of service).
The above solution does not touch the Geode core code; it is implemented entirely within our customized code.

I would like to propose introducing a long-term, generic thread monitoring mechanism to detect threads that are stuck for any reason.
It would maintain a monitoring object with start/end methods that are invoked similarly to FunctionStats.startFunctionExecution and FunctionStats.endFunctionExecution.
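
To illustrate the start/end calling pattern, here is a small Java sketch. It does not use or reproduce the FunctionStats API; the class ThreadMonitoring and its methods startMonitoring, endMonitoring, activeCount, and runMonitored are hypothetical names chosen only for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical monitoring object; not an existing Geode class. */
class ThreadMonitoring {
  // Thread -> wall-clock time (ms) at which its current unit of work started.
  private final Map<Thread, Long> active = new ConcurrentHashMap<>();

  /** Called right before the monitored work begins. */
  void startMonitoring() {
    active.put(Thread.currentThread(), System.currentTimeMillis());
  }

  /** Called when the monitored work ends, whether it succeeded or failed. */
  void endMonitoring() {
    active.remove(Thread.currentThread());
  }

  /** Number of threads currently inside a monitored section. */
  int activeCount() {
    return active.size();
  }

  /** Brackets a task with the start/end calls so the end call cannot be skipped. */
  void runMonitored(Runnable task) {
    startMonitoring();
    try {
      task.run();
    } finally {
      endMonitoring(); // runs even if the task throws
    }
  }
}
```

The try/finally bracket matters for a stuck-thread detector: if the end call were skipped when a task throws, the thread would stay in the map and later be reported as stuck even though it had finished.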

Your feedback would be appreciated.

Thank you for your cooperation.
Best regards!

Gregory Vortman

This message and the information contained herein is proprietary and confidential and subject to the Amdocs policy statement,

you may review at https://www.amdocs.com/about/email-disclaimer <https://www.amdocs.com/about/email-disclaimer>

Re: [Proposal] Thread monitoring mechanism

Posted by Anilkumar Gingade <ag...@pivotal.io>.
Good idea; we need better tools (ecosystems) to manage and monitor Geode
resources.
In Geode, work is often handed off to other low-level threads
(messaging) or to new threads/runnables. It would be nice to have some
mechanism to associate the main work thread with the low-level thread; that
would give a better indication of who is waiting on whom.
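
One possible way to record such an association is to wrap the task at hand-off time, as in the Java sketch below. This is only an assumption about how it might look: HandoffTracker, track, and originOf are hypothetical names, not existing Geode classes or methods.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper; not an existing Geode class. */
class HandoffTracker {
  // Low-level thread -> the thread that handed it its current work.
  private final Map<Thread, Thread> origins = new ConcurrentHashMap<>();

  /** Wraps a task on the submitting thread so the worker can record who submitted it. */
  Runnable track(Runnable task) {
    Thread origin = Thread.currentThread(); // captured at hand-off time
    return () -> {
      origins.put(Thread.currentThread(), origin);
      try {
        task.run();
      } finally {
        origins.remove(Thread.currentThread());
      }
    };
  }

  /** Returns the thread that handed work to the given worker, or null if none is recorded. */
  Thread originOf(Thread worker) {
    return origins.get(worker);
  }
}
```

A monitor that finds a low-level thread stuck could then call originOf on it and report the waiting work thread alongside the stack trace.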

-Anil




On Fri, Feb 23, 2018 at 3:08 PM, Barry Oglesby <bo...@pivotal.io> wrote:

> A lot of the Geode thread pools are defined in ClusterDistributionManager.
> Most of these use custom ThreadPoolExecutors like:
>
> SerialQueuedExecutorWithDMStats
> PooledExecutorWithDMStats
> FunctionExecutionPooledExecutor
>
> These classes all extend ThreadPoolExecutor and override beforeExecute and
> afterExecute. These methods are currently used by helper classes to update
> the stats before and after a thread executes. Potentially these same
> methods could be used to add and remove a thread from a monitor. For
> example, there could be a FunctionExecutionThreadMonitor that is created as
> part of the FunctionExecutionPooledExecutor whose job it would be to
> monitor FunctionExecution threads. The beforeExecute method would add the
> thread to the monitor; the afterExecute would remove the thread from the
> monitor.
>
> I would be mindful about the performance impact of adding these monitors,
> though.
>
>
> Thanks,
> Barry Oglesby

Re: [Proposal] Thread monitoring mechanism

Posted by Barry Oglesby <bo...@pivotal.io>.
A lot of the Geode thread pools are defined in ClusterDistributionManager.
Most of these use custom ThreadPoolExecutors like:

SerialQueuedExecutorWithDMStats
PooledExecutorWithDMStats
FunctionExecutionPooledExecutor

These classes all extend ThreadPoolExecutor and override beforeExecute and
afterExecute. These methods are currently used by helper classes to update
the stats before and after a thread executes. Potentially these same
methods could be used to add and remove a thread from a monitor. For
example, there could be a FunctionExecutionThreadMonitor that is created as
part of the FunctionExecutionPooledExecutor whose job it would be to
monitor FunctionExecution threads. The beforeExecute method would add the
thread to the monitor; the afterExecute would remove the thread from the
monitor.
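
As a rough Java sketch of that idea, the executor below adds the worker thread to a map in beforeExecute and removes it in afterExecute. It is a simplified stand-in, not the actual FunctionExecutionPooledExecutor: the class name MonitoredPoolExecutor, the monitoredCount method, and the use of a plain map as the "monitor" are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical executor; a simplified stand-in for the Geode executors named above. */
class MonitoredPoolExecutor extends ThreadPoolExecutor {
  // Worker thread -> wall-clock time (ms) at which it picked up its current task.
  private final Map<Thread, Long> running = new ConcurrentHashMap<>();

  MonitoredPoolExecutor(int poolSize) {
    super(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
        new LinkedBlockingQueue<Runnable>());
  }

  @Override
  protected void beforeExecute(Thread t, Runnable r) {
    super.beforeExecute(t, r);
    running.put(t, System.currentTimeMillis()); // add the thread to the monitor
  }

  @Override
  protected void afterExecute(Runnable r, Throwable ex) {
    // afterExecute is invoked by the worker thread that ran the task.
    running.remove(Thread.currentThread()); // remove the thread from the monitor
    super.afterExecute(r, ex);
  }

  /** Number of worker threads currently executing a task. */
  int monitoredCount() {
    return running.size();
  }
}
```

The per-task cost in this sketch is one map put and one map remove; whether that is acceptable on hot paths would need to be measured.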

I would be mindful about the performance impact of adding these monitors,
though.


Thanks,
Barry Oglesby


On Wed, Feb 21, 2018 at 11:41 AM, Gregory Vortman <
Gregory.Vortman@amdocs.com> wrote:

> That's exactly the point: to have a single, very thin, generic mechanism
> that covers all threads/thread pools. Nothing in this solution is specific
> to a particular thread type.
> Regards

RE: [Proposal] Thread monitoring mechanism

Posted by Gregory Vortman <Gr...@Amdocs.com>.
That's exactly the point: to have a single, very thin, generic mechanism that covers all threads/thread pools. Nothing in this solution is specific to a particular thread type.
Regards


-----Original Message-----
From: Jason Huynh [jhuynh@pivotal.io]
Received: Wednesday, 21 Feb 2018, 20:54
To: dev@geode.apache.org [dev@geode.apache.org]
CC: user@geode.apache.org [user@geode.apache.org]
Subject: Re: [Proposal] Thread monitoring mechanism

I am assuming this would be for all threads/thread pools and not specific to Function threads.  I wonder what the impact would be on put/get operations, or whether we are going to target specific operations.




Re: [Proposal] Thread monitoring mechanism

Posted by Jason Huynh <jh...@pivotal.io>.
I am assuming this would be for all threads/thread pools and not specific to
Function threads.  I wonder what the impact would be on put/get operations,
or whether we are going to target specific operations.



