You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2023/01/02 13:31:57 UTC

Re: [DISCUSS] PIP-232: Introduce thread monitor to check if thread is blocked for long time.

This is an interesting proposal. However, I'd suggest changes to the current proposal.

I think that the current proposal is too invasive for the Pulsar code base. "Introduce thread monitor to check if thread is blocking for long time." seems to mean multiple things. 
When looking at the PR, it seems to be a solution for detecting long running tasks. Just FYI, that Bookkeeper has a solution for this in it's OrderedExecutor with a setting called enableTaskExecutionStats=true . I'm not saying that it would be the preferred way to implement it.

If the goal is to detect actual blocking code that is run with threads that should run only non-blocking code, there's a better tool called Reactor BlockHound (https://github.com/reactor/BlockHound) for that purpose. 
For actual profiling of the code base, Java Flight Recorder and Async Profiler are better solutions.

It seems that one part of the problem is that there aren't metrics for the thread pools. As an alternative implementation for the proposed PIP-232, I'd suggest that basic metrics (backlog / queue size, active thread count, number of executed tasks, etc)  are added for the thread pools. For example, Micrometer contains a decorator for many thread pool implementations, https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/jvm/ExecutorServiceMetrics.java . A similar solution would be very useful in Pulsar for adding the thread pool metrics.

Tracking individual tasks requires more resources, and that's why I'd suggest adding the basic metrics and making them enabled by default. Some more advanced metrics would be useful, such as tracking the thread pool queue waiting time. Adding a low overhead thread pool queue waiting time could be done with a sampling approach. The benefit of that is that there won't be a need to wrap all tasks that are executed. There would be several ways to implement the queue waiting time metric. 

I assume that "blocking" itself might not be the problem and therefore having basic metrics (backlog size, active threads, executed tasks counter, failed tasks counter) for the thread pools is more essential. There's a lot of good things about the PIP-232 proposal and I believe that iterating on the ideas will propose a good outcome.

-Lari

On 2022/12/19 12:17:09 adobewjl wrote:
> Hello pulsar community,
> I've opened `PIP-232: Introduce thread monitor to check if thread is blocked for long time.` to discuss.
> For more details, please read the PIP at https://github.com/apache/pulsar/issues/18985
> I'm looking forward to hearing what you think. 
> Also the demo PR link at https://github.com/apache/pulsar/pull/18958

Re: [DISCUSS] PIP-232: Introduce thread monitor to check if thread is blocked for long time.

Posted by Enrico Olivelli <eo...@gmail.com>.
Il Lun 2 Gen 2023, 14:32 Lari Hotari <lh...@apache.org> ha scritto:

> This is an interesting proposal. However, I'd suggest changes to the
> current proposal.
>
> I think that the current proposal is too invasive for the Pulsar code
> base. "Introduce thread monitor to check if thread is blocking for long
> time." seems to mean multiple things.
> When looking at the PR, it seems to be a solution for detecting long
> running tasks. Just FYI, that Bookkeeper has a solution for this in it's
> OrderedExecutor with a setting called enableTaskExecutionStats=true . I'm
> not saying that it would be the preferred way to implement it.
>
> If the goal is to detect actual blocking code that is run with threads
> that should run only non-blocking code, there's a better tool called
> Reactor BlockHound (https://github.com/reactor/BlockHound) for that
> purpose.
> For actual profiling of the code base, Java Flight Recorder and Async
> Profiler are better solutions.
>
> It seems that one part of the problem is that there aren't metrics for the
> thread pools. As an alternative implementation for the proposed PIP-232,
> I'd suggest that basic metrics (backlog / queue size, active thread count,
> number of executed tasks, etc)  are added for the thread pools. For
> example, Micrometer contains a decorator for many thread pool
> implementations,
> https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/jvm/ExecutorServiceMetrics.java
> . A similar solution would be very useful in Pulsar for adding the thread
> pool metrics.
>
> Tracking individual tasks requires more resources, and that's why I'd
> suggest adding the basic metrics and making them enabled by default. Some
> more advanced metrics would be useful, such as tracking the thread pool
> queue waiting time. Adding a low overhead thread pool queue waiting time
> could be done with a sampling approach. The benefit of that is that there
> won't be a need to wrap all tasks that are executed. There would be several
> ways to implement the queue waiting time metric.
>
> I assume that "blocking" itself might not be the problem and therefore
> having basic metrics (backlog size, active threads, executed tasks counter,
> failed tasks counter) for the thread pools is more essential.


I agree that we should start with these this low overhead metrics.
I have been spending much time in analyzing head dumps in order to see the
queues of the threadpools :)

Enrico


There's a lot of good things about the PIP-232 proposal and I believe that
> iterating on the ideas will propose a good outcome.
>
> -Lari
>
> On 2022/12/19 12:17:09 adobewjl wrote:
> > Hello pulsar community,
> > I've opened `PIP-232: Introduce thread monitor to check if thread is
> blocked for long time.` to discuss.
> > For more details, please read the PIP at
> https://github.com/apache/pulsar/issues/18985
> > I'm looking forward to hearing what you think.
> > Also the demo PR link at https://github.com/apache/pulsar/pull/18958
>

Re: [DISCUSS] PIP-232: Introduce thread monitor to check if thread is blocked for long time.

Posted by Heesung Sohn <he...@streamnative.io.INVALID>.
Hi,

This is a great idea. This will significantly help to debug production
issues caused by long-waiting threads.

I agree with Lari that we could start with the basic metrics.

Like Lari shared in the reference, we could simply collect the current
thread pool queue size like the following.

((ThreadPoolExecutor) executor).getQueue().size()

Also, we can get thread waited time and blocked time via JVM ThreadInfo.
We probably need to enable/disable this JVM ThreadInfo monitoring(thread
contention monitoring) dynamically since this monitoring can consume more
CPU cycles.

https://docs.oracle.com/en/java/javase/17/docs/api/java.management/java/lang/management/ThreadInfo.html

To be specific, some useful broker thread pool metrics might be

// enabled by default
xyzThreadPoolQueueSize(p50,p99,max)

// disabled by default(enabled by dynamic config)
xyzThreadPoolWaitedTime(p50,p99,max)
xyzThreadPoolWaitedCount(p50,p99,max)
xyzThreadPoolBlockedTime(p50,p99,max)
xyzThreadPoolBlockedCount(p50,p99,max)


Additionally,
https://docs.oracle.com/en/java/javase/17/docs/api/java.management/java/lang/management/ThreadInfo.html#getStackTrace()
ThreadInfo contains stack trace, which can help to narrow down the issue
code. (we can print this info in the broker logs.)

Thanks,
Heesung


On Mon, Jan 2, 2023 at 5:31 AM Lari Hotari <lh...@apache.org> wrote:

> This is an interesting proposal. However, I'd suggest changes to the
> current proposal.
>
> I think that the current proposal is too invasive for the Pulsar code
> base. "Introduce thread monitor to check if thread is blocking for long
> time." seems to mean multiple things.
> When looking at the PR, it seems to be a solution for detecting long
> running tasks. Just FYI, that Bookkeeper has a solution for this in it's
> OrderedExecutor with a setting called enableTaskExecutionStats=true . I'm
> not saying that it would be the preferred way to implement it.
>
> If the goal is to detect actual blocking code that is run with threads
> that should run only non-blocking code, there's a better tool called
> Reactor BlockHound (https://github.com/reactor/BlockHound) for that
> purpose.
> For actual profiling of the code base, Java Flight Recorder and Async
> Profiler are better solutions.
>
> It seems that one part of the problem is that there aren't metrics for the
> thread pools. As an alternative implementation for the proposed PIP-232,
> I'd suggest that basic metrics (backlog / queue size, active thread count,
> number of executed tasks, etc)  are added for the thread pools. For
> example, Micrometer contains a decorator for many thread pool
> implementations,
> https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/jvm/ExecutorServiceMetrics.java
> . A similar solution would be very useful in Pulsar for adding the thread
> pool metrics.
>
> Tracking individual tasks requires more resources, and that's why I'd
> suggest adding the basic metrics and making them enabled by default. Some
> more advanced metrics would be useful, such as tracking the thread pool
> queue waiting time. Adding a low overhead thread pool queue waiting time
> could be done with a sampling approach. The benefit of that is that there
> won't be a need to wrap all tasks that are executed. There would be several
> ways to implement the queue waiting time metric.
>
> I assume that "blocking" itself might not be the problem and therefore
> having basic metrics (backlog size, active threads, executed tasks counter,
> failed tasks counter) for the thread pools is more essential. There's a lot
> of good things about the PIP-232 proposal and I believe that iterating on
> the ideas will propose a good outcome.
>
> -Lari
>
> On 2022/12/19 12:17:09 adobewjl wrote:
> > Hello pulsar community,
> > I've opened `PIP-232: Introduce thread monitor to check if thread is
> blocked for long time.` to discuss.
> > For more details, please read the PIP at
> https://github.com/apache/pulsar/issues/18985
> > I'm looking forward to hearing what you think.
> > Also the demo PR link at https://github.com/apache/pulsar/pull/18958
>