Posted to issues@ratis.apache.org by "Tsz-wo Sze (Jira)" <ji...@apache.org> on 2022/10/10 08:10:00 UTC

[jira] [Updated] (RATIS-1709) Support specify ThreadGroup for Daemon threads

     [ https://issues.apache.org/jira/browse/RATIS-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze updated RATIS-1709:
------------------------------
    Fix Version/s: 2.4.0

> Support specify ThreadGroup for Daemon threads
> ----------------------------------------------
>
>                 Key: RATIS-1709
>                 URL: https://issues.apache.org/jira/browse/RATIS-1709
>             Project: Ratis
>          Issue Type: Improvement
>          Components: server
>            Reporter: Jiacheng Liu
>            Assignee: Jiacheng Liu
>            Priority: Major
>             Fix For: 2.4.0, 3.0.0
>
>         Attachments: 733_review.patch
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> In Ratis, many threads are created manually using the `Daemon` class. If such a thread hits an uncaught exception, it dies silently and no other component is notified. If the thread is a critical component, part of the RaftServer is effectively down while the RaftServer's lifecycle state is still RUNNING (it is never set to EXCEPTION because the thread had no chance to do so).
> One example where this can happen is [https://github.com/apache/ratis/pull/417/files]. Before that change, the StateMachineUpdater thread could throw an NPE and exit, leaving the follower RaftServer stale forever. The RaftServer's lifecycle state remains RUNNING, so an external party cannot detect the failure through `RaftServer.getLifeCycleState()`.
> The proposal is to improve the observability of RaftServer so that an uncaught exception is caught and propagated to the external user, in two parts:
>  # All `Daemon` threads should have an UncaughtExceptionHandler set.
>  # The application defines the UncaughtExceptionHandler through RaftServer.Builder when creating the RaftServer. The RaftServer then propagates the handler to each Daemon thread as it creates it.
> So external users can do the following:
> {code:java}
> AtomicBoolean raftCrashed = new AtomicBoolean(false);
> AtomicReference<Throwable> raftError = new AtomicReference<>(null);
> RaftServer server = RaftServer.newBuilder()
>   .setUncaughtExceptionHandler((thread, ex) -> {
>     raftCrashed.set(true);
>     raftError.set(ex);
>   }).build();
> // Periodically check
> if (raftCrashed.get()) {
>   LOG.error("RaftServer crashed", raftError.get());
> }{code}
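
For illustration only, here is a minimal sketch of the ThreadGroup mechanism named in the issue title, using only the JDK and no Ratis classes. It assumes that daemon threads are created inside a shared ThreadGroup whose uncaughtException method is overridden. The names ThreadGroupHandlerSketch, RecordingThreadGroup, and runFailingDaemon are hypothetical and are not part of the Ratis API; the actual change may be structured differently.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: plain JDK only, not the Ratis implementation.
class ThreadGroupHandlerSketch {

  /** A ThreadGroup that records the first uncaught exception thrown by any of its threads. */
  static class RecordingThreadGroup extends ThreadGroup {
    private final AtomicReference<Throwable> firstError = new AtomicReference<>();

    RecordingThreadGroup(String name) {
      super(name);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
      // The JVM calls this when a thread in the group dies from an uncaught
      // exception and the thread has no handler of its own.
      firstError.compareAndSet(null, e);
    }

    Throwable getFirstError() {
      return firstError.get();
    }
  }

  /** Starts a daemon thread that fails immediately and returns what the group recorded. */
  static Throwable runFailingDaemon(String message) {
    RecordingThreadGroup group = new RecordingThreadGroup("daemon-group");
    Thread worker = new Thread(group, () -> {
      throw new IllegalStateException(message);
    }, "failing-daemon");
    worker.setDaemon(true);
    worker.start();
    try {
      // The handler runs in the dying thread, so it has finished once join() returns.
      worker.join();
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
    return group.getFirstError();
  }

  public static void main(String[] args) {
    Throwable error = runFailingDaemon("simulated failure");
    System.out.println("Recorded error: " + error);
  }
}
```

Because the group acts as the fallback handler for every thread created in it, a thread that never sets its own handler is still covered, which is one possible reason to configure this at the ThreadGroup level rather than per thread.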



--
This message was sent by Atlassian Jira
(v8.20.10#820010)