Posted to issues@storm.apache.org by "Daniel Kimsey (Jira)" <ji...@apache.org> on 2022/06/23 15:47:00 UTC

[jira] [Created] (STORM-3874) Nimbus status/metrics don't expose critical blobstore sync status information

Daniel Kimsey created STORM-3874:
------------------------------------

             Summary: Nimbus status/metrics don't expose critical blobstore sync status information
                 Key: STORM-3874
                 URL: https://issues.apache.org/jira/browse/STORM-3874
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Daniel Kimsey


As an operator, I'd like to know when a nimbus node is eligible for leadership so that I can check that status during instance replacement and health checks.

When performing a rolling deployment, it's not possible to tell when a new node is "fully ready" and able to take on leadership. New nodes will refuse to become leaders until they have synced their blobstore. This can brick the cluster if the old instances are replaced before any new node has fully synced from the current leader.

There is a gap between when a nimbus node joins and when it becomes eligible to be a leader. As far as I can tell, this gap is not exposed: there are only three status values, "Not a Leader", "Leader", and "Dead". Nor can I find anything in the REST API (or any other metrics) that hints at the sync status (say, bytes in blob-cache per nimbus, a sync "index", or an mtime).

Example
1. Given a stable 3 node nimbus cluster.
2. Deploy some topologies of non-trivial size.
3. Replace nimbuses in a rolling fashion, waiting for each new node to show "Not a Leader" status in {{/api/v1/nimbus/summary}} before moving on.
4. Cluster dies. The new nimbuses refuse to become leader.

I propose a new status, "Syncing" or "Ineligible". Nimbuses in this status are not yet ready to become leaders and a healthy instance will transition to "Not a Leader" shortly. A new instance should not be considered ready until it has transitioned to the "Not a Leader" status.
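
To illustrate how a deployment health check could use the proposed status, here is a minimal sketch. It assumes the {{/api/v1/nimbus/summary}} response is JSON with a "nimbuses" list whose entries carry "host" and "status" fields, and that one of the proposed names ("Syncing" or "Ineligible") were adopted. The host names, the sample payload, and the function name are hypothetical, and the proposed statuses do not exist in Storm today.

```python
import json

# Statuses that would mean "safe to continue the rolling replacement".
# "Leader" and "Not a Leader" are the existing values named in this report.
READY_STATUSES = {"Leader", "Not a Leader"}


def nimbus_ready(summary, host):
    """Return True if `host` appears in the summary with a ready status.

    A host that is missing, "Dead", or in one of the proposed
    "Syncing"/"Ineligible" statuses is treated as not ready.
    """
    for nimbus in summary.get("nimbuses", []):
        if nimbus.get("host") == host:
            return nimbus.get("status") in READY_STATUSES
    return False


# Hypothetical payload, shaped like the assumed summary response.
sample = json.loads("""
{
  "nimbuses": [
    {"host": "nimbus-1", "status": "Leader"},
    {"host": "nimbus-2", "status": "Not a Leader"},
    {"host": "nimbus-3", "status": "Syncing"}
  ]
}
""")

for name in ("nimbus-1", "nimbus-2", "nimbus-3", "nimbus-4"):
    print(name, nimbus_ready(sample, name))
```

With only the three statuses available today, the same check would report a still-syncing node as ready, because it already shows "Not a Leader"; that is the gap described above.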


