Posted to issues@solr.apache.org by "Jan Høydahl (Jira)" <ji...@apache.org> on 2021/03/29 22:54:00 UTC

[jira] [Commented] (SOLR-15300) Shard "state" flag is confusing and of limited value to outside consumers

    [ https://issues.apache.org/jira/browse/SOLR-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311024#comment-17311024 ] 

Jan Høydahl commented on SOLR-15300:
------------------------------------

Agree. Last week I tried to create a simple, generic Prometheus alert rule that fires whenever a collection has a shard whose intended replicationFactor is not satisfied. Something like:
 * Green - all OK: all replicas in all shards have state==active (and are represented in live_nodes).
 * Yellow - still operational, but replicationFactor is not satisfied at the moment. Would trigger a non-critical alert: "Shard N for collection C has a lower replicationFactor (A) than configured (B)."
 * Red - no replicas for a shard are active; they may be in any other state. Would trigger a critical alert: "Collection C is down. Shard N has no live replicas. Recovery is in progress."

Currently I cannot find a single metric that captures this. I have tried writing various jq expressions against the CLUSTERSTATE data, but it is quite hard to combine the configured replicationFactor with the actual number of active replicas, in a generic way for all shards of a collection, and fold the result into something alertable. So very much +1 to improving this situation.
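
To make the intended classification concrete, here is a minimal sketch in Python (not jq, and not an existing Solr metric). It assumes the cluster state has already been fetched and parsed into a dict shaped like the "cluster" object of a Collections API CLUSTERSTATUS response, i.e. with "live_nodes" and "collections" -> "shards" -> "replicas" entries carrying "state" and "node_name". The function name shard_health and the lowercase color strings are hypothetical, chosen only for illustration.

```python
from typing import Dict, Tuple


def shard_health(cluster: dict) -> Dict[Tuple[str, str], str]:
    """Map (collection, shard) to 'green', 'yellow' or 'red'.

    `cluster` is assumed to look like the "cluster" object of a
    CLUSTERSTATUS response; this shape is an assumption of the sketch.
    """
    live_nodes = set(cluster.get("live_nodes", []))
    result: Dict[Tuple[str, str], str] = {}
    for coll_name, coll in cluster.get("collections", {}).items():
        # replicationFactor may be serialized as a string; missing counts as 0.
        wanted = int(coll.get("replicationFactor") or 0)
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = list(shard.get("replicas", {}).values())
            # A replica counts only if it is active AND hosted on a live node.
            active = sum(
                1
                for r in replicas
                if r.get("state") == "active" and r.get("node_name") in live_nodes
            )
            if active == 0:
                status = "red"      # no usable replica for this shard
            elif active < wanted or active < len(replicas):
                status = "yellow"   # operational, but degraded
            else:
                status = "green"    # all replicas active, factor satisfied
            result[(coll_name, shard_name)] = status
    return result
```

The per-shard result could then be folded into one value per collection (worst shard wins) and exposed as a numeric gauge for alerting; how that gauge would be exported is left open here.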

Perhaps this collides a bit with the PRS effort, which aims to avoid touching state.json for replica state changes, so I am not sure.

> Shard "state" flag is confusing and of limited value to outside consumers
> -------------------------------------------------------------------------
>
>                 Key: SOLR-15300
>                 URL: https://issues.apache.org/jira/browse/SOLR-15300
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>
> The Solr API (and consequently the metric reporters, which are often used for Solr monitoring) reports the shard as being in ACTIVE state even when in reality its functionality is severely compromised (e.g. no replicas, all replicas down, or no leader).
> This reported state is technically correct because it is used only for tracking of the SPLITSHARD operations, as defined in {{Slice.State}}. However, this may be misleading and more often unhelpful than not - for constant monitoring a flag that actually reports impaired functionality of a shard would be more useful than a flag that reports a relatively uncommon SPLITSHARD operation.
> We could either redefine the meaning of the existing flag (and change its state according to some of the criteria I listed above), or add another flag to represent the "health" status of a shard. The value of this flag would then provide an easy way to monitor and to alert external systems of dangerous function impairment, without monitoring the state of all replicas of a collection.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
