Posted to dev@ignite.apache.org by Ivan Rakov <iv...@gmail.com> on 2019/10/04 13:09:44 UTC

Metric showing how many nodes may safely leave the cluster

Igniters,

I've seen numerous requests for an easy way to check whether it is safe 
to turn off a cluster node. As we know, Ignite's protection from sudden 
node shutdown is implemented by keeping several backup copies of each 
partition. However, this guarantee can be temporarily weakened when the 
cluster has recently experienced a node restart and the rebalancing 
process is still in progress.
An example scenario is restarting nodes one by one in order to update a 
local configuration parameter. The user restarts one node and rebalancing 
starts: once it completes, it will be safe to proceed (with backup 
count=1). However, there's no transparent way to determine whether 
rebalancing is over.
From my perspective, it would be very helpful to:
1) Add information about rebalancing and the number of free-to-go nodes 
to the ./control.sh --state command.
Examples of output:

> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> Cluster tag: new_tag
> --------------------------------------------------------------------------------
> Cluster is active
> All partitions are up-to-date.
> 3 node(s) can safely leave the cluster without partition loss.
>
> Cluster  ID: 125a6dce-74b1-4ee7-a453-c58f23f1f8fc
> Cluster tag: new_tag
> --------------------------------------------------------------------------------
> Cluster is active
> Rebalancing is in progress.
> 1 node(s) can safely leave the cluster without partition loss.
2) Provide the same information via ClusterMetrics. For example:
ClusterMetrics#isRebalanceInProgress // boolean
ClusterMetrics#getSafeToLeaveNodesCount // int

Here I need to mention that this information can be calculated from the 
existing rebalance metrics (see CacheMetrics#*rebalance*). However, I 
still think we need a simpler, more understandable flag showing whether 
the cluster is in danger of data loss. Another point is that the current 
metrics are bound to a specific cache, which makes this information even 
harder to analyze.
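
To make that concrete, here is a minimal sketch (not a proposed API) of 
the aggregation a user has to write today; it assumes cache statistics 
are enabled and that CacheMetrics#getRebalancingPartitionsCount reports 
partitions still pending rebalance:

import org.apache.ignite.Ignite;
import org.apache.ignite.cache.CacheMetrics;

public final class RebalanceCheck {
    /** Returns true if any cache still has partitions left to rebalance. */
    public static boolean isRebalanceInProgress(Ignite ignite) {
        for (String cacheName : ignite.cacheNames()) {
            // Cluster-wide aggregated metrics for this cache.
            CacheMetrics m = ignite.cache(cacheName).metrics();

            if (m.getRebalancingPartitionsCount() > 0)
                return true;
        }

        return false;
    }
}

A single ClusterMetrics#isRebalanceInProgress call would replace all of 
this boilerplate.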

Thoughts?

-- 
Best Regards,
Ivan Rakov


Re: Metric showing how many nodes may safely leave the cluster

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

That's a very useful metric, which we have already discussed in the past.
It might be called the "cluster backup factor" or "effective cache backup
factor". You can look up earlier mentions by searching the mailing list
archives.

Regards,
-- 
Ilya Kasnacheev



Re: Metric showing how many nodes may safely leave the cluster

Posted by Maxim Muzafarov <mm...@apache.org>.
Ivan,

1. I think the rebalance cache metrics should be deprecated and
removed (someday). Here is the issue [1] for doing exactly that.

2. I think #isRebalanceInProgress can and should be calculated by an
external monitoring system from per-node values, based on
#localMovingPartitionsCount > 0 (or the more precise
rebalancingPartitionsLeft value from issue [1]) gathered from each
online node. We should also provide such templates for each
monitoring system (Zabbix, Prometheus, etc.).

[1] https://issues.apache.org/jira/browse/IGNITE-12183
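
For illustration, a per-node probe behind such a template could look
roughly like the sketch below. It polls the local CacheGroupMetricsMXBean
instances over JMX; the bean group name and the attribute spelling are
assumptions to verify against the actual registration:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class MovingPartitionsProbe {
    /** Returns true if any cache group on the local node still has moving partitions. */
    public static boolean hasMovingPartitions() throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Match cache group beans in any Ignite JMX domain (the domain
        // depends on the Ignite instance name, hence the wildcard).
        ObjectName pattern = new ObjectName("*:group=\"Cache groups\",*");

        for (ObjectName name : srv.queryNames(pattern, null)) {
            Number moving = (Number)srv.getAttribute(name, "LocalNodeMovingPartitionsCount");

            if (moving.intValue() > 0)
                return true;
        }

        return false;
    }
}

A monitoring agent (Zabbix, the Prometheus JMX exporter, etc.) would
export this value per node and alert while any node reports true.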


Re: Metric showing how many nodes may safely leave the cluster

Posted by Ivan Rakov <iv...@gmail.com>.
https://issues.apache.org/jira/browse/IGNITE-12278

Best Regards,
Ivan Rakov


Re: Metric showing how many nodes may safely leave the cluster

Posted by Ivan Rakov <iv...@gmail.com>.
Denis, Alex,

Sure, the new metric will be integrated into the new metrics framework.
Let's not expose its value via control.sh right now. I'll create an issue 
for the aggregated "getMinimumNumberOfPartitionCopies" if everyone agrees.

Best Regards,
Ivan Rakov


Re: Metric showing how many nodes may safely leave the cluster

Posted by Denis Magda <dm...@apache.org>.
I'm for the proposal to add new JMX metrics and enhance the existing
tooling. But I would encourage us to integrate this into the new metrics
framework Nikolay has been working on. Otherwise, we will be deprecating
these JMX metrics in a short time frame in favor of the new monitoring APIs.

-
Denis



Re: Metric showing how many nodes may safely leave the cluster

Posted by Alexey Goncharuk <al...@gmail.com>.
I agree that we should have the ability to read any metric using simple
Ignite tooling. I am not sure visor.sh is a good fit - if I remember
correctly, it starts a daemon node, which bumps the topology version with
all the related consequences. I believe that in the long term it will be
beneficial to migrate all visor.sh functionality to a more lightweight
protocol, such as the one used in control.sh.

As for the metrics, the metric suggested by Ivan totally makes sense to
me - it is simple and, actually, quite critical. Manually selecting the
minimum of some metric across all cache groups is completely impractical.
A monitoring system, on the other hand, might not be available when the
metric is needed, or may not support aggregation.

--AG


Re: Metric showing how many nodes may safely leave the cluster

Posted by Ivan Rakov <iv...@gmail.com>.
Nikolay,

Many users start using Ignite with a small project and without 
production-level monitoring. When the proof of concept appears to be 
viable, they tend to expand Ignite usage by growing the cluster and 
adding the needed environment (including monitoring systems).
The inability to find out such a basic thing as whether the cluster will 
survive the next node crash may affect the overall product impression. 
We all want Ignite to be successful and widespread.

> Can you clarify what you mean, exactly?

Right now a user can access the metric mentioned by Alex and take the 
minimum across all cache groups. I want to highlight that not every user 
understands Ignite and its internals well enough to figure out that 
exactly this sequence of actions will lead to the desired answer.

> Can you clarify what you mean, exactly?
> We have a ticket [1] to support metrics output via visor.sh.
>
> My understanding: we should have an easy way to output metric values for each node in the cluster.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-12191
I propose to add a metric method for the aggregated 
"getMinimumNumberOfPartitionCopies" and expose it via control.sh.
My understanding: its result is critical enough to be accessible via a 
short path. I've started this topic due to a request from the user list, 
and I've heard many similar complaints before.
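
To pin down the semantics of that aggregation (names are illustrative, 
not a committed API): the cluster-wide value is the minimum of the 
per-group minima, and min - 1 nodes may leave safely:

public final class SafeToLeave {
    /**
     * Given each cache group's MinimumNumberOfPartitionCopies, returns
     * how many nodes can leave the cluster without partition loss.
     */
    public static int safeToLeaveNodesCount(int[] minCopiesPerGroup) {
        int min = Integer.MAX_VALUE;

        for (int copies : minCopiesPerGroup)
            min = Math.min(min, copies);

        // With at least 'min' copies of every partition, 'min - 1' owners
        // can disappear and one copy of each partition still remains.
        return min == Integer.MAX_VALUE ? 0 : Math.max(0, min - 1);
    }
}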

Best Regards,
Ivan Rakov


Re: Metric showing how many nodes may safely leave the cluster

Posted by Nikolay Izhikov <ni...@apache.org>.
Ivan.

> We shouldn't force users to configure external tools and write extra code for basic things.

Actually, I don't agree with you.
Having an external monitoring system for any production cluster is a *basic* thing.

Can you please define "basic things"?

> single method for the whole cluster

Can you clarify what you mean, exactly?
We have a ticket [1] to support metrics output via visor.sh.

My understanding: we should have an easy way to output metric values for each node in the cluster.

[1] https://issues.apache.org/jira/browse/IGNITE-12191



Re: Metric showing how many nodes may safely leave the cluster

Posted by Ivan Rakov <iv...@gmail.com>.
Max,

What if the user simply doesn't have a configured monitoring system?
Knowing whether the cluster will survive a node shutdown is critical for 
any administrator who performs any manipulations with the cluster topology.
Essential information should be easily accessible. We shouldn't force 
users to configure external tools and write extra code for basic things.

Alex,

Thanks, that's exactly the metric we need.
My point is that we should make it more accessible: via a control.sh 
command and a single method for the whole cluster.

Best Regards,
Ivan Rakov


Re: Metric showing how many nodes may safely leave the cluster

Posted by Alex Plehanov <pl...@gmail.com>.
Ivan, there already exists a metric,
CacheGroupMetricsMXBean#getMinimumNumberOfPartitionCopies, which shows
the current redundancy level for a cache group.
We can lose up to (getMinimumNumberOfPartitionCopies - 1) nodes without
data loss in that cache group.
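
For reference, a hedged sketch of the manual aggregation discussed in
this thread: take the minimum of that attribute over all cache group
MXBeans; min - 1 nodes can then leave without data loss. The JMX bean
group name is an assumption to verify:

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class ClusterRedundancy {
    /** Smallest number of partition copies across all cache groups. */
    public static int minPartitionCopies() throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Cache group beans are assumed to be registered under
        // group="Cache groups" in the Ignite JMX domain.
        ObjectName pattern = new ObjectName("*:group=\"Cache groups\",*");

        int min = Integer.MAX_VALUE;

        for (ObjectName name : srv.queryNames(pattern, null)) {
            Number copies = (Number)srv.getAttribute(name, "MinimumNumberOfPartitionCopies");

            min = Math.min(min, copies.intValue());
        }

        return min == Integer.MAX_VALUE ? 0 : min;
    }

    public static void main(String[] args) throws Exception {
        int copies = minPartitionCopies();

        System.out.println(Math.max(0, copies - 1)
            + " node(s) can safely leave the cluster without partition loss.");
    }
}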
