Posted to dev@ignite.apache.org by Zhenya Stanilovsky <ar...@mail.ru.INVALID> on 2019/07/24 06:11:10 UTC

Re[2]: Partition map exchange metrics

+1 to Anton's decisions.


>Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
>
>Folks,
>
>It looks like we're trying to implement "extended debug" instead of
>"monitoring".
>A real admin should not need to care which phase of PME is in
>progress and so on.
>The metrics of interest are:
>- total blocked time (will be used for real SLA accounting)
>- are we blocked right now (shows we have an SLA degradation right now)
>The duration of the current blocking period can easily be derived by any
>modern monitoring tool through regular checks.
>The first "true" marks the period start; the precision is determined by
>the check frequency.
>Anyway, I'm OK with presenting the current metric as a long, where the long
>is a duration. I see no reason for it, but OK :)
>
>All the other features you mentioned are useful for improving code or
>deployments and can (should) be taken from logs at the analysis
>phase.
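
A minimal sketch of the polling approach described above, in Python: a monitoring tool samples a boolean "blocked" flag at regular intervals and derives the blocking periods itself. The function name, the sample format and the timestamps are assumptions made for illustration; nothing here is an Ignite or monitoring-product API.

```python
def blocking_periods(samples):
    """Derive blocking periods from polled (timestamp_ms, blocked) samples.

    The first True sample marks the period start and the first following
    False sample marks its end, so precision is bounded by the poll interval.
    """
    periods = []
    start = None
    for ts, blocked in samples:
        if blocked and start is None:
            start = ts                    # first "true": period start
        elif not blocked and start is not None:
            periods.append((start, ts))   # first "false": period end
            start = None
    if start is not None:
        periods.append((start, None))     # still blocked at the last check
    return periods


# Hypothetical checks every 1000 ms: blocked at 2000 and 3000, released by 4000.
samples = [(1000, False), (2000, True), (3000, True), (4000, False)]
periods = blocking_periods(samples)
total_blocked_ms = sum(end - start for start, end in periods if end is not None)
```

With this scheme a block that starts and ends between two checks is never observed, which is the trade-off of deriving durations on the monitoring side.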
>
>On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
>
>> Folks, let me step in.
>>
>> Nikita, thanks for your suggestions!
>>
>> > 1. initialVersion. Topology version that initiates the exchange.
>> > 2. initTime. Time PME was started.
>> > 3. initEvent. Event that triggered PME.
>> > 4. partitionReleaseTime. Time when a node has finished waiting for all
>> > updates and transactions on a previous topology.
>> > 5. sendSingleMessageTime. Time when a node sent a single message.
>> > 6. recieveFullMessageTime. Time when a node received a full message.
>> > 7. finishTime. Time PME was ended.
>> >
>> > When a new PME starts, all these metrics are reset.
>> Every metric from Nikita's list looks useful and simple to implement.
>> I think that it would be better to change the format of metrics 4, 5, 6 and
>> 7 a bit: we can keep only the difference between the time of the previous
>> event and the time of the corresponding event. Such metrics would be easier
>> to perceive: they answer specific questions such as "how much time did
>> partition release take?" or "how much time did awaiting the end of the
>> distributed phase take?".
>> Also, if the results of 4, 5, 6 and 7 are exported to a monitoring system,
>> graphs will show how the stage times change from one PME to another.
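
A small Python sketch of the suggestion above: turning absolute stage timestamps into per-stage durations by subtracting the previous event's time. The function name and the sample values are hypothetical; only the metric names come from the list in this thread.

```python
def stage_durations(timestamps):
    """Turn absolute stage timestamps (ms) into per-stage durations (ms).

    `timestamps` is an ordered list of (name, time_ms) pairs; each duration
    is the difference between a stage's time and the previous event's time.
    """
    durations = {}
    for (_, prev_ts), (name, ts) in zip(timestamps, timestamps[1:]):
        durations[name] = ts - prev_ts
    return durations


# Hypothetical timestamps for one PME, using the names from the list above.
pme = [
    ("initTime", 10_000),
    ("partitionReleaseTime", 10_400),
    ("sendSingleMessageTime", 10_450),
    ("recieveFullMessageTime", 11_200),
    ("finishTime", 11_250),
]
durations = stage_durations(pme)
```

Each resulting value answers one question directly, for example how long partition release took, and can be graphed per PME.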
>>
>> > When PME cause no blocking, it's a good PME and I see no reason to have
>> > monitoring related to it
>> Agree with Anton here. These metrics should be measured only for true
>> distributed exchange. Saving results for client leave/join PMEs will
>> just complicate monitoring.
>>
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> > useful.
>> > It gives more information while semantic is left the same.
>> Totally agree with Pavel here. Both "accumulated block time" and
>> "current PME block time" metrics are useful. Growth of the accumulated
>> metric over a specific period of time (should be easy to check via a
>> monitoring system graph) will show for how long business operations were
>> blocked in total, and a non-zero current metric will show that we are
>> experiencing issues right now. The boolean metric "are we blocked right now"
>> is not needed as it can obviously be inferred from "current PME block
>> time".
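
A minimal Python sketch of the two metrics discussed above, assuming a node-local tracker that is told when blocking starts and ends. The class and method names are hypothetical and do not mirror Ignite's implementation; the clock is injectable so the behaviour can be checked without waiting.

```python
import time


def _monotonic_ms():
    # Monotonic clock in whole milliseconds.
    return time.monotonic_ns() // 1_000_000


class BlockingTracker:
    """Tracks current and accumulated cache-operation blocking time (ms)."""

    def __init__(self, clock=_monotonic_ms):
        self._clock = clock      # injectable for tests
        self._start = None       # None while operations are not blocked
        self.total_ms = 0        # accumulated over finished blocking periods

    def block_started(self):
        if self._start is None:
            self._start = self._clock()

    def block_finished(self):
        if self._start is not None:
            self.total_ms += self._clock() - self._start
            self._start = None

    def current_ms(self):
        """Current blocking duration, or 0 if nothing is blocked right now."""
        if self._start is None:
            return 0
        return self._clock() - self._start
```

A non-zero `current_ms()` implies that operations are blocked; in the first millisecond of a block it can still read 0, which is one reason a monitoring side may prefer a threshold over an exact zero check.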
>>
>> Best Regards,
>> Ivan Rakov
>>
>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>> > Nikita,
>> >
>> > I agree with total blocking duration metric but
>> > I still don't understand why instant value indicating that operations are
>> > blocked should be boolean.
>> > Duration time since blocking has started looks more appropriate and
>> > useful.
>> > It gives more information while semantic is left the same.
>> >
>> >
>> >
>> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com >:
>> >
>> >> Folks,
>> >>
>> >> All previous suggestions have some disadvantages. There can be several
>> >> exchanges between two metric updates, and a fast exchange can overwrite
>> >> a previous long exchange.
>> >>
>> >> We can introduce a metric of total blocking duration that will
>> >> accumulate at the end of the exchange. So, users will get actual
>> >> information about how long operations were blocked. The cluster metric
>> >> will be the maximum of the local node metrics. And we need a boolean metric
>> >> that will indicate the real-time status. It is needed because the duration
>> >> metric is updated only at the end of the exchange.
>> >>
>> >> So I propose to change the current metric, which is not released yet, to the
>> >> totalCacheOperationsBlockingDuration metric and to add the
>> >> isCacheOperationsBlocked metric.
>> >>
>> >> WDYT?
>> >>
>> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < av@apache.org >:
>> >>> Nikolay,
>> >>>
>> >>> Still see no reason to replace boolean with long.
>> >>>
>> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < nizhikov@apache.org > wrote:
>> >>>> Anton.
>> >>>>
>> >>>> 1. The value is exported according to SPI settings, not at the moment it changes.
>> >>>>
>> >>>> 2. Clock synchronisation - if we export the start time, we should also
>> >>>> export the node-local timestamp.
>> >>>>
>> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov < av@apache.org >:
>> >>>>
>> >>>>> Folks,
>> >>>>>
>> >>>>> What's the reason for duration counting?
>> >>>>> AFAIU, it's a monitoring system feature to count the durations.
>> >>>>> Since the monitoring system checks metrics periodically, it will know the
>> >>>>> duration from its own log.
>> >>>>>
>> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < jokserfn@gmail.com > wrote:
>> >>>>>
>> >>>>>> Nikita,
>> >>>>>>
>> >>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest
>> >>>>>> "cacheOperationsBlockingDuration"; I think it more clearly represents
>> >>>>>> what is blocked during PME.
>> >>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs"
>> >>>>>> and the duration to better correlate when cache operations were blocked
>> >>>>>> and how much time it took.
>> >>>>>> For an instant view (like in a JMX bean) a calculated value, as you
>> >>>>>> mentioned, can be used.
>> >>>>>> For metrics that are exported to some backend (IEP-35) a counter can be
>> >>>>>> used.
>> >>>>>> The counter is incremented by the blocking time after blocking has ended.
>> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < nsamelchev@gmail.com >:
>> >>>>>>> Pavel,
>> >>>>>>>
>> >>>>>>> The main purpose of this metric is
>> >>>>>>>>> how much time we wait for resuming cache operations
>> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>> >>>>>>>>> What do you think if we change the boolean value of metric to a long
>> >>>>>>>>> value that represents time in milliseconds when operations were blocked?
>> >>>>>>> This time can be calculated as (currentTime -
>> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
>> >>>>>>>
>> >>>>>>> Duration will be more understandable. It'll be something like
>> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
>> >>>>>>> name yet.
>> >>>>>>>
>> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < jokserfn@gmail.com >:
>> >>>>>>>> Nikita,
>> >>>>>>>>
>> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The main
>> >>>>>>>> PME side effect for end-users is blocking cache operations. Not all PME
>> >>>>>>>> time blocks them.
>> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give
>> >>>>>>>> to an end-user? What analysis can it be used for, and how?
>> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev < nsamelchev@gmail.com >:
>> >>>>>>>>> Hi Pavel,
>> >>>>>>>>>
>> >>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
>> >>>>>>>>> new isOperationsBlockedByPme metrics.
>> >>>>>>>>>
>> >>>>>>>>> As an alternative solution, I can rework the recently added
>> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless
>> >>>>>>>>> for users in case of a non-blocking PME.
>> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>> >>>>>>>>> blocking started (minimal value across cluster nodes) and 0 if blocking
>> >>>>>>>>> has ended (there is no running PME).
>> >>>>>>>>>
>> >>>>>>>>> WDYT?
>> >>>>>>>>>
>> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko < jokserfn@gmail.com >:
>> >>>>>>>>>> Hi Nikita,
>> >>>>>>>>>>
>> >>>>>>>>>> Thank you for working on this. What do you think if we change the
>> >>>>>>>>>> boolean value of metric to a long value that represents time in
>> >>>>>>>>>> milliseconds when operations were blocked?
>> >>>>>>>>>> Since we have not only JMX and now metrics are periodically exported
>> >>>>>>>>>> to some backend, it can give a clearer picture of how much time we
>> >>>>>>>>>> wait for resuming cache operations instead of an instant boolean
>> >>>>>>>>>> indicator.
>> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev < nsamelchev@gmail.com >:
>> >>>>>>>>>>> Anton, Nikolay,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the support.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not
>> >>>>>>>>>>> show the influence on the cluster correctly. A PME can happen without
>> >>>>>>>>>>> blocking operations, for example on client node join/leave events.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme(). Together,
>> >>>>>>>>>>> these metrics will show the influence of the PME on cluster and user
>> >>>>>>>>>>> operations.
>> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone take
>> >>>>>>>>>>> a look?
>> >>>>>>>>>>>
>> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>
>> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov < nizhikov@apache.org >:
>> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
>> >>>>>>>>>>>> monitor all Ignite processes, including non-blocking PME.
>> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>> BTW,
>> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
>> >>>>>>>>>>>>> Seems it shows exactly the PME time and is not so useful because of
>> >>>>>>>>>>>>> this.
>> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
>> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see no reason to have
>> >>>>>>>>>>>>> monitoring related to it :)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov < nizhikov@apache.org > wrote:
>> >>>>>>>>>>>>>> Anton.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Why do we need to postpone implementation of these metrics?
>> >>>>>>>>>>>>>> For now, implementation of a new metric is very simple.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
>> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>> >>>>>>>>>>>>>>> Nikita,
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Looks like all we need now is 1 simple metric: are operations blocked?
>> >>>>>>>>>>>>>>> Just a true or false.
>> >>>>>>>>>>>>>>> Let's start from this.
>> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be implemented
>> >>>>>>>>>>>>>>> later.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov < nizhikov@apache.org > wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> +1.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev < nsamelchev@gmail.com >:
>> >>>>>>>>>>>>>>>>> Hello, Igniters.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map exchange
>> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages is available only in log
>> >>>>>>>>>>>>>>>>> files and cannot be obtained using JMX or other external tools. [1]
>> >>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the
>> >>>>>>>>>>>>>>>>> actual status of the current PME:
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
>> >>>>>>>>>>>>>>>>> updates and transactions on a previous topology.
>> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> These metrics help to understand:
>> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
>> >>>>>>>>>>>>>>>>> - how long the wait for all updates to complete was.
>> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
>> >>>>>>>>>>>>>>>>> - what triggered PME.
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thoughts?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Best wishes,
>> >>>>>>>>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> --
>> >>>>>>>>>>> Best wishes,
>> >>>>>>>>>>> Amelchev Nikita
>> >>>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best wishes,
>> >>>>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Best wishes,
>> >>>>>>> Amelchev Nikita
>> >>>>>>>
>> >>
>> >>
>> >> --
>> >> Best wishes,
>> >> Amelchev Nikita
>> >>
>>


-- 
Zhenya Stanilovsky

Re: Partition map exchange metrics

Posted by Pavel Kovalenko <jo...@gmail.com>.

Nikolay,

Looks like the final resolution. +1.

Fri, 26 Jul 2019 at 12:08, Nikolay Izhikov <ni...@apache.org>:

> Pavel.
>
> > I just want to add that currentPmeTime is also useful for alerting systems,
> > not only for eye observing
>
> Fully agree.
>
> Let me make it as clear as I can.
> In the end we should have 4 metrics:
>
> `CurrentPMEDuration` - existing metric, shows current PME duration.
> `CurrentPMECacheOperationsBlockedDuration` - new long metric, shows the
> blocking duration of the current PME.
>
> `PMEDuration` - histogram of full PME durations.
> `PMECacheOperationsBlockedDuration` - histogram of blocking PME durations.
>
> On Thu, 25/07/2019 at 22:40 +0300, Pavel Kovalenko wrote:
> > Nikolay,
> >
> > Okay, sounds reasonable.
> > I just want to add that currentPmeTime is also useful for alerting systems,
> > not only for eye observing. If the time becomes too long and exceeds some
> > threshold, firing an appropriate alert can help to detect a critical
> > problem early.
> >
> > > On Thu, 25 Jul 2019 at 21.12, Nikolay Izhikov <ni...@apache.org> wrote:
> > >
> > > > I think the exact time should be obtained from logs, shouldn't it?
> > > >
> > > >
> > > > Thu, 25 Jul 2019, 20:00 Pavel Kovalenko <jo...@gmail.com>:
> > >
> > > > Nikolay,
> > > >
> > > > > Yes, I have had a chance to look at HistogramMetric and moreover reviewed
> > > > > it) My question was mostly about what exactly we will track in the
> > > > > histogram.
> > > > > If we use a histogram, do you know how we can find the exact time when,
> > > > > e.g., a PME with time > 1s happened?
> > > > >
> > > > > Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov <ni...@apache.org>:
> > > >
> > > > > Pavel
> > > > >
> > > > > > Have you had a chance to look at the HistogramMetric source?
> > > > > > It's in master now.
> > > > > > Looking at the source would be better than my explanation)
> > > > > >
> > > > > > We should count PME processes that block operations for some amount of
> > > > > > time. For example [less than 50, less than 250, less than 1000, more than
> > > > > > 1000] millis.
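
A minimal Python sketch of the bucket counting described above, assuming bounds of 50, 250 and 1000 ms. It is an illustration only: the class name is hypothetical and the handling of values that fall exactly on a bound is an assumption, not a statement about Ignite's HistogramMetric.

```python
from bisect import bisect_right


class DurationHistogram:
    """Counts durations per bucket: <50, <250, <1000, >=1000 ms by default."""

    def __init__(self, bounds_ms=(50, 250, 1000)):
        self.bounds = tuple(bounds_ms)
        # One counter per bound plus one overflow bucket.
        self.counts = [0] * (len(self.bounds) + 1)

    def add(self, duration_ms):
        # bisect_right puts a value equal to a bound into the next bucket,
        # so 50 ms is counted as "less than 250", not "less than 50".
        self.counts[bisect_right(self.bounds, duration_ms)] += 1


hist = DurationHistogram()
for blocked_ms in (10, 50, 249, 1000, 5000):   # hypothetical PME block times
    hist.add(blocked_ms)
```

Such a histogram shows how many blocking PMEs fell into each range but not when they happened, so the exact time of a long PME still has to come from another source such as logs.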
> > > > >
> > > > > > Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <jo...@gmail.com>:
> > > > > >
> > > > > > > Nikolay,
> > > > > > >
> > > > > > > Could you please explain in more detail what the structure of the PME
> > > > > > > histogram will be?
> > > > > > >
> > > > > > > Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > >
> > > > > > > Hello, Nikita.
> > > > > > >
> > > > > > > I think
> > > > > > >
> > > > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > > > all blocking durations that happen after node starts.
> > > > > > > >
> > > > > > > > No, we don't need it.
> > > > > > > >
> > > > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > > >
> > > > > > > > Yes, we need it.
> > > > > > > >
> > > > > > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > > > > Igniters,
> > > > > > > >
> > > > > > > > > Everyone wants to see the cacheOperationsBlockedDuration metric that will
> > > > > > > > > show the current blocking duration or 0 if there is no blocking right now.
> > > > > > > > >
> > > > > > > > > Do we need the following metrics? It seems one of them will be superfluous.
> > > > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > > > all blocking durations that happen after node starts.
> > > > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > > > > User will be able to configure bounds.
> > > > > > > > >
> > > > > > > > > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > >
> > > > > > > > > Guys.
> > > > > > > > >
> > > > > > > > > I think we should go with the 2 metrics
> > > > > > > > >
> > > > > > > > > >         * current PME duration (resets on finish)
> > > > > > > > > >
> > > > > > > > > >                 This metric is required for alerting (or automatic
> > > > > > > > > >                 actions) on long PME.
> > > > > > > > > >
> > > > > > > > > >         * PME duration histogram (value added to metrics on PME finish)
> > > > > > > > > >                 This metric is required for:
> > > > > > > > > >                         * Quick PME trend analysis
> > > > > > > > > >                         * Quick PME history analysis
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > > > > Nikita and Maxim,
> > > > > > > > > >
> > > > > > > > > > > > What if we just update the current metric getCurrentPmeDuration behaviour
> > > > > > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > > > > > >
> > > > > > > > > > > > No other changes will be required.
> > > > > > > > > > >
> > > > > > > > > > > WDYT?
> > > > > > > > > >
> > > > > > > > > > I agree with these two metrics. I also think that current
> > > > > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > > > > >
> > > > > > > > > > Anton,
> > > > > > > > > >
> > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > > > > > progress and so on.
> > > > > > > > > >
> > > > > > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > > > > > > node should be killed.
> > > > > > > > > >
> > > > > > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > > > > > information from logs, but:
> > > > > > > > > > > - It's more resource intensive as it requires parsing logs all the time
> > > > > > > > > > > - It's less reliable as log messages may change
> > > > > > > > > >
> > > > > > > > > > Best Regards,
> > > > > > > > > > Ivan Rakov
> > > > > > > > > >
> > > > > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > +1 with Anton post.
> > > > > > > > > > >
> > > > > > > > > > > > What if we just update the current metric getCurrentPmeDuration behaviour
> > > > > > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > > > > > >
> > > > > > > > > > > > No other changes will be required.
> > > > > > > > > > > >
> > > > > > > > > > > > WDYT?
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > Nikolay,
> > > > > > > > > > > >
> > > > > > > > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > > > > > > > duration or 0 if there is no blocking right now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > > > > > blocking durations that happen after node starts.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > > > > > > > Nikita
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What is the difference between those two metrics?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I will prepare a PR shortly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ср, 24 июл. 2019 г. в 09:11, Zhenya Stanilovsky
> > > > > > >
> > > > > > > <arzamas123@mail.ru.invalid
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +1 with Anton decisions.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Среда, 24 июля 2019, 8:44 +03:00 от Anton
> > > >
> > > > Vinogradov
> > > > > <
> > > > > > > av@apache.org>:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It looks like we're trying to implement
> "extended
> > > > > >
> > > > > > debug"
> > > > > > > instead of
> > > > > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > > > > It should not be interesting for real admin
> what
> > > > >
> > > > > phase
> > > > > > > of PME is in
> > > > > > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > > > > > Interested metrics are
> > > > > > > > > > > > > > > > - total blocked time (will be used for real
> SLA
> > > > > >
> > > > > > counting)
> > > > > > > > > > > > > > > > - are we blocked right now (shows we have an
> SLA
> > > > > > >
> > > > > > > degradation right now)
> > > > > > > > > > > > > > > > Duration of the current blocking period can
> be
> > > >
> > > > easily
> > > > > > > presented using
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > > > > > > Initial true will means "period start",
> precision
> > > > >
> > > > > will
> > > > > > > be a result of
> > > > > > > > > > > > > > > > checks frequency.
> > > > > > > > > > > > > > > > Anyway, I'm ok to have current metric
> presented
> > > >
> > > > with
> > > > > > > long, where long
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > is a
> > > > > > > > > > > > > > > > duration, see no reason, but ok :)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All other features you mentioned are useful
> for
> > > >
> > > > code
> > > > > or
> > > > > > > > > > > > > > > > deployment improving and can (should) be
> taken
> > >
> > > from
> > > > > > logs
> > > > > > > at the analysis
> > > > > > > > > > > > > > > > phase.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <
> > > > > > >
> > > > > > > ivan.glukos@gmail.com >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that
> > > > >
> > > > > initiates
> > > > > > > the exchange.
> > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a
> node has
> > > > > > >
> > > > > > > finished waiting for
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > updates and translations on a previous
> > > >
> > > > topology.
> > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a
> node
> > > >
> > > > sent a
> > > > > > > single message.
> > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a
> node
> > > > > >
> > > > > > received
> > > > > > > a full message.
> > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > When new PME started all these metrics
> > >
> > > resets.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Every metric from Nikita's list looks
> useful
> > >
> > > and
> > > > > > > simple to implement.
> > > > > > > > > > > > > > > > > I think that it would be better to change
> > >
> > > format
> > > > of
> > > > > > > metrics 4, 5, 6
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > 7 a bit: we can keep only difference
> between
> > >
> > > time
> > > > > of
> > > > > > > previous event
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > time of corresponding event. Such metrics
> would
> > > >
> > > > be
> > > > > > > easier to perceive:
> > > > > > > > > > > > > > > > > they answer to specific questions "how much
> > >
> > > time
> > > > > did
> > > > > > > partition release
> > > > > > > > > > > > > > > > > take?" or "how much time did awaiting of
> > > > >
> > > > > distributed
> > > > > > > phase end take?".
> > > > > > > > > > > > > > > > > Also, if results of 4, 5, 6, 7 will be
> exported
> > > >
> > > > to
> > > > > > > monitoring system,
> > > > > > > > > > > > > > > > > graphs will show how different stages times
> > > >
> > > > change
> > > > > > > from one PME to
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > another.
> > > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good
> PME
> > > >
> > > > and I
> > > > > > > see no reason to
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Agree with Anton here. These metrics
> should be
> > > > > > >
> > > > > > > measured only for true
> > > > > > > > > > > > > > > > > distributed exchange. Saving results for
> client
> > > > > > >
> > > > > > > leave/join PMEs will
> > > > > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I agree with total blocking duration
> metric
> > >
> > > but
> > > > > > > > > > > > > > > > > > I still don't understand why instant
> value
> > > > > > >
> > > > > > > indicating that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > operations are
> > > > > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > > > > Duration time since blocking has started
> > >
> > > looks
> > > > > more
> > > > > > > appropriate and
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > > > > > It gives more information while semantic
> is
> > > >
> > > > left
> > > > > > the
> > > > > > > same.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Totally agree with Pavel here. Both
> > >
> > > "accumulated
> > > > > > block
> > > > > > > time" and
> > > > > > > > > > > > > > > > > "current PME block time" metrics are
> useful.
> > > >
> > > > Growth
> > > > > > of
> > > > > > > accumulated
> > > > > > > > > > > > > > > > > metric for specific period of time (should
> be
> > > >
> > > > easy
> > > > > to
> > > > > > > check via
> > > > > > > > > > > > > > > > > monitoring system graph) will show for how
> much
> > > > > > >
> > > > > > > business operations
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > were
> > > > > > > > > > > > > > > > > blocked in total, and non-zero current
> metric
> > > >
> > > > will
> > > > > > > show that we are
> > > > > > > > > > > > > > > > > experiencing issues right now. Boolean
> metric
> > > >
> > > > "are
> > > > > we
> > > > > > > blocked right
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > now"
> > > > > > > > > > > > > > > > > is not needed as it's obviously can be
> inferred
> > > > >
> > > > > from
> > > > > > > "current PME
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > block
> > > > > > > > > > > > > > > > > time".
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > > > > > > > I still don't understand why instant value indicating that operations are blocked should be boolean.
> > > > > > > > > > > > > > > > > > > Duration time since blocking has started looks more appropriate and useful.
> > > > > > > > > > > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > All previous suggestions have some disadvantages. There can be several exchanges between two metric updates, and a fast exchange can overwrite the value of a previous long exchange.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > We can introduce a metric of total blocking duration that will accumulate at the end of the exchange. So, users will get actual information about how long operations were blocked. The cluster metric will be a maximum of the local node metrics.
> > > > > > > > > > > > > > > > > > > > And we need a boolean metric that will indicate realtime status. It is needed because the duration metric is updated only at the end of the exchange.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > So I propose to change the current metric, which is not released yet, to the totalCacheOperationsBlockingDuration metric and to add the isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <av@apache.org>:
> > > > > > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > 1. The value is exported based on SPI settings, not at the moment it changes.
> > > > > > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export start time, we should also export node local timestamp.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov <av@apache.org>:
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > > > > > > > > > > > Since a monitoring system checks metrics periodically, it will know the duration from its own log.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the metric name, I suggest "cacheOperationsBlockingDuration", I think it more clearly represents what is blocked during PME.
> > > > > > > > > > > > > > > > > > > > > > > > We can also combine both timestamp "cacheOperationsBlockingStartTs" and duration to have better correlation between when cache operations were blocked and how much time it has taken.
> > > > > > > > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a calculated value as you mentioned can be used.
> > > > > > > > > > > > > > > > > > > > > > > > For metrics that are exported to some backend (IEP-35) a counter can be used.
> > > > > > > > > > > > > > > > > > > > > > > > The counter is incremented by blocking time after blocking has ended.
> > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration here?
> > > > > > > > > > > > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > This time can be calculated as (currentTime - timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like getCurrentBlockingPmeDuration. But I haven't come up with a better name yet.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful information. The main PME side effect for end-users is blocking cache operations. Not all of the PME time blocks them.
> > > > > > > > > > > > > > > > > > > > > > > > > > What information does the "timeSinceOperationsBlocked" timestamp give to an end-user? For what analysis can it be used, and how?
> > > > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > This time already can be obtained from the getCurrentPmeDuration and new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I can rework the recently added getCurrentPmeDuration metric (not released yet). It seems useless for users in case of non-blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > Let's name it timeSinceOperationsBlocked. It'll be the timestamp when blocking started (minimal value of cluster nodes) and 0 if blocking ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we change the boolean value of metric to a long value that represents time in milliseconds when operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX and metrics are now periodically exported to some backend, it can give a clearer picture of how much time we wait for resuming cache operations than an instant boolean indicator.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric that does not show influence on the cluster correctly. PME can be without blocking operations. For example, client node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest adding a new metric - isOperationsBlockedByPme(). Together, these metrics will show influence of the PME on cluster and user operations.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa is green). [1] Can anyone take a look?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think the administrator of an Ignite cluster should be able to monitor all Ignite processes, including non-blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time and is not so useful because of this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The goal is to show exactly the blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When PME causes no blocking, it's a good PME and I see no reason to have monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone implementation of these metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation of a new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can implement these metrics as a single contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is one simple metric: are operations blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now and can be implemented later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest adding some useful metrics about the partition map exchange (PME). For now, the duration of PME stages is available only in log files and cannot be obtained using JMX or other external tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local node metrics that help to understand the actual status of current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all updates and translations on a previous topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long the wait for all updates to complete was.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > Amelchev Nikita
> > > > > > > >
> > > > > > > >
> > > > > > > >
>

Re: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

Pavel.

> I just want to add that currentPmeTime is also useful for alerting systems, not
> only for eye observing

Fully agree.

Let me make it as clear as I can.
In the end we should have 4 metrics:

`CurrentPMEDuration` - existing metric, shows the current PME duration.
`CurrentPMECacheOperationsBlockedDuration` - new long metric, shows the current blocking duration of PME.

`PMEDuration` - histogram of full PME durations.
`PMECacheOperationsBlockedDuration` - histogram of blocking PME durations.
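
As a rough illustration of how a "current blocking duration" value and a blocking-duration histogram relate, here is a minimal Java sketch. It does not use Ignite's actual HistogramMetric class: the class `PmeBlockingMetrics` and its methods are hypothetical, and the bucket bounds of 50/250/1000 ms are only the example values mentioned in this thread. Timestamps are passed in explicitly so the sketch stays deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical holder for the two blocking metrics; not Ignite's real metric classes. */
final class PmeBlockingMetrics {
    /** Sentinel meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Upper bucket bounds in milliseconds (assumed to be sorted ascending). */
    private final long[] bounds;

    /** One counter per bucket, plus a final bucket for "bounds[last] or more". */
    private final AtomicLongArray buckets;

    /** Timestamp at which the current blocking period started. */
    private final AtomicLong blockStartMillis = new AtomicLong(NOT_BLOCKED);

    PmeBlockingMetrics(long[] bounds) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
    }

    /** Called when PME starts blocking cache operations. */
    void onBlockingStarted(long nowMillis) {
        blockStartMillis.compareAndSet(NOT_BLOCKED, nowMillis);
    }

    /** Called when blocking ends: the finished duration goes into the histogram. */
    void onBlockingFinished(long nowMillis) {
        long start = blockStartMillis.getAndSet(NOT_BLOCKED);
        if (start == NOT_BLOCKED)
            return; // Nothing was blocked, so there is nothing to record.

        long duration = nowMillis - start;
        int idx = 0;
        while (idx < bounds.length && duration >= bounds[idx])
            idx++;

        buckets.incrementAndGet(idx);
    }

    /** Current blocking duration, or 0 if operations are not blocked right now. */
    long currentBlockedDurationMillis(long nowMillis) {
        long start = blockStartMillis.get();
        return start == NOT_BLOCKED ? 0 : nowMillis - start;
    }

    /** Snapshot of bucket counters, e.g. [<50, <250, <1000, >=1000]. */
    long[] histogram() {
        long[] copy = new long[buckets.length()];
        for (int i = 0; i < copy.length; i++)
            copy[i] = buckets.get(i);
        return copy;
    }
}
```

In this sketch the "current" value resets to 0 as soon as blocking finishes, which is what makes it suitable for threshold alerts, while the histogram keeps every finished blocking period so that a short exchange cannot hide an earlier long one between two metric exports.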

On Thu, 25/07/2019 at 22:40 +0300, Pavel Kovalenko wrote:
> Nikolay,
> 
> Okay, sounds reasonable.
> I just want to add that currentPmeTime is also useful for alerting systems, not
> only for eye observing. If the time becomes too long and exceeds some
> threshold, firing an appropriate alert can help to detect a critical
> problem early.
> 
> On Thu, 25 Jul 2019 at 21.12, Nikolay Izhikov <ni...@apache.org> wrote:
> 
> > I think the exact time should be obtained from logs, shouldn't it?
> > 
> > 
> > Thu, 25 Jul 2019, 20:00 Pavel Kovalenko <jo...@gmail.com>:
> > 
> > > Nikolay,
> > > 
> > > Yes, I had a chance to look at HistogramMetric and, moreover, reviewed it) My
> > > question was mostly about what exactly we will track in the histogram.
> > > If we use a histogram, do you know how we can find the exact time when,
> > > e.g., a PME longer than 1s happened?
> > > 
> > > Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov <ni...@apache.org>:
> > > 
> > > > Pavel
> > > > 
> > > > Did you have a chance to look at the HistogramMetric source?
> > > > It is in master now.
> > > > Looking at the source would be better than my explanation)
> > > > 
> > > > We should count PME processes that block operations for some amount of
> > > > time. For example [less than 50, less than 250, less than 1000, more than
> > > > 1000] millis.
> > > > 
> > > > Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <jo...@gmail.com>:
> > > > 
> > > > > Nikolay,
> > > > > 
> > > > > Could you please explain in more detail what the structure of the PME
> > > > > histogram will be?
> > > > > 
> > > > > On Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <ni...@apache.org> wrote:
> > > > > 
> > > > > > Hello, Nikita.
> > > > > > 
> > > > > > I think
> > > > > > 
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after node starts.
> > > > > > 
> > > > > > No, we don't need it.
> > > > > > 
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > 
> > > > > > Yes, we need it.
> > > > > > 
> > > > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > > > Igniters,
> > > > > > > 
> > > > > > > Everyone wants to see the cacheOperationsBlockedDuration metric that will
> > > > > > > show the current blocking duration, or 0 if there is no blocking right now.
> > > > > > > 
> > > > > > > Do we need the following metrics? It seems one of them will be
> > > > > > > superfluous.
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after node starts.
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > > User will be able to configure bounds.
> > > > > > > 
> > > > > > > On Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > 
> > > > > > > > Guys.
> > > > > > > > 
> > > > > > > > I think we should go with the 2 metrics
> > > > > > > > 
> > > > > > > >         * current PME duration (resets on finish)
> > > > > > > > 
> > > > > > > >                 This metric is required for alerting (or automatic
> > > > > > > >                 actions) on long PME.
> > > > > > > > 
> > > > > > > >         * PME duration histogram (value added to metrics on PME finish)
> > > > > > > > 
> > > > > > > >                 This metric is required for:
> > > > > > > >                         * Quick PME trend analysis
> > > > > > > >                         * Quick PME history analysis
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > > > Nikita and Maxim,
> > > > > > > > > 
> > > > > > > > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to
> > > > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > > > > 
> > > > > > > > > > No other changes will be required.
> > > > > > > > > > 
> > > > > > > > > > WDYT?
> > > > > > > > > 
> > > > > > > > > I agree with these two metrics. I also think that current
> > > > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > > > > 
> > > > > > > > > Anton,
> > > > > > > > > 
> > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > "monitoring".
> > > > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > > > progress and so on.
> > > > > > > > > 
> > > > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > > > > node should be killed.
> > > > > > > > > 
> > > > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > > > information from logs, but:
> > > > > > > > > - It's more resource intensive as it requires parsing logs all the time
> > > > > > > > > - It's less reliable as log messages may change
> > > > > > > > > 
> > > > > > > > > Best Regards,
> > > > > > > > > Ivan Rakov
> > > > > > > > > 
> > > > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > > > Folks,
> > > > > > > > > > 
> > > > > > > > > > +1 with Anton post.
> > > > > > > > > > 
> > > > > > > > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to
> > > > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > > > > 
> > > > > > > > > > No other changes will be required.
> > > > > > > > > > 
> > > > > > > > > > WDYT?
> > > > > > > > > > 
> > > > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > Nikolay,
> > > > > > > > > > > 
> > > > > > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > > > > > duration or 0 if there is no blocking right now.
> > > > > > > > > > > 
> > > > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > > > blocking durations that happen after node starts.
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > > > > Nikita
> > > > > > > > > > > > 
> > > > > > > > > > > > What is the difference between those two metrics?
> > > > > > > > > > > > 
> > > > > > > > > > > > On Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I will prepare a PR as soon as possible.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > +1 with Anton decisions.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Wednesday, 24 July 2019, 8:44 +03:00, Anton Vinogradov <av@apache.org> wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > > > > The metrics of interest are
> > > > > > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > > > > > Duration of the current blocking period can be easily presented using any
> > > > > > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > > > > > The initial true will mean "period start", precision will be a result of
> > > > > > > > > > > > > > > checks frequency.
> > > > > > > > > > > > > > > Anyway, I'm ok to have current metric presented with long, where long is a
> > > > > > > > > > > > > > > duration, see no reason, but ok :)
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > > > > > > > > deployment improving and can (should) be taken from logs at the analysis
> > > > > > > > > > > > > > > phase.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that
> > > > 
> > > > initiates
> > > > > > the exchange.
> > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > > > > > 
> > > > > > finished waiting for
> > > > > > > > > > > > > 
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > updates and translations on a previous
> > > 
> > > topology.
> > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node
> > > 
> > > sent a
> > > > > > single message.
> > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> > > > > 
> > > > > received
> > > > > > a full message.
> > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > When new PME started all these metrics
> > 
> > resets.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Every metric from Nikita's list looks useful
> > 
> > and
> > > > > > simple to implement.
> > > > > > > > > > > > > > > > I think that it would be better to change
> > 
> > format
> > > of
> > > > > > metrics 4, 5, 6
> > > > > > > > > > > > > 
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > 7 a bit: we can keep only difference between
> > 
> > time
> > > > of
> > > > > > previous event
> > > > > > > > > > > > > 
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > time of corresponding event. Such metrics would
> > > 
> > > be
> > > > > > easier to perceive:
> > > > > > > > > > > > > > > > they answer to specific questions "how much
> > 
> > time
> > > > did
> > > > > > partition release
> > > > > > > > > > > > > > > > take?" or "how much time did awaiting of
> > > > 
> > > > distributed
> > > > > > phase end take?".
> > > > > > > > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported
> > > 
> > > to
> > > > > > monitoring system,
> > > > > > > > > > > > > > > > graphs will show how different stages times
> > > 
> > > change
> > > > > > from one PME to
> > > > > > > > > > > > > 
> > > > > > > > > > > > > another.
> > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME
> > > 
> > > and I
> > > > > > see no reason to
> > > > > > > > > > > > > 
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Agree with Anton here. These metrics should be
> > > > > > 
> > > > > > measured only for true
> > > > > > > > > > > > > > > > distributed exchange. Saving results for client
> > > > > > 
> > > > > > leave/join PMEs will
> > > > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I agree with total blocking duration metric
> > 
> > but
> > > > > > > > > > > > > > > > > I still don't understand why instant value
> > > > > > 
> > > > > > indicating that
> > > > > > > > > > > > > 
> > > > > > > > > > > > > operations are
> > > > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > > > Duration time since blocking has started
> > 
> > looks
> > > > more
> > > > > > appropriate and
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > > > > It gives more information while semantic is
> > > 
> > > left
> > > > > the
> > > > > > same.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Totally agree with Pavel here. Both
> > 
> > "accumulated
> > > > > block
> > > > > > time" and
> > > > > > > > > > > > > > > > "current PME block time" metrics are useful.
> > > 
> > > Growth
> > > > > of
> > > > > > accumulated
> > > > > > > > > > > > > > > > metric for specific period of time (should be
> > > 
> > > easy
> > > > to
> > > > > > check via
> > > > > > > > > > > > > > > > monitoring system graph) will show for how much
> > > > > > 
> > > > > > business operations
> > > > > > > > > > > > > 
> > > > > > > > > > > > > were
> > > > > > > > > > > > > > > > blocked in total, and non-zero current metric
> > > 
> > > will
> > > > > > show that we are
> > > > > > > > > > > > > > > > experiencing issues right now. Boolean metric
> > > 
> > > "are
> > > > we
> > > > > > blocked right
> > > > > > > > > > > > > 
> > > > > > > > > > > > > now"
> > > > > > > > > > > > > > > > is not needed as it's obviously can be inferred
> > > > 
> > > > from
> > > > > > "current PME
> > > > > > > > > > > > > 
> > > > > > > > > > > > > block
> > > > > > > > > > > > > > > > time".
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I agree with total blocking duration metric
> > 
> > but
> > > > > > > > > > > > > > > > > I still don't understand why instant value
> > > > > > 
> > > > > > indicating that
> > > > > > > > > > > > > 
> > > > > > > > > > > > > operations are
> > > > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > > > Duration time since blocking has started
> > 
> > looks
> > > > more
> > > > > > appropriate and
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > > > > It gives more information while semantic is
> > > 
> > > left
> > > > > the
> > > > > > same.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > All previous suggestions have some
> > > > 
> > > > disadvantages.
> > > > > > It can be several
> > > > > > > > > > > > > > > > > > exchanges between two metric updates and
> > 
> > fast
> > > > > > exchange can rewrite
> > > > > > > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > We can introduce a metric of total blocking
> > > > > > 
> > > > > > duration that will
> > > > > > > > > > > > > > > > > > accumulate at the end of the exchange. So,
> > > > 
> > > > users
> > > > > > will get actual
> > > > > > > > > > > > > > > > > > information about how long operations were
> > > > > > 
> > > > > > blocked. Cluster metric
> > > > > > > > > > > > > > > > > > will be a maximum of local nodes metrics.
> > 
> > And
> > > > we
> > > > > > need a boolean
> > > > > > > > > > > > > 
> > > > > > > > > > > > > metric
> > > > > > > > > > > > > > > > > > that will indicate realtime status. It
> > 
> > needs
> > > > > > because of duration
> > > > > > > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > So I propose to change the current metric
> > > 
> > > that
> > > > > not
> > > > > > released to the
> > > > > > > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric
> > > 
> > > and
> > > > > to
> > > > > > add the
> > > > > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > On Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <av@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Still see no reason to replace boolean
> > 
> > with
> > > > > long.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> > > > > 
> > > > > Izhikov <
> > > > > > > > > > > > > 
> > > > > > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > 1. Value exported based on SPI
> > 
> > settings,
> > > > not
> > > > > > in the moment it
> > > > > > > > > > > > > 
> > > > > > > > > > > > > changed.
> > > > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export
> > > > 
> > > > start
> > > > > > time, we should
> > > > > > > > > > > > > 
> > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > On Mon, 22 Jul 2019, 8:33 Anton Vinogradov <av@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > What's the reason for duration
> > > 
> > > counting?
> > > > > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system
> > 
> > feature
> > > > to
> > > > > > count the durations.
> > > > > > > > > > > > > > > > > > > > > Sine monitoring system checks metrics
> > > > > > 
> > > > > > periodically it will know
> > > > > > > > > > > > > 
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > > > > > 
> > > > > > Kovalenko <
> > > > > > > > > > > > > 
> > > > > > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp.
> > > 
> > > For
> > > > > > the metric name, I
> > > > > > > > > > > > > 
> > > > > > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration",
> > 
> > I
> > > > > think
> > > > > > it cleaner
> > > > > > > > > > > > > 
> > > > > > > > > > > > > represents
> > > > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > > > > > duration to have better correlation
> > > > 
> > > > when
> > > > > > cache operations were
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > > > > > For instant view (like in JMX
> > 
> > bean) a
> > > > > > calculated value as you
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > > > > > For metrics are exported to some
> > > > 
> > > > backend
> > > > > > (IEP-35) a counter
> > > > > > > > > > > > > 
> > > > > > > > > > > > > can be
> > > > > > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > > > > > The counter is incremented by
> > > 
> > > blocking
> > > > > > time after blocking has
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric
> > 
> > is
> > > > > > > > > > > > > > > > > > > > > > > > > how much time we wait for
> > > > 
> > > > resuming
> > > > > > cache operations
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you
> > > > 
> > > > mean
> > > > > > timestamp or duration
> > > > > > > > > > > > > 
> > > > > > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > > > > > What do you think if we
> > 
> > change
> > > > the
> > > > > > boolean value of metric
> > > > > > > > > > > > > 
> > > > > > > > > > > > > to a
> > > > > > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > > > > > value that represents time in
> > > > > > 
> > > > > > milliseconds when operations
> > > > > > > > > > > > > 
> > > > > > > > > > > > > were
> > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > This time can be calculated as
> > > > > > 
> > > > > > (currentTime -
> > > > > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in
> > 
> > case
> > > > of
> > > > > > timestamp.
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > Duration will be more
> > > 
> > > understandable.
> > > > > > It'll be something like
> > > > > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration.
> > 
> > But
> > > I
> > > > > > haven't come up with a
> > > > > > > > > > > > > 
> > > > > > > > > > > > > better
> > > > > > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration
> > > > 
> > > > doesn't
> > > > > > show useful
> > > > > > > > > > > > > 
> > > > > > > > > > > > > information.
> > > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > > > > > 
> > > > > > blocking cache operations.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Not
> > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > > > > > What information gives to an
> > > > 
> > > > end-user
> > > > > > timestamp of
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For
> > > > 
> > > > what
> > > > > > analysis it can be
> > > > > > > > > > > > > 
> > > > > > > > > > > > > used and
> > > > > > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > This time already can be
> > > 
> > > obtained
> > > > > > from the
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> > > > > 
> > > > > metrics.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I
> > > 
> > > can
> > > > > > rework recently added
> > > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric
> > > 
> > > (not
> > > > > > released yet). Seems for
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > > > > > useless in case of
> > 
> > non-blocking
> > > > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > > Lets name it
> > > > > > 
> > > > > > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > > blocking started (minimal
> > 
> > value
> > > > of
> > > > > > cluster nodes) and 0 if
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > > ends (there is no running
> > 
> > PME).
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on
> > > 
> > > this.
> > > > > > What do you think if we
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > > > > > value of metric to a long
> > > 
> > > value
> > > > > > that represents time in
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX
> > > 
> > > and
> > > > > now
> > > > > > metrics are periodically
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > > > > some backend it can give a
> > > 
> > > more
> > > > > > clear picture of how much
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > > > > > resuming cache operations
> > > > 
> > > > instead
> > > > > > of instant boolean
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > > > > > 
> > > > > > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > > > > > 
> > > > > > correctly. PME can be without
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > > > > operations. For example,
> > > > 
> > > > client
> > > > > > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric
> > 
> > -
> > > > > > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > > > > > metrics will show
> > 
> > influence
> > > > of
> > > > > > the PME on cluster and user
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for
> > 
> > this
> > > > > (Bot
> > > > > > visa is green). [1] Can
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > 
> > > > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> > > > > 
> > > > > Ignite
> > > > > > cluster should be able to
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including
> > > 
> > > non
> > > > > > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > > > > > 
> > > > > > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows
> > 
> > exactly
> > > > PME
> > > > > > time and not so useful
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show
> > > > 
> > > > exactly
> > > > > > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no
> > > > 
> > > > blocking,
> > > > > > it's a good PME and I see
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to
> > > 
> > > it
> > > > :)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019
> > 
> > at
> > > > > 2:50
> > > > > > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> > > > > 
> > > > > postpone
> > > > > > implementation of this
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now,
> > > 
> > > implementation
> > > > > of
> > > > > > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can
> > > > 
> > > > implement
> > > > > > this metrics as a single
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в
> > > > 
> > > > 13:47
> > > > > > +0300, Anton Vinogradov
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we
> > > > 
> > > > need
> > > > > > now is a 1 simple metric:
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or
> > > 
> > > false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from
> > > 
> > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics
> > > 
> > > can
> > > > > be
> > > > > > extracted from logs now
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16,
> > > 
> > > 2019
> > > > at
> > > > > > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please,
> > > 
> > > go
> > > > > > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля
> > 
> > 2019
> > > > г.,
> > > > > > 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello,
> > > 
> > > Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to
> > > 
> > > add
> > > > > > some useful metrics about the
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For
> > 
> > now,
> > > > the
> > > > > > duration of PME stages
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > > > > > 
> > > > > > obtained using JMX or other
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the
> > 
> > list
> > > > of
> > > > > > local node metrics that
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status
> > > 
> > > of
> > > > > > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> > > > > 
> > > > > started
> > > > > > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics
> > > > 
> > > > help
> > > > > > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> > 
> > PME
> > > > was
> > > > > > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> > > > 
> > > > awaited
> > > > > > for all updates was
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node
> > > > 
> > > > blocks
> > > > > > PME (didn't send a single
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what
> > > 
> > > triggered
> > > > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > 
> > > > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev
> > 
> > Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > --
> > > > > > > > > > > Best wishes,
> > > > > > > > > > > Amelchev Nikita
> > > > > > > 
> > > > > > > 
> > > > > > >

Re: Partition map exchange metrics

Posted by Pavel Kovalenko <jo...@gmail.com>.

Nikolay,

Okay, sounds reasonable.
I just want to add that currentPmeTime is also useful for alerting systems, not
only for visual observation. If the time becomes too long and exceeds some
threshold, firing an appropriate alert can help to detect a critical
problem early.
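
As a rough illustration of the alerting idea (a sketch only, not Ignite code):
a monitoring script could poll the current blocking duration and fire an alert
once it crosses a threshold, and finished blocking periods could be counted
into buckets like the ones Nikolay listed. Python is used here for brevity.
The threshold, the bucket bounds, the function names and the inclusive
upper-bound semantics are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical values for illustration only; these are not Ignite defaults.
ALERT_THRESHOLD_MS = 1000
HISTOGRAM_BOUNDS_MS = [50, 250, 1000]


def should_alert(current_blocked_ms, threshold_ms=ALERT_THRESHOLD_MS):
    """Return True when the current blocking duration exceeds the threshold.

    A reading of 0 means cache operations are not blocked right now.
    """
    return current_blocked_ms > threshold_ms


def bucket_counts(durations_ms, bounds_ms=HISTOGRAM_BOUNDS_MS):
    """Count finished blocking periods per histogram bucket.

    Bucket i holds durations <= bounds_ms[i] (and above the previous bound);
    the last bucket holds everything above the largest bound.
    bounds_ms must be sorted in ascending order.
    """
    counts = [0] * (len(bounds_ms) + 1)
    for duration in durations_ms:
        # bisect_left finds the first bound that is >= duration.
        counts[bisect_left(bounds_ms, duration)] += 1
    return counts
```

With the assumed 1000 ms threshold, a reading of 1500 ms would alert, while a
reading of 0 (no blocking) never does. Note that the bucket counts alone do not
say when a long blocking period happened; that still has to come from logs.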

On Thu, 25 Jul 2019 at 21.12, Nikolay Izhikov <ni...@apache.org> wrote:

> I think the exact time should be obtained from logs, shouldn't it?
>
>
> Thu, 25 Jul 2019, 20:00 Pavel Kovalenko <jo...@gmail.com>:
>
> > Nikolay,
> >
> > Yes, I have had a chance to look at HistogramMetric and, moreover, reviewed it) My
> > question was mostly about what exactly we will track in the histogram.
> > If we use a histogram, do you know how we can find the exact time, e.g. when a PME
> > with time > 1s happened?
> >
> > Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov <ni...@apache.org>:
> >
> > > Pavel
> > >
> > > Have you had a chance to look at the HistogramMetric source?
> > > It is in master now.
> > > Looking at the source would be better than my explanation)
> > >
> > > We should count PME processes that block operations for some amount of
> > > time. For example, [less than 50, less than 250, less than 1000, more than
> > > 1000] millis.
> > >
> > > Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <jo...@gmail.com>:
> > >
> > > > Nikolay,
> > > >
> > > > Could you please explain in more detail what the structure of the PME
> > > > histogram will be?
> > > >
> > > > Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <ni...@apache.org>:
> > > >
> > > > > Hello, Nikita.
> > > > >
> > > > > I think
> > > > >
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after node starts.
> > > > >
> > > > > No, we don't need it.
> > > > >
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > >
> > > > > Yes, we need it.
> > > > >
> > > > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > > > Igniters,
> > > > > > >
> > > > > > > Everyone wants to see the cacheOperationsBlockedDuration metric that will
> > > > > > > show the current blocking duration, or 0 if there is no blocking right now.
> > > > > > >
> > > > > > > Do we need the following metrics? It seems one of them will be superfluous.
> > > > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > > > all blocking durations that happen after the node starts.
> > > > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > > > User will be able to configure bounds.
> > > > > > >
> > > > > > > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > >
> > > > > > > Guys.
> > > > > > >
> > > > > > > > I think we should go with the 2 metrics:
> > > > > > > >
> > > > > > > >         * current PME duration (resets on finish)
> > > > > > > >
> > > > > > > >                 This metric is required for alerting (or automatic actions) on long PME.
> > > > > > > >
> > > > > > > >         * PME duration histogram (value added to metrics on PME finish)
> > > > > > > >                 This metric is required for:
> > > > > > > >                         * Quick PME trend analysis
> > > > > > > >                         * Quick PME history analysis
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > > Nikita and Maxim,
> > > > > > > >
> > > > > > > > > > What if we just update the behaviour of the current
> > > > > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to
> > > > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > > > >
> > > > > > > > > > No other changes will be required.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > I agree with these two metrics. I also think that current
> > > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > > >
> > > > > > > > Anton,
> > > > > > > >
> > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > "monitoring".
> > > > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > > > progress and so on.
> > > > > > > >
> > > > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > > > time may really help here: e.g. if one specific node hasn't completed
> > > > > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > > > > node should be killed.
> > > > > > > > >
> > > > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > > > information from logs, but:
> > > > > > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > > > > > - It's less reliable, as log messages may change
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Ivan Rakov
> > > > > > > >
> > > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > > +1 to Anton's post.
> > > > > > > > > >
> > > > > > > > > > What if we just update the behaviour of the current
> > > > > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > > > > Keep it as a long value and rename it to
> > > > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > > > >
> > > > > > > > > > No other changes will be required.
> > > > > > > > >
> > > > > > > > > WDYT?
> > > > > > > > >
> > > > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > > > Nikolay,
> > > > > > > > > > >
> > > > > > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > > > > > duration, or 0 if there is no blocking right now.
> > > > > > > > > > >
> > > > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > > > blocking durations that happen after the node starts.
> > > > > > > > > > >
> > > > > > > > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > > > > Nikita
> > > > > > > > > > >
> > > > > > > > > > > What is the difference between those two metrics?
> > > > > > > > > > >
> > > > > > > > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > >
> > > > > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I will prepare a PR in the near future.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 to Anton's decisions.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av@apache.org>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > > > > Interesting metrics are
> > > > > > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > > > > > Duration of the current blocking period can be easily presented using any
> > > > > > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > > > > > An initial true will mean "period start"; precision will be a result of
> > > > > > > > > > > > > > > check frequency.
> > > > > > > > > > > > > > > Anyway, I'm ok to have the current metric presented as a long, where the
> > > > > > > > > > > > > > > long is a duration. I see no reason, but ok :)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > > > > > > > > > deployment and can (should) be taken from logs at the analysis phase.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
> > > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6 and
> > > > > > > > > > > > > > > > 7 a bit: we can keep only the difference between the time of the previous
> > > > > > > > > > > > > > > > event and the time of the corresponding event. Such metrics would be easier
> > > > > > > > > > > > > > > > to perceive: they answer specific questions such as "how much time did
> > > > > > > > > > > > > > > > partition release take?" or "how much time did awaiting of distributed
> > > > > > > > > > > > > > > > phase end take?".
> > > > > > > > > > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > > > > > > > > > graphs will show how the times of different stages change from one PME to
> > > > > > > > > > > > > > > > another.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to have
> > > > > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > > > > > I still don't understand why instant value indicating that operations are
> > > > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > > > Duration time since blocking has started looks more appropriate and useful.
> > > > > > > > > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > > > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > > > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > > > > > > > > > monitoring system graph) will show for how long business operations were
> > > > > > > > > > > > > > > > blocked in total, and a non-zero current metric will show that we are
> > > > > > > > > > > > > > > > experiencing issues right now. A boolean metric "are we blocked right now"
> > > > > > > > > > > > > > > > is not needed, as it obviously can be inferred from "current PME block
> > > > > > > > > > > > > > > > time".
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I agree with total blocking duration metric
> but
> > > > > > > > > > > > > > > > I still don't understand why instant value
> > > > > indicating that
> > > > > > > > > > > >
> > > > > > > > > > > > operations are
> > > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > > Duration time since blocking has started
> looks
> > > more
> > > > > appropriate and
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > > > It gives more information while semantic is
> > left
> > > > the
> > > > > same.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev
> <
> > > > > nsamelchev@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > All previous suggestions have some
> > > disadvantages.
> > > > > It can be several
> > > > > > > > > > > > > > > > > exchanges between two metric updates and
> fast
> > > > > exchange can rewrite
> > > > > > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > We can introduce a metric of total blocking
> > > > > duration that will
> > > > > > > > > > > > > > > > > accumulate at the end of the exchange. So,
> > > users
> > > > > will get actual
> > > > > > > > > > > > > > > > > information about how long operations were
> > > > > blocked. Cluster metric
> > > > > > > > > > > > > > > > > will be a maximum of local nodes metrics.
> And
> > > we
> > > > > need a boolean
> > > > > > > > > > > >
> > > > > > > > > > > > metric
> > > > > > > > > > > > > > > > > that will indicate realtime status. It
> needs
> > > > > because of duration
> > > > > > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So I propose to change the current metric
> > that
> > > > not
> > > > > released to the
> > > > > > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric
> > and
> > > > to
> > > > > add the
> > > > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > пн, 22 июл. 2019 г. в 09:27, Anton
> > Vinogradov <
> > > > > av@apache.org >:
> > > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Still see no reason to replace boolean
> with
> > > > long.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> > > > Izhikov <
> > > > > > > > > > > >
> > > > > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > 1. Value exported based on SPI
> settings,
> > > not
> > > > > in the moment it
> > > > > > > > > > > >
> > > > > > > > > > > > changed.
> > > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export
> > > start
> > > > > time, we should
> > > > > > > > > > > >
> > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton
> > Vinogradov
> > > <
> > > > > av@apache.org >:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > What's the reason for duration
> > counting?
> > > > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system
> feature
> > > to
> > > > > count the durations.
> > > > > > > > > > > > > > > > > > > > Sine monitoring system checks metrics
> > > > > periodically it will know
> > > > > > > > > > > >
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > > > > Kovalenko <
> > > > > > > > > > > >
> > > > > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp.
> > For
> > > > > the metric name, I
> > > > > > > > > > > >
> > > > > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration",
> I
> > > > think
> > > > > it cleaner
> > > > > > > > > > > >
> > > > > > > > > > > > represents
> > > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > > > > duration to have better correlation
> > > when
> > > > > cache operations were
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > > > > For instant view (like in JMX
> bean) a
> > > > > calculated value as you
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > > > > For metrics are exported to some
> > > backend
> > > > > (IEP-35) a counter
> > > > > > > > > > > >
> > > > > > > > > > > > can be
> > > > > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > > > > The counter is incremented by
> > blocking
> > > > > time after blocking has
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> > > > > Amelchev <
> > > > > > > > > > > >
> > > > > > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric
> is
> > > > > > > > > > > > > > > > > > > > > > > > how much time we wait for
> > > resuming
> > > > > cache operations
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you
> > > mean
> > > > > timestamp or duration
> > > > > > > > > > > >
> > > > > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > > > > What do you think if we
> change
> > > the
> > > > > boolean value of metric
> > > > > > > > > > > >
> > > > > > > > > > > > to a
> > > > > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > > > > value that represents time in
> > > > > milliseconds when operations
> > > > > > > > > > > >
> > > > > > > > > > > > were
> > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > This time can be calculated as
> > > > > (currentTime -
> > > > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in
> case
> > > of
> > > > > timestamp.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Duration will be more
> > understandable.
> > > > > It'll be something like
> > > > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration.
> But
> > I
> > > > > haven't come up with a
> > > > > > > > > > > >
> > > > > > > > > > > > better
> > > > > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko <jokserfn@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration
> > > doesn't
> > > > > show useful
> > > > > > > > > > > >
> > > > > > > > > > > > information.
> > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > > > > blocking cache operations.
> > > > > > > > > > > >
> > > > > > > > > > > > Not
> > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > > > > What information gives to an
> > > end-user
> > > > > timestamp of
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For
> > > what
> > > > > analysis it can be
> > > > > > > > > > > >
> > > > > > > > > > > > used and
> > > > > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > > > > > > Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > This time already can be
> > obtained
> > > > > from the
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> > > > metrics.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I
> > can
> > > > > rework recently added
> > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric
> > (not
> > > > > released yet). Seems for
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > > > > useless in case of
> non-blocking
> > > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > Lets name it
> > > > > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > blocking started (minimal
> value
> > > of
> > > > > cluster nodes) and 0 if
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > ends (there is no running
> PME).
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <jokserfn@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on
> > this.
> > > > > What do you think if we
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > > > > value of metric to a long
> > value
> > > > > that represents time in
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX
> > and
> > > > now
> > > > > metrics are periodically
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > > > some backend it can give a
> > more
> > > > > clear picture of how much
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > > > > resuming cache operations
> > > instead
> > > > > of instant boolean
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > > > > > > Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > > > > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > > > > correctly. PME can be without
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > > > operations. For example,
> > > client
> > > > > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric
> -
> > > > > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > > > > metrics will show
> influence
> > > of
> > > > > the PME on cluster and user
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for
> this
> > > > (Bot
> > > > > visa is green). [1] Can
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> > > > Ignite
> > > > > cluster should be able to
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including
> > non
> > > > > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > > > > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows
> exactly
> > > PME
> > > > > time and not so useful
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show
> > > exactly
> > > > > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no
> > > blocking,
> > > > > it's a good PME and I see
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to
> > it
> > > :)
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019
> at
> > > > 2:50
> > > > > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> > > > postpone
> > > > > implementation of this
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now,
> > implementation
> > > > of
> > > > > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can
> > > implement
> > > > > this metrics as a single
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we
> > > need
> > > > > now is a 1 simple metric:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or
> > false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from
> > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics
> > can
> > > > be
> > > > > extracted from logs now
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16,
> > 2019
> > > at
> > > > > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please,
> > go
> > > > > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tue, Jul 16, 2019, 11:45 Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello,
> > Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to
> > add
> > > > > some useful metrics about the
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For
> now,
> > > the
> > > > > duration of PME stages
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > > > > obtained using JMX or other
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the
> list
> > > of
> > > > > local node metrics that
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status
> > of
> > > > > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1.
> > > initialVersion.
> > > > > Topology version that
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime.
> > Time
> > > > > PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent.
> > > Event
> > > > > that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> > > > > partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> > > > > translations on a previous
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> > > > > sendSingleMessageTime. Time when a node
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> > > > > recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7.
> finishTime.
> > > Time
> > > > > PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> > > > started
> > > > > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics
> > > help
> > > > > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> PME
> > > was
> > > > > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> > > awaited
> > > > > for all updates was
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node
> > > blocks
> > > > > PME (didn't send a single
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what
> > triggered
> > > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev
> Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

I think the exact time should be obtained from logs, shouldn't it?
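
To make the point concrete, here is a minimal sketch of the metrics discussed in this thread: the current blocking duration, the accumulated total, and a bucketed histogram. It is only an illustration under stated assumptions: the class and method names are hypothetical, this is not Ignite's HistogramMetric API, and the bounds are simply the example values mentioned below (50, 250, 1000 millis).

```java
/**
 * Hypothetical sketch of the blocking metrics discussed in this thread.
 * Not Ignite code and not thread-safe; timestamps are passed in explicitly
 * so that the behaviour is easy to follow.
 */
class PmeBlockingSketch {
    /** Bucket bounds in milliseconds (example values from the discussion). */
    private final long[] bounds = {50, 250, 1000};

    /** buckets[i] counts durations below bounds[i]; the last one counts the rest. */
    private final long[] buckets = new long[bounds.length + 1];

    /** Start of the current blocking period, or -1 if nothing is blocked. */
    private long blockingStartTs = -1;

    /** Sum of all finished blocking periods since the node started. */
    private long totalBlockedMs;

    void onBlockingStarted(long nowMs) {
        blockingStartTs = nowMs;
    }

    void onBlockingFinished(long nowMs) {
        if (blockingStartTs < 0)
            return; // Nothing was blocked.

        long duration = nowMs - blockingStartTs;

        blockingStartTs = -1;
        totalBlockedMs += duration;

        // Find the first bucket whose upper bound exceeds the duration.
        int i = 0;

        while (i < bounds.length && duration >= bounds[i])
            i++;

        buckets[i]++;
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    long currentBlockedDuration(long nowMs) {
        return blockingStartTs < 0 ? 0 : nowMs - blockingStartTs;
    }

    long totalBlockedDuration() {
        return totalBlockedMs;
    }

    long[] histogram() {
        return buckets.clone();
    }
}
```

In this sketch the histogram keeps only counts per bucket. It can show that a blocking period of 1000 millis or more happened, but not when it happened, so the timestamp would still have to come from somewhere else, such as the logs.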


Thu, Jul 25, 2019, 20:00 Pavel Kovalenko <jo...@gmail.com>:

> Nikolay,
>
> Yes, I have seen HistogramMetric and, moreover, reviewed it) My
> question was mostly about what exactly we will track in the histogram.
> If we use a histogram, do you know how we can find the exact time when,
> e.g., a PME that took > 1s happened?
>
> Thu, Jul 25, 2019 at 19:24, Nikolay Izhikov <ni...@apache.org>:
>
> > Pavel
> >
> > Have you had a chance to look at the HistogramMetric source?
> > It is in master now.
> > Looking at the source would be better than my explanation)
> >
> > We should count the PME processes that block operations for a given
> > amount of time, for example with buckets of [less than 50, less than 250,
> > less than 1000, more than 1000] millis.
> >
> > Thu, Jul 25, 2019, 18:55 Pavel Kovalenko <jo...@gmail.com>:
> >
> > > Nikolay,
> > >
> > > Could you please explain in more detail what the structure of the PME
> > > histogram will be?
> > >
> > > Thu, Jul 25, 2019 at 11:56, Nikolay Izhikov <ni...@apache.org>:
> > >
> > > > Hello, Nikita.
> > > >
> > > > I think
> > > >
> > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > all blocking durations that happen after node starts.
> > > >
> > > > No, we don't need it.
> > > >
> > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > >
> > > > Yes, we need it.
> > > >
> > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > Igniters,
> > > > >
> > > > > Everyone wants to see the cacheOperationsBlockedDuration metric, which
> > > > > will show the current blocking duration, or 0 if there is no blocking
> > > > > right now.
> > > > >
> > > > > Do we need the following metrics? It seems one of them will be superfluous.
> > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > all blocking durations that happen after the node starts.
> > > > > 2. Blocking duration histogram, based on the HistogramMetric class.
> > > > > The user will be able to configure the bounds.
> > > > >
> > > > > Wed, Jul 24, 2019 at 18:26, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > >
> > > > > > Guys.
> > > > > >
> > > > > > I think we should go with these 2 metrics:
> > > > > >
> > > > > >         * current PME duration (resets on finish)
> > > > > >
> > > > > >                 This metric is required for alerting (or automatic
> > > > > >                 actions) on long PME.
> > > > > >
> > > > > >         * PME duration histogram (value added to the metric on PME
> > > > > >           finish)
> > > > > >                 This metric is required for:
> > > > > >                         * Quick PME trend analysis
> > > > > >                         * Quick PME history analysis
> > > > > >
> > > > > >
> > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > Nikita and Maxim,
> > > > > > >
> > > > > > > > What if we just update the behaviour of the current
> > > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > > Keep it as a long value and rename it to
> > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > >
> > > > > > > > No other changes will be required.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > >
> > > > > > > I agree with these two metrics. I also think that the current
> > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > >
> > > > > > > Anton,
> > > > > > >
> > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > "monitoring".
> > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > progress and so on.
> > > > > > >
> > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > > node should be killed.
> > > > > > >
> > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > information from logs, but:
> > > > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > > > - It's less reliable, as log messages may change
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Ivan Rakov
> > > > > > >
> > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > +1 to Anton's post.
> > > > > > > >
> > > > > > > > What if we just update the behaviour of the current
> > > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > > Keep it as a long value and rename it to
> > > > > > > > getCacheOperationsBlockedDuration.
> > > > > > > >
> > > > > > > > No other changes will be required.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > > Nikolay,
> > > > > > > > >
> > > > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > > > duration, or 0 if there is no blocking right now.
> > > > > > > > >
> > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > blocking durations that happen after the node starts.
> > > > > > > > >
> > > > > > > > > Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org>:
> > > > > > > > > > Nikita
> > > > > > > > > >
> > > > > > > > > > What is the difference between those two metrics?
> > > > > > > > > >
> > > > > > > > > > Wed, Jul 24, 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > > >
> > > > > > > > > > > From the discussion, it is clear that we need only two metrics for now:
> > > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > >
> > > > > > > > > > > I will prepare a PR shortly.
> > > > > > > > > > >
> > > > > > > > > > > Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 to Anton's decisions.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <av@apache.org>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > > The metrics of interest are:
> > > > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > > > The duration of the current blocking period can easily be derived by any
> > > > > > > > > > > > > modern monitoring tool through regular checks.
> > > > > > > > > > > > > The first "true" will mean "period start"; the precision will be
> > > > > > > > > > > > > determined by the check frequency.
> > > > > > > > > > > > > Anyway, I'm ok with presenting the current metric as a long, where the
> > > > > > > > > > > > > long is a duration. I see no reason for it, but ok :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > > > > > > > deployment and can (should) be taken from logs at the analysis
> > > > > > > > > > > > > phase.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6
> > > > > > > > > > > > > > and 7 a bit: we can keep only the difference between the time of the
> > > > > > > > > > > > > > previous event and the time of the corresponding event. Such metrics
> > > > > > > > > > > > > > would be easier to perceive: they answer specific questions such as
> > > > > > > > > > > > > > "how much time did partition release take?" or "how much time did
> > > > > > > > > > > > > > awaiting the end of the distributed phase take?".
> > > > > > > > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > > > > > > > graphs will show how the times of different stages change from one PME
> > > > > > > > > > > > > > to another.
> > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to
> > > > > > > > > > > > > > > have monitoring related to it
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > > > I still don't understand why the instant value indicating that
> > > > > > > > > > > > > > > operations are blocked should be boolean.
> > > > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > > > > > > > monitoring system graph) will show for how long business operations
> > > > > > > > > > > > > > were blocked in total, and a non-zero current metric will show that we
> > > > > > > > > > > > > > are experiencing issues right now. The boolean metric "are we blocked
> > > > > > > > > > > > > > right now" is not needed, as it can obviously be inferred from
> > > > > > > > > > > > > > "current PME block time".
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > > > I still don't understand why instant value
> > > > indicating that
> > > > > > > > > > >
> > > > > > > > > > > operations are
> > > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > > Duration time since blocking has started looks
> > more
> > > > appropriate and
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > > It gives more information while semantic is
> left
> > > the
> > > > same.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev <
> > > > nsamelchev@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > :
> > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All previous suggestions have some
> > disadvantages.
> > > > It can be several
> > > > > > > > > > > > > > > > exchanges between two metric updates and fast
> > > > exchange can rewrite
> > > > > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We can introduce a metric of total blocking
> > > > duration that will
> > > > > > > > > > > > > > > > accumulate at the end of the exchange. So,
> > users
> > > > will get actual
> > > > > > > > > > > > > > > > information about how long operations were
> > > > blocked. Cluster metric
> > > > > > > > > > > > > > > > will be a maximum of local nodes metrics. And
> > we
> > > > need a boolean
> > > > > > > > > > >
> > > > > > > > > > > metric
> > > > > > > > > > > > > > > > that will indicate realtime status. It needs
> > > > because of duration
> > > > > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So I propose to change the current metric
> that
> > > not
> > > > released to the
> > > > > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric
> and
> > > to
> > > > add the
> > > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > пн, 22 июл. 2019 г. в 09:27, Anton
> Vinogradov <
> > > > av@apache.org >:
> > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Still see no reason to replace boolean with
> > > long.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> > > Izhikov <
> > > > > > > > > > >
> > > > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. Value exported based on SPI settings,
> > not
> > > > in the moment it
> > > > > > > > > > >
> > > > > > > > > > > changed.
> > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export
> > start
> > > > time, we should
> > > > > > > > > > >
> > > > > > > > > > > also
> > > > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton
> Vinogradov
> > <
> > > > av@apache.org >:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > > > > > > > > Since the monitoring system checks metrics periodically, it will know
> > > > > > > > > > > > > > > > > > > > the duration from its own log.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > > > Kovalenko <
> > > > > > > > > > >
> > > > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp.
> For
> > > > the metric name, I
> > > > > > > > > > >
> > > > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I
> > > think
> > > > it cleaner
> > > > > > > > > > >
> > > > > > > > > > > represents
> > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > > > duration to have better correlation
> > when
> > > > cache operations were
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a
> > > > calculated value as you
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > > > For metrics are exported to some
> > backend
> > > > (IEP-35) a counter
> > > > > > > > > > >
> > > > > > > > > > > can be
> > > > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > > > The counter is incremented by
> blocking
> > > > time after blocking has
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> > > > Amelchev <
> > > > > > > > > > >
> > > > > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > > > > how much time we wait for
> > resuming
> > > > cache operations
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you
> > mean
> > > > timestamp or duration
> > > > > > > > > > >
> > > > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > > > What do you think if we change
> > the
> > > > boolean value of metric
> > > > > > > > > > >
> > > > > > > > > > > to a
> > > > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > > > value that represents time in
> > > > milliseconds when operations
> > > > > > > > > > >
> > > > > > > > > > > were
> > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > This time can be calculated as
> > > > (currentTime -
> > > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case
> > of
> > > > timestamp.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Duration will be more
> understandable.
> > > > It'll be something like
> > > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But
> I
> > > > haven't come up with a
> > > > > > > > > > >
> > > > > > > > > > > better
> > > > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 18:30, Pavel
> > > > Kovalenko <
> > > > > > > > > > >
> > > > > > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration
> > doesn't
> > > > show useful
> > > > > > > > > > >
> > > > > > > > > > > information.
> > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > > > blocking cache operations.
> > > > > > > > > > >
> > > > > > > > > > > Not
> > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > > > What information gives to an
> > end-user
> > > > timestamp of
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For
> > what
> > > > analysis it can be
> > > > > > > > > > >
> > > > > > > > > > > used and
> > > > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48,
> Nikita
> > > > Amelchev <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > This time already can be
> obtained
> > > > from the
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> > > metrics.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I
> can
> > > > rework recently added
> > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric
> (not
> > > > released yet). Seems for
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > > > useless in case of non-blocking
> > > PME.
> > > > > > > > > > > > > > > > > > > > > > > Lets name it
> > > > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > blocking started (minimal value
> > of
> > > > cluster nodes) and 0 if
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56,
> > Pavel
> > > > Kovalenko <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on
> this.
> > > > What do you think if we
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > > > value of metric to a long
> value
> > > > that represents time in
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX
> and
> > > now
> > > > metrics are periodically
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > > some backend it can give a
> more
> > > > clear picture of how much
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > > > resuming cache operations
> > instead
> > > > of instant boolean
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41,
> > > > Nikita Amelchev <
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > > > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > > > correctly. PME can be without
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > > operations. For example,
> > client
> > > > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric -
> > > > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > > > metrics will show influence
> > of
> > > > the PME on cluster and user
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this
> > > (Bot
> > > > visa is green). [1] Can
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в
> 14:58,
> > > > Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> > > Ignite
> > > > cluster should be able to
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including
> non
> > > > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57
> > > > +0300, Anton Vinogradov пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > > > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly
> > PME
> > > > time and not so useful
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show
> > exactly
> > > > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no
> > blocking,
> > > > it's a good PME and I see
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to
> it
> > :)
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> > > 2:50
> > > > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> > > postpone
> > > > implementation of this
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > For now,
> implementation
> > > of
> > > > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can
> > implement
> > > > this metrics as a single
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в
> > 13:47
> > > > +0300, Anton Vinogradov
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we
> > need
> > > > now is a 1 simple metric:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or
> false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from
> this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics
> can
> > > be
> > > > extracted from logs now
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16,
> 2019
> > at
> > > > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please,
> go
> > > > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля 2019
> > г.,
> > > > 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello,
> Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to
> add
> > > > some useful metrics about the
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now,
> > the
> > > > duration of PME stages
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > > > obtained using JMX or other
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list
> > of
> > > > local node metrics that
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status
> of
> > > > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1.
> > initialVersion.
> > > > Topology version that
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime.
> Time
> > > > PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent.
> > Event
> > > > that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> > > > partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> > > > translations on a previous
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> > > > sendSingleMessageTime. Time when a node
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> > > > recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime.
> > Time
> > > > PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> > > started
> > > > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics
> > help
> > > > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME
> > was
> > > > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> > awaited
> > > > for all updates was
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node
> > blocks
> > > > PME (didn't send a single
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what
> triggered
> > > PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best wishes,
> > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best wishes,
> > > > > > > > > Amelchev Nikita
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
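
The two long-valued metrics proposed in the quoted thread above (a current blocking duration that reads 0 when nothing is blocked, and a total that accumulates once each blocking period ends) can be sketched roughly as follows. This is only an illustration under assumptions: the class BlockingDurationTracker and its callbacks are hypothetical and are not Ignite code; only the two metric names come from the thread. Timestamps are passed in explicitly so the behaviour is easy to check.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical tracker for illustration only; not part of Apache Ignite. */
final class BlockingDurationTracker {
    /** Marker meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Start time (ms) of the current blocking period, or NOT_BLOCKED. */
    private final AtomicLong blockStartMs = new AtomicLong(NOT_BLOCKED);

    /** Sum (ms) of all finished blocking periods since the tracker was created. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Called when a blocking PME starts blocking cache operations. */
    void onBlockingStarted(long nowMs) {
        // Keep the earliest start if we are already inside a blocking period.
        blockStartMs.compareAndSet(NOT_BLOCKED, nowMs);
    }

    /** Called when cache operations are resumed. */
    void onBlockingFinished(long nowMs) {
        long start = blockStartMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMs.addAndGet(nowMs - start);
    }

    /** Current blocking duration in ms, or 0 if there is no blocking right now. */
    long cacheOperationsBlockedDuration(long nowMs) {
        long start = blockStartMs.get();

        return start == NOT_BLOCKED ? 0L : nowMs - start;
    }

    /** Accumulated duration in ms of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

In this sketch a monitoring system can alert on a non-zero current value and graph the growth of the total, which are the two uses named in the discussion.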

Re: Partition map exchange metrics

Posted by Pavel Kovalenko <jo...@gmail.com>.

Nikolay,

Yes, I have seen HistogramMetric and, moreover, reviewed it :) My
question was mostly about what exactly we will track in the histogram.
If we use a histogram, do you know how we can find the exact time when,
e.g., a PME that took more than 1 s happened?
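
For context, the bucket scheme described in the quoted reply below ([less than 50, less than 250, less than 1000, more than 1000] millis) can be sketched as a plain counter per bucket. This is a minimal sketch under assumptions: BlockingDurationHistogram is a hypothetical class and does not reproduce Ignite's HistogramMetric API, and placing a value exactly equal to a bound into the next bucket is my own choice.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical bucketed counter for illustration only; not Ignite's HistogramMetric. */
final class BlockingDurationHistogram {
    /** Ascending bucket bounds in milliseconds. */
    private final long[] bounds;

    /** One counter per bucket: bounds.length buckets plus one overflow bucket. */
    private final AtomicLongArray counts;

    BlockingDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Records one finished blocking period. */
    void record(long durationMs) {
        int bucket = 0;

        // Assumption: a value equal to a bound goes to the next bucket.
        while (bucket < bounds.length && durationMs >= bounds[bucket])
            bucket++;

        counts.incrementAndGet(bucket);
    }

    /** Returns a copy of the per-bucket counts. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }
}
```

A counter like this keeps only how many blocking periods fell into each bucket. In this sketch the moment at which a slow PME happened is not stored, so it would have to come from elsewhere, for example from the time at which a monitoring system first sees the last bucket grow, or from logs.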

On Thu, Jul 25, 2019 at 19:24, Nikolay Izhikov <ni...@apache.org> wrote:

> Pavel,
>
> Have you had a chance to look at the HistogramMetric source?
> It is in master now.
> Looking at the source would be better than my explanation :)
>
> We should count the PMEs that block operations for some amount of time,
> for example in the buckets [less than 50, less than 250, less than 1000,
> more than 1000] millis.
>
> On Thu, Jul 25, 2019 at 18:55, Pavel Kovalenko <jo...@gmail.com> wrote:
>
> > Nikolay,
> >
> > Could you please explain in more detail what the structure of the PME
> > histogram will be?
> >
> > On Thu, Jul 25, 2019 at 11:56, Nikolay Izhikov <ni...@apache.org> wrote:
> >
> > > Hello, Nikita.
> > >
> > > I think
> > >
> > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > all blocking durations that happen after node starts.
> > >
> > > No, we don't need it.
> > >
> > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > >
> > > Yes, we need it.
> > >
> > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > Igniters,
> > > >
> > > > Everyone wants to see the cacheOperationsBlockedDuration metric, which
> > > > will show the current blocking duration, or 0 if there is no blocking
> > > > right now.
> > > >
> > > > Do we need both of the following metrics? It seems one of them will be
> > > > superfluous.
> > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > all blocking durations that happen after the node starts.
> > > > 2. A blocking duration histogram based on the HistogramMetric class.
> > > > The user will be able to configure the bounds.
> > > >
> > > > On Wed, Jul 24, 2019 at 18:26, Nikolay Izhikov <ni...@apache.org> wrote:
> > > > >
> > > > > Guys.
> > > > >
> > > > > I think we should go with these two metrics:
> > > > >
> > > > >         * current PME duration (resets on finish)
> > > > >                 This metric is required for alerting (or automatic
> > > > >                 actions) on a long PME.
> > > > >
> > > > >         * PME duration histogram (value added to the metric on PME
> > > > >           finish)
> > > > >                 This metric is required for:
> > > > >                         * Quick PME trend analysis
> > > > >                         * Quick PME history analysis
> > > > >
> > > > >
> > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > Nikita and Maxim,
> > > > > >
> > > > > > > What if we just update the behaviour of the current
> > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > Keep it as a long value and rename it to
> > > > > > > getCacheOperationsBlockedDuration.
> > > > > > >
> > > > > > > No other changes will be required.
> > > > > > >
> > > > > > > WDYT?
> > > > > >
> > > > > > I agree with these two metrics. I also think that current
> > > > > > getCurrentPmeDuration will become redundant.
> > > > > >
> > > > > > Anton,
> > > > > >
> > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > "monitoring".
> > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > progress and so on.
> > > > > >
> > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > node should be killed.
> > > > > >
> > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > information from logs, but:
> > > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > > - It's less reliable, as log messages may change
> > > > > >
> > > > > > Best Regards,
> > > > > > Ivan Rakov
> > > > > >
> > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > Folks,
> > > > > > >
> > > > > > > +1 with Anton post.
> > > > > > >
> > > > > > > What if we just update the behaviour of the current
> > > > > > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > > > > > Keep it as a long value and rename it to
> > > > > > > getCacheOperationsBlockedDuration.
> > > > > > >
> > > > > > > No other changes will be required.
> > > > > > >
> > > > > > > WDYT?
> > > > > > >
> > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > Nikolay,
> > > > > > > >
> > > > > > > > The cacheOperationsBlockedDuration metric will show the current
> > > > > > > > blocking duration, or 0 if there is no blocking right now.
> > > > > > > >
> > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate
> > > > > > > > all blocking durations that happen after the node starts.
> > > > > > > >
> > > > > > > > On Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > > Nikita
> > > > > > > > >
> > > > > > > > > What is the difference between those two metrics?
> > > > > > > > >
> > > > > > > > > On Wed, Jul 24, 2019 at 12:45, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Igniters, thanks for comments.
> > > > > > > > > >
> > > > > > > > > > From the discussion, it is clear that we need only two metrics
> > > > > > > > > > for now:
> > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > >
> > > > > > > > > > I will prepare a PR shortly.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > +1 with Anton decisions.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov
> <
> > > av@apache.org>:
> > > > > > > > > > > >
> > > > > > > > > > > > Folks,
> > > > > > > > > > > >
> > > > > > > > > > > > It looks like we're trying to implement "extended
> > debug"
> > > instead of
> > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > It should not be interesting for real admin what
> phase
> > > of PME is in
> > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > Interested metrics are
> > > > > > > > > > > > - total blocked time (will be used for real SLA
> > counting)
> > > > > > > > > > > > - are we blocked right now (shows we have an SLA
> > > degradation right now)
> > > > > > > > > > > > Duration of the current blocking period can be easily
> > > presented using
> > > > > > > > > >
> > > > > > > > > > any
> > > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > > Initial true will means "period start", precision
> will
> > > be a result of
> > > > > > > > > > > > checks frequency.
> > > > > > > > > > > > Anyway, I'm ok to have current metric presented with
> > > long, where long
> > > > > > > > > >
> > > > > > > > > > is a
> > > > > > > > > > > > duration, see no reason, but ok :)
> > > > > > > > > > > >
> > > > > > > > > > > > All other features you mentioned are useful for code
> or
> > > > > > > > > > > > deployment improving and can (should) be taken from
> > logs
> > > at the analysis
> > > > > > > > > > > > phase.
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <
> > > ivan.glukos@gmail.com >
> > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. initialVersion. Topology version that
> initiates
> > > the exchange.
> > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > > finished waiting for
> > > > > > > > > >
> > > > > > > > > > all
> > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a
> > > single message.
> > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> > received
> > > a full message.
> > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Every metric from Nikita's list looks useful and
> > > simple to implement.
> > > > > > > > > > > > > I think that it would be better to change format of
> > > metrics 4, 5, 6
> > > > > > > > > >
> > > > > > > > > > and
> > > > > > > > > > > > > 7 a bit: we can keep only difference between time
> of
> > > previous event
> > > > > > > > > >
> > > > > > > > > > and
> > > > > > > > > > > > > time of corresponding event. Such metrics would be
> > > easier to perceive:
> > > > > > > > > > > > > they answer to specific questions "how much time
> did
> > > partition release
> > > > > > > > > > > > > take?" or "how much time did awaiting of
> distributed
> > > phase end take?".
> > > > > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to
> > > monitoring system,
> > > > > > > > > > > > > graphs will show how different stages times change
> > > from one PME to
> > > > > > > > > >
> > > > > > > > > > another.
> > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I
> > > see no reason to
> > > > > > > > > >
> > > > > > > > > > have
> > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > >
> > > > > > > > > > > > > Agree with Anton here. These metrics should be
> > > measured only for true
> > > > > > > > > > > > > distributed exchange. Saving results for client
> > > leave/join PMEs will
> > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > > I still don't understand why instant value
> > > indicating that
> > > > > > > > > >
> > > > > > > > > > operations are
> > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > Duration time since blocking has started looks
> more
> > > appropriate and
> > > > > > > > > > > > >
> > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > It gives more information while semantic is left
> > the
> > > same.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated
> > block
> > > time" and
> > > > > > > > > > > > > "current PME block time" metrics are useful. Growth
> > of
> > > accumulated
> > > > > > > > > > > > > metric for specific period of time (should be easy
> to
> > > check via
> > > > > > > > > > > > > monitoring system graph) will show for how much
> > > business operations
> > > > > > > > > >
> > > > > > > > > > were
> > > > > > > > > > > > > blocked in total, and non-zero current metric will
> > > show that we are
> > > > > > > > > > > > > experiencing issues right now. Boolean metric "are
> we
> > > blocked right
> > > > > > > > > >
> > > > > > > > > > now"
> > > > > > > > > > > > > is not needed as it's obviously can be inferred
> from
> > > "current PME
> > > > > > > > > >
> > > > > > > > > > block
> > > > > > > > > > > > > time".
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > > I still don't understand why instant value
> > > indicating that
> > > > > > > > > >
> > > > > > > > > > operations are
> > > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > > Duration time since blocking has started looks
> more
> > > appropriate and
> > > > > > > > > > > > >
> > > > > > > > > > > > > useful.
> > > > > > > > > > > > > > It gives more information while semantic is left
> > the
> > > same.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <
> > > nsamelchev@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > :
> > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All previous suggestions have some
> disadvantages.
> > > It can be several
> > > > > > > > > > > > > > > exchanges between two metric updates and fast
> > > exchange can rewrite
> > > > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We can introduce a metric of total blocking
> > > duration that will
> > > > > > > > > > > > > > > accumulate at the end of the exchange. So,
> users
> > > will get actual
> > > > > > > > > > > > > > > information about how long operations were
> > > blocked. Cluster metric
> > > > > > > > > > > > > > > will be a maximum of local nodes metrics. And
> we
> > > need a boolean
> > > > > > > > > >
> > > > > > > > > > metric
> > > > > > > > > > > > > > > that will indicate realtime status. It needs
> > > because of duration
> > > > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So I propose to change the current metric that
> > not
> > > released to the
> > > > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and
> > to
> > > add the
> > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <
> > > av@apache.org >:
> > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Still see no reason to replace boolean with
> > long.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> > Izhikov <
> > > > > > > > > >
> > > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. Value exported based on SPI settings,
> not
> > > in the moment it
> > > > > > > > > >
> > > > > > > > > > changed.
> > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export
> start
> > > time, we should
> > > > > > > > > >
> > > > > > > > > > also
> > > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov
> <
> > > av@apache.org >:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature
> to
> > > count the durations.
> > > > > > > > > > > > > > > > > > > Since monitoring system checks metrics
> > > periodically it will know
> > > > > > > > > >
> > > > > > > > > > the
> > > > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > > Kovalenko <
> > > > > > > > > >
> > > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For
> > > the metric name, I
> > > > > > > > > >
> > > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I
> > think
> > > it cleaner
> > > > > > > > > >
> > > > > > > > > > represents
> > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > > duration to have better correlation
> when
> > > cache operations were
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a
> > > calculated value as you
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > > For metrics are exported to some
> backend
> > > (IEP-35) a counter
> > > > > > > > > >
> > > > > > > > > > can be
> > > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > > The counter is incremented by blocking
> > > time after blocking has
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita
> > > Amelchev <
> > > > > > > > > >
> > > > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > > > how much time we wait for
> resuming
> > > cache operations
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you
> mean
> > > timestamp or duration
> > > > > > > > > >
> > > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > > What do you think if we change
> the
> > > boolean value of metric
> > > > > > > > > >
> > > > > > > > > > to a
> > > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > > value that represents time in
> > > milliseconds when operations
> > > > > > > > > >
> > > > > > > > > > were
> > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > This time can be calculated as
> > > (currentTime -
> > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case
> of
> > > timestamp.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Duration will be more understandable.
> > > It'll be something like
> > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I
> > > haven't come up with a
> > > > > > > > > >
> > > > > > > > > > better
> > > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel
> > > Kovalenko <
> > > > > > > > > >
> > > > > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration
> doesn't
> > > show useful
> > > > > > > > > >
> > > > > > > > > > information.
> > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > > blocking cache operations.
> > > > > > > > > >
> > > > > > > > > > Not
> > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > > What information gives to an
> end-user
> > > timestamp of
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For
> what
> > > analysis it can be
> > > > > > > > > >
> > > > > > > > > > used and
> > > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita
> > > Amelchev <
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > This time already can be obtained
> > > from the
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> > metrics.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I can
> > > rework recently added
> > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not
> > > released yet). Seems for
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > > useless in case of non-blocking
> > PME.
> > > > > > > > > > > > > > > > > > > > > > Lets name it
> > > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > blocking started (minimal value
> of
> > > cluster nodes) and 0 if
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56,
> Pavel
> > > Kovalenko <
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thank you for working on this.
> > > What do you think if we
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > > value of metric to a long value
> > > that represents time in
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX and
> > now
> > > metrics are periodically
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > > some backend it can give a more
> > > clear picture of how much
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > > resuming cache operations
> instead
> > > of instant boolean
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41,
> > > Nikita Amelchev <
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > > correctly. PME can be without
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > > operations. For example,
> client
> > > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric -
> > > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > > metrics will show influence
> of
> > > the PME on cluster and user
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this
> > (Bot
> > > visa is green). [1] Can
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58,
> > > Nikolay Izhikov <
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > I think administrator of
> > Ignite
> > > cluster should be able to
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including non
> > > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57
> > > +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly
> PME
> > > time and not so useful
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show
> exactly
> > > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no
> blocking,
> > > it's a good PME and I see
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to it
> :)
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> > 2:50
> > > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> > postpone
> > > implementation of this
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation
> > of
> > > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can
> implement
> > > this metrics as a single
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at
> 13:47
> > > +0300, Anton Vinogradov
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we
> need
> > > now is a 1 simple metric:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can
> > be
> > > extracted from logs now
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019
> at
> > > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go
> > > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019,
> > > 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add
> > > some useful metrics about the
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now,
> the
> > > duration of PME stages
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > > obtained using JMX or other
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list
> of
> > > local node metrics that
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of
> > > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1.
> initialVersion.
> > > Topology version that
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time
> > > PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent.
> Event
> > > that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> > > partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> > > translations on a previous
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> > > sendSingleMessageTime. Time when a node
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> > > recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime.
> Time
> > > PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> > started
> > > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics
> help
> > > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME
> was
> > > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long
> awaited
> > > for all updates was
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node
> blocks
> > > PME (didn't send a single
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered
> > PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best wishes,
> > > > > > > > Amelchev Nikita
> > > >
> > > >
> > > >
> > >
> >
>

Re: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

Pavel,

Have you had a chance to look at the HistogramMetric source?
It is in master now.
Looking at the source would be better than my explanation :)

We should count the PMEs that block operations, bucketed by how long they
blocked. For example: [less than 50, less than 250, less than 1000, more
than 1000] millis.
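
To make the bucketing concrete, here is a minimal sketch in plain Java. It is
not Ignite's HistogramMetric (look at that class in master for the real
thing); the class name BlockingDurationHistogram and its methods are
hypothetical and only illustrate the idea. It assumes each bound is an
exclusive upper limit, so a blocking period of exactly 1000 ms lands in the
last bucket.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Hypothetical stand-in for illustration only; NOT Ignite's HistogramMetric.
 * Counts blocking periods per duration bucket.
 */
final class BlockingDurationHistogram {
    /** Ascending bucket bounds in milliseconds, e.g. {50, 250, 1000}. */
    private final long[] bounds;

    /** One counter per bound plus one overflow bucket for the longest periods. */
    private final AtomicLongArray buckets;

    BlockingDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
    }

    /** Called once when a blocking PME finishes, with its blocking duration. */
    void record(long durationMs) {
        int i = 0;

        // Find the first bound that is strictly greater than the duration.
        while (i < bounds.length && durationMs >= bounds[i])
            i++;

        buckets.incrementAndGet(i);
    }

    /** Copy of the current counters, in bucket order. */
    long[] snapshot() {
        long[] res = new long[buckets.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);

        return res;
    }
}
```

With bounds (50, 250, 1000), recording blocking periods of 10, 40, 120, 900
and 5000 ms would leave the counters at [2, 1, 1, 1]. A monitoring system can
then graph each bucket over time to see whether long blocking PMEs are
becoming more frequent.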

Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <jo...@gmail.com>:

> Nikolay,
>
> Could you please explain in more detail what the structure of the PME
> histogram will be?
>
> Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <ni...@apache.org>:
>
> > Hello, Nikita.
> >
> > I think
> >
> > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > all blocking durations that happen after node starts.
> >
> > No, we don't need it.
> >
> > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> >
> > Yes, we need it.
> >
> > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > Igniters,
> > >
> > > All want to see the cacheOperationsBlockedDuration metric that will
> > > show current blocking duration or 0 if there is no blocking right now.
> > >
> > > Do we need the following metrics? It seems one of them will be
> > superfluous.
> > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > all blocking durations that happen after node starts.
> > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > User will be able to configure bounds.
> > >
> > > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <ni...@apache.org>:
> > > >
> > > > Guys.
> > > >
> > > > I think we should go with the 2 metrics
> > > >
> > > >         * current PME duration (resets on finish)
> > > >
> > > >                 This metric is required for alerting (or automatic
> > > >                 actions) on long PME.
> > > >
> > > >         * PME duration histogram (value added to metrics on PME finish)
> > > >                 This metric is required for:
> > > >                         * Quick PME trend analysis
> > > >                         * Quick PME history analysis
> > > >
> > > >
> > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > Nikita and Maxim,
> > > > >
> > > > > > What if we just update the behaviour of the current getCurrentPmeDuration
> > > > > > metric to show durations only for blocking PMEs?
> > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > >
> > > > > > No other changes will be required.
> > > > > >
> > > > > > WDYT?
> > > > >
> > > > > I agree with these two metrics. I also think that current
> > > > > getCurrentPmeDuration will become redundant.
> > > > >
> > > > > Anton,
> > > > >
> > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > "monitoring".
> > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > progress and so on.
> > > > >
> > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > line between monitoring and debug here. However, it's not good to add
> > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > node should be killed.
> > > > >
> > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > information from logs, but:
> > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > - It's less reliable, as log messages may change
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > Folks,
> > > > > >
> > > > > > +1 to Anton's post.
> > > > > >
> > > > > > What if we just update the behaviour of the current getCurrentPmeDuration
> > > > > > metric to show durations only for blocking PMEs?
> > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > >
> > > > > > No other changes will be required.
> > > > > >
> > > > > > WDYT?
> > > > > >
> > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > Nikolay,
> > > > > > >
> > > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > > duration, or 0 if there is no blocking right now.
> > > > > > >
> > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > blocking durations that happen after the node starts.
> > > > > > >
> > > > > > > On Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > > Nikita
> > > > > > > >
> > > > > > > > What is the difference between those two metrics?
> > > > > > > >
> > > > > > > > On Wed, Jul 24, 2019, 12:45 Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Igniters, thanks for the comments.
> > > > > > > > >
> > > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > >
> > > > > > > > > I will prepare a PR shortly.
> > > > > > > > >
> > > > > > > > > On Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > +1 with Anton decisions.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <av@apache.org>:
> > > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > It looks like we're trying to implement "extended
> debug"
> > instead of
> > > > > > > > > > > "monitoring".
> > > > > > > > > > > It should not be interesting for real admin what phase
> > of PME is in
> > > > > > > > > > > progress and so on.
> > > > > > > > > > > Interested metrics are
> > > > > > > > > > > - total blocked time (will be used for real SLA
> counting)
> > > > > > > > > > > - are we blocked right now (shows we have an SLA
> > degradation right now)
> > > > > > > > > > > Duration of the current blocking period can be easily
> > presented using
> > > > > > > > >
> > > > > > > > > any
> > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > Initial true will means "period start", precision will
> > be a result of
> > > > > > > > > > > checks frequency.
> > > > > > > > > > > Anyway, I'm ok to have current metric presented with
> > long, where long
> > > > > > > > >
> > > > > > > > > is a
> > > > > > > > > > > duration, see no reason, but ok :)
> > > > > > > > > > >
> > > > > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > > > > deployment improving and can (should) be taken from
> logs
> > at the analysis
> > > > > > > > > > > phase.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <
> > ivan.glukos@gmail.com >
> > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > >
> > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > >
> > > > > > > > > > > > > 1. initialVersion. Topology version that initiates
> > the exchange.
> > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > finished waiting for
> > > > > > > > >
> > > > > > > > > all
> > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a
> > single message.
> > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> received
> > a full message.
> > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > >
> > > > > > > > > > > > Every metric from Nikita's list looks useful and
> > simple to implement.
> > > > > > > > > > > > I think that it would be better to change format of
> > metrics 4, 5, 6
> > > > > > > > >
> > > > > > > > > and
> > > > > > > > > > > > 7 a bit: we can keep only difference between time of
> > previous event
> > > > > > > > >
> > > > > > > > > and
> > > > > > > > > > > > time of corresponding event. Such metrics would be
> > easier to perceive:
> > > > > > > > > > > > they answer to specific questions "how much time did
> > partition release
> > > > > > > > > > > > take?" or "how much time did awaiting of distributed
> > phase end take?".
> > > > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to
> > monitoring system,
> > > > > > > > > > > > graphs will show how different stages times change
> > from one PME to
> > > > > > > > >
> > > > > > > > > another.
> > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I
> > see no reason to
> > > > > > > > >
> > > > > > > > > have
> > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > >
> > > > > > > > > > > > Agree with Anton here. These metrics should be
> > measured only for true
> > > > > > > > > > > > distributed exchange. Saving results for client
> > leave/join PMEs will
> > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > >
> > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > I still don't understand why instant value
> > indicating that
> > > > > > > > >
> > > > > > > > > operations are
> > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > Duration time since blocking has started looks more
> > appropriate and
> > > > > > > > > > > >
> > > > > > > > > > > > useful.
> > > > > > > > > > > > > It gives more information while semantic is left
> the
> > same.
> > > > > > > > > > > >
> > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated
> block
> > time" and
> > > > > > > > > > > > "current PME block time" metrics are useful. Growth
> of
> > accumulated
> > > > > > > > > > > > metric for specific period of time (should be easy to
> > check via
> > > > > > > > > > > > monitoring system graph) will show for how much
> > business operations
> > > > > > > > >
> > > > > > > > > were
> > > > > > > > > > > > blocked in total, and non-zero current metric will
> > show that we are
> > > > > > > > > > > > experiencing issues right now. Boolean metric "are we
> > blocked right
> > > > > > > > >
> > > > > > > > > now"
> > > > > > > > > > > > is not needed as it's obviously can be inferred from
> > "current PME
> > > > > > > > >
> > > > > > > > > block
> > > > > > > > > > > > time".
> > > > > > > > > > > >
> > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > >
> > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > > I still don't understand why instant value
> > indicating that
> > > > > > > > >
> > > > > > > > > operations are
> > > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > > Duration time since blocking has started looks more
> > appropriate and
> > > > > > > > > > > >
> > > > > > > > > > > > useful.
> > > > > > > > > > > > > It gives more information while semantic is left
> the
> > same.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev <
> > nsamelchev@gmail.com
> > > > > > > > > >
> > > > > > > > > > :
> > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All previous suggestions have some disadvantages.
> > It can be several
> > > > > > > > > > > > > > exchanges between two metric updates and fast
> > exchange can rewrite
> > > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We can introduce a metric of total blocking
> > duration that will
> > > > > > > > > > > > > > accumulate at the end of the exchange. So, users
> > will get actual
> > > > > > > > > > > > > > information about how long operations were
> > blocked. Cluster metric
> > > > > > > > > > > > > > will be a maximum of local nodes metrics. And we
> > need a boolean
> > > > > > > > >
> > > > > > > > > metric
> > > > > > > > > > > > > > that will indicate realtime status. It needs
> > because of duration
> > > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So I propose to change the current metric that
> not
> > released to the
> > > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and
> to
> > add the
> > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > пн, 22 июл. 2019 г. в 09:27, Anton Vinogradov <
> > av@apache.org >:
> > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Still see no reason to replace boolean with
> long.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay
> Izhikov <
> > > > > > > > >
> > > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Value exported based on SPI settings, not
> > in the moment it
> > > > > > > > >
> > > > > > > > > changed.
> > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export start
> > time, we should
> > > > > > > > >
> > > > > > > > > also
> > > > > > > > > > > > > > export
> > > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton Vinogradov <
> > av@apache.org >:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to
> > count the durations.
> > > > > > > > > > > > > > > > > Sine monitoring system checks metrics
> > periodically it will know
> > > > > > > > >
> > > > > > > > > the
> > > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> > Kovalenko <
> > > > > > > > >
> > > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For
> > the metric name, I
> > > > > > > > >
> > > > > > > > > suggest
> > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I
> think
> > it cleaner
> > > > > > > > >
> > > > > > > > > represents
> > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > > duration to have better correlation when
> > cache operations were
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a
> > calculated value as you
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > > For metrics are exported to some backend
> > (IEP-35) a counter
> > > > > > > > >
> > > > > > > > > can be
> > > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > > The counter is incremented by blocking
> > time after blocking has
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> > Amelchev <
> > > > > > > > >
> > > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > > how much time we wait for resuming
> > cache operations
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean
> > timestamp or duration
> > > > > > > > >
> > > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > > What do you think if we change the
> > boolean value of metric
> > > > > > > > >
> > > > > > > > > to a
> > > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > > value that represents time in
> > milliseconds when operations
> > > > > > > > >
> > > > > > > > > were
> > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > This time can be calculated as
> > (currentTime -
> > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of
> > timestamp.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Duration will be more understandable.
> > It'll be something like
> > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I
> > haven't come up with a
> > > > > > > > >
> > > > > > > > > better
> > > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 18:30, Pavel
> > Kovalenko <
> > > > > > > > >
> > > > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't
> > show useful
> > > > > > > > >
> > > > > > > > > information.
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > > PME side effect for end-users is
> > blocking cache operations.
> > > > > > > > >
> > > > > > > > > Not
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > > What information gives to an end-user
> > timestamp of
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what
> > analysis it can be
> > > > > > > > >
> > > > > > > > > used and
> > > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48, Nikita
> > Amelchev <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > This time already can be obtained
> > from the
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme
> metrics.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > As an alternative solution, I can
> > rework recently added
> > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not
> > released yet). Seems for
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > > useless in case of non-blocking
> PME.
> > > > > > > > > > > > > > > > > > > > > Lets name it
> > timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > blocking started (minimal value of
> > cluster nodes) and 0 if
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56, Pavel
> > Kovalenko <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thank you for working on this.
> > What do you think if we
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > > value of metric to a long value
> > that represents time in
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX and
> now
> > metrics are periodically
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > > some backend it can give a more
> > clear picture of how much
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > > resuming cache operations instead
> > of instant boolean
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41,
> > Nikita Amelchev <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > For now, we have the
> > getCurrentPmeDuration() metric that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > does
> > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> > correctly. PME can be without
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > > operations. For example, client
> > node join/leave events.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric -
> > isOperationsBlockedByPme().
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > > metrics will show influence of
> > the PME on cluster and user
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this
> (Bot
> > visa is green). [1] Can
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > [1]
> > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в 14:58,
> > Nikolay Izhikov <
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > I think administator of
> Ignite
> > cluster should be able to
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > Ignite process, including non
> > blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57
> > +0300, Anton Vinogradov пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> > getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME
> > time and not so useful
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > because
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show exactly
> > blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking,
> > it's a good PME and I see
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> 2:50
> > PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to
> postpone
> > implementation of this
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation
> of
> > new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I think we can implement
> > this metrics as a single
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 13:47
> > +0300, Anton Vinogradov
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need
> > now is a 1 simple metric:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can
> be
> > extracted from logs now
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> > 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go
> > ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля 2019 г.,
> > 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add
> > some useful metrics about the
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the
> > duration of PME stages
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> > obtained using JMX or other
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of
> > local node metrics that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of
> > current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion.
> > Topology version that
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time
> > PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event
> > that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> > partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> > translations on a previous
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> > sendSingleMessageTime. Time when a node
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> > recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time
> > PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME
> started
> > all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help
> > to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was
> > (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited
> > for all updates was
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks
> > PME (didn't send a single
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered
> PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best wishes,
> > > > > > > > > Amelchev Nikita
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best wishes,
> > > > > > > Amelchev Nikita
> > >
> > >
> > >
> >
>

Re: Partition map exchange metrics

Posted by Pavel Kovalenko <jo...@gmail.com>.

Nikolay,

Could you please explain in more detail what the structure of the PME histogram will be?

Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <ni...@apache.org> wrote:

> Hello, Nikita.
>
> I think
>
> > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > all blocking durations that happen after node starts.
>
> No, we don't need it.
>
> > 2. Blocking duration histogram. Based on the HistogramMetric class.
>
> Yes, we need it.
>
> On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > Igniters,
> >
> > Everyone wants to see the cacheOperationsBlockedDuration metric, which will
> > show the current blocking duration, or 0 if there is no blocking right now.
> >
> > Do we need the following metrics? It seems one of them will be superfluous.
> > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > all blocking durations that happen after node starts.
> > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > User will be able to configure bounds.
> >
> > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <ni...@apache.org> wrote:
> > >
> > > Guys.
> > >
> > > I think we should go with the 2 metrics
> > >
> > >         * current PME duration (resets on finish)
> > >
> > >                 This metric is required for alerting (or automatic actions) on long PME.
> > >
> > >         * PME duration histogram (value added to metrics on PME finish)
> > >                 This metric is required for:
> > >                         * Quick PME trend analysis
> > >                         * Quick PME history analysis
> > >
> > >
> > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > Nikita and Maxim,
> > > >
> > > > > What if we just update the behaviour of the current getCurrentPmeDuration metric
> > > > > to show durations only for blocking PMEs?
> > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > >
> > > > > No other changes will be required.
> > > > >
> > > > > WDYT?
> > > >
> > > > I agree with these two metrics. I also think that current
> > > > getCurrentPmeDuration will become redundant.
> > > >
> > > > Anton,
> > > >
> > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > "monitoring".
> > > > > It should not be interesting for real admin what phase of PME is in
> > > > > progress and so on.
> > > >
> > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > line between monitoring and debug here. However, it's not good to add
> > > > monitoring capabilities only for the scenario when everything is alright.
> > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > to return the cluster to a working state. Metrics about stage completion
> > > > times may really help here: e.g. if one specific node hasn't completed
> > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > node should be killed.
> > > >
> > > > Of course, it's possible to build a monitoring system that extracts this
> > > > information from logs, but:
> > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > - It's less reliable, as log messages may change
> > > > Best Regards,
> > > > Ivan Rakov
> > > >
> > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > Folks,
> > > > >
> > > > > +1 with Anton post.
> > > > >
> > > > > What if we just update the behaviour of the current getCurrentPmeDuration metric
> > > > > to show durations only for blocking PMEs?
> > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > >
> > > > > No other changes will be required.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > Nikolay,
> > > > > >
> > > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > > duration, or 0 if there is no blocking right now.
> > > > > >
> > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > blocking durations that happen after the node starts.
> > > > > >
> > > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhikov@apache.org> wrote:
> > > > > > > Nikita
> > > > > > >
> > > > > > > What is the difference between those two metrics?
> > > > > > >
> > > > > > > Wed, 24 Jul 2019, 12:45, Nikita Amelchev <nsamelchev@gmail.com> wrote:
> > > > > > >
> > > > > > > > Igniters, thanks for comments.
> > > > > > > >
> > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > >
> > > > > > > > I will prepare a PR shortly.
> > > > > > > >
> > > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid> wrote:
> > > > > > > > >
> > > > > > > > > +1 with Anton decisions.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00, from Anton Vinogradov <av@apache.org>:
> > > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > "monitoring".
> > > > > > > > > > A real admin should not need to care which phase of PME is in
> > > > > > > > > > progress and so on.
> > > > > > > > > > The metrics of interest are
> > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > The duration of the current blocking period can easily be derived by any
> > > > > > > > > > modern monitoring tool through regular checks.
> > > > > > > > > > The first "true" marks the period start; precision is determined by the
> > > > > > > > > > check frequency.
> > > > > > > > > > Anyway, I'm ok with presenting the current metric as a long, where the long
> > > > > > > > > > is a duration; I see no reason for it, but ok :)
> > > > > > > > > >
> > > > > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > > > > deployment and can (should) be taken from logs at the analysis
> > > > > > > > > > phase.
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glukos@gmail.com> wrote:
> > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > >
> > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > >
> > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > >
> > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > >
> > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > I think that it would be better to change format of metrics 4, 5, 6 and
> > > > > > > > > > > 7 a bit: we can keep only difference between time of previous event and
> > > > > > > > > > > time of corresponding event. Such metrics would be easier to perceive:
> > > > > > > > > > > they answer to specific questions "how much time did partition release
> > > > > > > > > > > take?" or "how much time did awaiting of distributed phase end take?".
> > > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > > > > > > > > > > graphs will show how different stages times change from one PME to
> > > > > > > > > > > another.
> > > > > > > > > > > > When PME cause no blocking, it's a good PME and I
> see no reason to
> > > > > > > >
> > > > > > > > have
> > > > > > > > > > > > monitoring related to it
> > > > > > > > > > >
> > > > > > > > > > > Agree with Anton here. These metrics should be
> measured only for true
> > > > > > > > > > > distributed exchange. Saving results for client
> leave/join PMEs will
> > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > >
> > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > I still don't understand why instant value
> indicating that
> > > > > > > >
> > > > > > > > operations are
> > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > Duration time since blocking has started looks more
> appropriate and
> > > > > > > > > > >
> > > > > > > > > > > useful.
> > > > > > > > > > > > It gives more information while semantic is left the
> same.
> > > > > > > > > > >
> > > > > > > > > > > Totally agree with Pavel here. Both "accumulated block
> time" and
> > > > > > > > > > > "current PME block time" metrics are useful. Growth of
> accumulated
> > > > > > > > > > > metric for specific period of time (should be easy to
> check via
> > > > > > > > > > > monitoring system graph) will show for how much
> business operations
> > > > > > > >
> > > > > > > > were
> > > > > > > > > > > blocked in total, and non-zero current metric will
> show that we are
> > > > > > > > > > > experiencing issues right now. Boolean metric "are we
> blocked right
> > > > > > > >
> > > > > > > > now"
> > > > > > > > > > > is not needed as it's obviously can be inferred from
> "current PME
> > > > > > > >
> > > > > > > > block
> > > > > > > > > > > time".
> > > > > > > > > > >
> > > > > > > > > > > Best Regards,
> > > > > > > > > > > Ivan Rakov
> > > > > > > > > > >
> > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > Nikita,
> > > > > > > > > > > >
> > > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > > I still don't understand why instant value
> indicating that
> > > > > > > >
> > > > > > > > operations are
> > > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > > Duration time since blocking has started looks more
> appropriate and
> > > > > > > > > > >
> > > > > > > > > > > useful.
> > > > > > > > > > > > It gives more information while semantic is left the
> same.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev <
> nsamelchev@gmail.com
> > > > > > > > >
> > > > > > > > > :
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > All previous suggestions have some disadvantages.
> It can be several
> > > > > > > > > > > > > exchanges between two metric updates and fast
> exchange can rewrite
> > > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We can introduce a metric of total blocking
> duration that will
> > > > > > > > > > > > > accumulate at the end of the exchange. So, users
> will get actual
> > > > > > > > > > > > > information about how long operations were
> blocked. Cluster metric
> > > > > > > > > > > > > will be a maximum of local nodes metrics. And we
> need a boolean
> > > > > > > >
> > > > > > > > metric
> > > > > > > > > > > > > that will indicate realtime status. It needs
> because of duration
> > > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So I propose to change the current metric that not
> released to the
> > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and to
> add the
> > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > >
> > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > >
> > > > > > > > > > > > > пн, 22 июл. 2019 г. в 09:27, Anton Vinogradov <
> av@apache.org >:
> > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
> > > > > > > >
> > > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Value exported based on SPI settings, not
> in the moment it
> > > > > > > >
> > > > > > > > changed.
> > > > > > > > > > > > > > > 2. Clock synchronisation - if we export start
> time, we should
> > > > > > > >
> > > > > > > > also
> > > > > > > > > > > > > export
> > > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton Vinogradov <
> av@apache.org >:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to
> count the durations.
> > > > > > > > > > > > > > > > Sine monitoring system checks metrics
> periodically it will know
> > > > > > > >
> > > > > > > > the
> > > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel
> Kovalenko <
> > > > > > > >
> > > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For
> the metric name, I
> > > > > > > >
> > > > > > > > suggest
> > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think
> it cleaner
> > > > > > > >
> > > > > > > > represents
> > > > > > > > > > > > > what
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > >
> > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > > duration to have better correlation when
> cache operations were
> > > > > > > > > > > > >
> > > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a
> calculated value as you
> > > > > > > > > > > > >
> > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > > For metrics are exported to some backend
> (IEP-35) a counter
> > > > > > > >
> > > > > > > > can be
> > > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > > The counter is incremented by blocking
> time after blocking has
> > > > > > > > > > > > >
> > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita
> Amelchev <
> > > > > > > >
> > > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > how much time we wait for resuming
> cache operations
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean
> timestamp or duration
> > > > > > > >
> > > > > > > > here?
> > > > > > > > > > > > > > > > > > > > What do you think if we change the
> boolean value of metric
> > > > > > > >
> > > > > > > > to a
> > > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > > value that represents time in
> milliseconds when operations
> > > > > > > >
> > > > > > > > were
> > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > This time can be calculated as
> (currentTime -
> > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of
> timestamp.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Duration will be more understandable.
> It'll be something like
> > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I
> haven't come up with a
> > > > > > > >
> > > > > > > > better
> > > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 18:30, Pavel
> Kovalenko <
> > > > > > > >
> > > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't
> show useful
> > > > > > > >
> > > > > > > > information.
> > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > > PME side effect for end-users is
> blocking cache operations.
> > > > > > > >
> > > > > > > > Not
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > > What information gives to an end-user
> timestamp of
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what
> analysis it can be
> > > > > > > >
> > > > > > > > used and
> > > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48, Nikita
> Amelchev <
> > > > > > > > > > > > >
> > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > This time already can be obtained
> from the
> > > > > > > > > > > > >
> > > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > As an alternative solution, I can
> rework recently added
> > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not
> released yet). Seems for
> > > > > > > > > > > > >
> > > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > > useless in case of non-blocking PME.
> > > > > > > > > > > > > > > > > > > > Lets name it
> timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > >
> > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > blocking started (minimal value of
> cluster nodes) and 0 if
> > > > > > > > > > > > >
> > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56, Pavel
> Kovalenko <
> > > > > > > > > > > > >
> > > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thank you for working on this.
> What do you think if we
> > > > > > > > > > > > >
> > > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > > value of metric to a long value
> that represents time in
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > > Since we have not only JMX and now
> metrics are periodically
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > some backend it can give a more
> clear picture of how much
> > > > > > > > > > > > >
> > > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > > resuming cache operations instead
> of instant boolean
> > > > > > > > > > > > >
> > > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41,
> Nikita Amelchev <
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > For now, we have the
> getCurrentPmeDuration() metric that
> > > > > > > > > > > > >
> > > > > > > > > > > > > does
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > > influence on the cluster
> correctly. PME can be without
> > > > > > > > > > > > >
> > > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > > operations. For example, client
> node join/leave events.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I suggest add new metric -
> isOperationsBlockedByPme().
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > > metrics will show influence of
> the PME on cluster and user
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot
> visa is green). [1] Can
> > > > > > > > > > > > >
> > > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > [1]
> https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в 14:58,
> Nikolay Izhikov <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > I think administator of Ignite
> cluster should be able to
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > Ignite process, including non
> blocking PME.
> > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57
> +0300, Anton Vinogradov пишет:
> > > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > > Found PME metric -
> getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME
> time and not so useful
> > > > > > > > > > > > >
> > > > > > > > > > > > > because
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > > The goal it so show exactly
> blocking period.
> > > > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking,
> it's a good PME and I see
> > > > > > > > > > > > >
> > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50
> PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone
> implementation of this
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation of
> new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > I think we can implement
> this metrics as a single
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 13:47
> +0300, Anton Vinogradov
> > > > > > > > > > > > >
> > > > > > > > > > > > > пишет:
> > > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need
> now is a 1 simple metric:
> > > > > > > > > > > > >
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be
> extracted from logs now
> > > > > > > > > > > > >
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at
> 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go
> ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля 2019 г.,
> 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add
> some useful metrics about the
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the
> duration of PME stages
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be
> obtained using JMX or other
> > > > > > > > > > > > >
> > > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of
> local node metrics that
> > > > > > > > > > > > >
> > > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of
> current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion.
> Topology version that
> > > > > > > > > > > > >
> > > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time
> PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event
> that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 4.
> partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and
> translations on a previous
> > > > > > > > > > > > >
> > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 5.
> sendSingleMessageTime. Time when a node
> > > > > > > > > > > > >
> > > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 6.
> recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time
> PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME started
> all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help
> to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was
> (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited
> for all updates was
> > > > > > > > > > > > >
> > > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks
> PME (didn't send a single
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Zhenya Stanilovsky
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best wishes,
> > > > > > > > Amelchev Nikita
> > > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best wishes,
> > > > > > Amelchev Nikita
> >
> >
> >
>

Re: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

Hello, Nikita.

Here is what I think:

> 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> all blocking durations that happen after node starts.

No, we don't need it.

> 2. Blocking duration histogram. Based on the HistogramMetric class.

Yes, we need it.
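
For illustration only, a blocking duration histogram with user-configured bounds could look roughly like the sketch below. This is a standalone stand-in written for this thread, not Ignite's HistogramMetric class: the class name BlockingDurationHistogram and its methods record() and snapshot() are hypothetical, and the bounds are assumed to be upper limits in milliseconds, sorted ascending.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch of a bounded histogram; NOT Ignite's HistogramMetric. */
final class BlockingDurationHistogram {
    /** Upper bounds of the buckets in milliseconds; assumed sorted ascending. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray buckets;

    BlockingDurationHistogram(long[] bounds) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
    }

    /** Adds one finished blocking period to the first bucket whose bound is not exceeded. */
    void record(long durationMs) {
        int i = 0;

        while (i < bounds.length && durationMs > bounds[i])
            i++;

        buckets.incrementAndGet(i);
    }

    /** Returns a copy of the current bucket counters. */
    long[] snapshot() {
        long[] res = new long[buckets.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);

        return res;
    }
}
```

The idea is that record() would be called once per blocking period, when the blocking ends, so a monitoring system can see how durations are distributed over time.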

On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> Igniters,
> 
> Everyone wants to see the cacheOperationsBlockedDuration metric, which will
> show the current blocking duration, or 0 if there is no blocking right now.
> 
> Do we need the following metrics? It seems one of them will be superfluous.
> 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> all blocking durations that happen after node starts.
> 2. Blocking duration histogram. Based on the HistogramMetric class.
> User will be able to configure bounds.
> 
> Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <ni...@apache.org>:
> > 
> > Guys.
> > 
> > I think we should go with the 2 metrics
> > 
> >         * current PME duration (resets on finish)
> > 
> >                 This metric required for alerting(or automatic actions) on long PME.
> > 
> >         * PME duration histogram (value added to metrics on PME finish)
> >                 This metric required for an:
> >                         * Quick PME trend analysis
> >                         * Quick PME history analysis
> > 
> > 
> > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > Nikita and Maxim,
> > > 
> > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > to show durations only for blocking PMEs?
> > > > Remain it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > 
> > > > No other changes will require.
> > > > 
> > > > WDYT?
> > > 
> > > I agree with these two metrics. I also think that current
> > > getCurrentPmeDuration will become redundant.
> > > 
> > > Anton,
> > > 
> > > > It looks like we're trying to implement "extended debug" instead of
> > > > "monitoring".
> > > > It should not be interesting for real admin what phase of PME is in
> > > > progress and so on.
> > > 
> > > PME is mission critical cluster process. I agree that there's a fine
> > > line between monitoring and debug here. However, it's not good to add
> > > monitoring capabilities only for scenario when everything is alright.
> > > If PME will really hang, *real admin* will be extremely interested how
> > > to return cluster back to working state. Metrics about stages completion
> > > time may really help here: e.g. if one specific node hasn't completed
> > > stage X while rest of the cluster has, it can be a signal that this node
> > > should be killed.
> > > 
> > > Of course, it's possible to build monitoring system that extract this
> > > information from logs, but:
> > > - It's more resource intensive as it requires parsing logs for all the time
> > > - It's less reliable as log messages may change
> > > 
> > > Best Regards,
> > > Ivan Rakov
> > > 
> > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > Folks,
> > > > 
> > > > +1 with Anton post.
> > > > 
> > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > to show durations only for blocking PMEs?
> > > > Remain it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > 
> > > > No other changes will require.
> > > > 
> > > > WDYT?
> > > > 
> > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <ns...@gmail.com> wrote:
> > > > > Nikolay,
> > > > > 
> > > > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > > > duration, or 0 if there is no blocking right now.
> > > > > 
> > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > blocking durations that happen after node starts.
> > > > > 
> > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
> > > > > > Nikita
> > > > > > 
> > > > > > What is the difference between those two metrics?
> > > > > > 
> > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
> > > > > > 
> > > > > > > Igniters, thanks for comments.
> > > > > > > 
> > > > > > > The discussion shows that we need only two metrics for now:
> > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > 
> > > > > > > I will prepare a PR shortly.
> > > > > > > 
> > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid
> > > > > > > > :
> > > > > > > > 
> > > > > > > > +1 with Anton decisions.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00, from Anton Vinogradov <av...@apache.org>:
> > > > > > > > > 
> > > > > > > > > Folks,
> > > > > > > > > 
> > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > "monitoring".
> > > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > > progress and so on.
> > > > > > > > > Interested metrics are
> > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > Duration of the current blocking period can be easily presented using
> > > > > > > 
> > > > > > > any
> > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > Initial true will means "period start", precision will be a result of
> > > > > > > > > checks frequency.
> > > > > > > > > Anyway, I'm ok to have current metric presented with long, where long
> > > > > > > 
> > > > > > > is a
> > > > > > > > > duration, see no reason, but ok :)
> > > > > > > > > 
> > > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > > deployment improving and can (should) be taken from logs at the analysis
> > > > > > > > > phase.
> > > > > > > > > 
> > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com >
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > > > Folks, let me step in.
> > > > > > > > > > 
> > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > 
> > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
> > > > > > > 
> > > > > > > all
> > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > 
> > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > 
> > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > I think that it would be better to change format of metrics 4, 5, 6
> > > > > > > 
> > > > > > > and
> > > > > > > > > > 7 a bit: we can keep only difference between time of previous event
> > > > > > > 
> > > > > > > and
> > > > > > > > > > time of corresponding event. Such metrics would be easier to perceive:
> > > > > > > > > > they answer to specific questions "how much time did partition release
> > > > > > > > > > take?" or "how much time did awaiting of distributed phase end take?".
> > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > > > > > > > > > graphs will show how different stages times change from one PME to
> > > > > > > 
> > > > > > > another.
> > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to
> > > > > > > 
> > > > > > > have
> > > > > > > > > > > monitoring related to it
> > > > > > > > > > 
> > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > just complicate monitoring.
> > > > > > > > > > 
> > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > I still don't understand why instant value indicating that
> > > > > > > 
> > > > > > > operations are
> > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > > > 
> > > > > > > > > > useful.
> > > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > > 
> > > > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > > > "current PME block time" metrics are useful. Growth of accumulated
> > > > > > > > > > metric for specific period of time (should be easy to check via
> > > > > > > > > > monitoring system graph) will show for how much business operations
> > > > > > > 
> > > > > > > were
> > > > > > > > > > blocked in total, and non-zero current metric will show that we are
> > > > > > > > > > experiencing issues right now. Boolean metric "are we blocked right
> > > > > > > 
> > > > > > > now"
> > > > > > > > > > is not needed as it's obviously can be inferred from "current PME
> > > > > > > 
> > > > > > > block
> > > > > > > > > > time".
> > > > > > > > > > 
> > > > > > > > > > Best Regards,
> > > > > > > > > > Ivan Rakov
> > > > > > > > > > 
> > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > Nikita,
> > > > > > > > > > > 
> > > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > > I still don't understand why instant value indicating that
> > > > > > > 
> > > > > > > operations are
> > > > > > > > > > > blocked should be boolean.
> > > > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > > > 
> > > > > > > > > > useful.
> > > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com
> > > > > > > > 
> > > > > > > > :
> > > > > > > > > > > > Folks,
> > > > > > > > > > > > 
> > > > > > > > > > > > All previous suggestions have some disadvantages. It can be several
> > > > > > > > > > > > exchanges between two metric updates and fast exchange can rewrite
> > > > > > > > > > > > previous long exchange.
> > > > > > > > > > > > 
> > > > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > > > information about how long operations were blocked. Cluster metric
> > > > > > > > > > > > will be a maximum of local nodes metrics. And we need a boolean
> > > > > > > 
> > > > > > > metric
> > > > > > > > > > > > that will indicate realtime status. It needs because of duration
> > > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > > > 
> > > > > > > > > > > > So I propose to change the current metric that not released to the
> > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > 
> > > > > > > > > > > > WDYT?
> > > > > > > > > > > > 
> > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
> > > > > > > 
> > > > > > > nizhikov@apache.org >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 1. Value exported based on SPI settings, not in the moment it
> > > > > > > 
> > > > > > > changed.
> > > > > > > > > > > > > > 2. Clock synchronisation - if we export start time, we should
> > > > > > > 
> > > > > > > also
> > > > > > > > > > > > export
> > > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > > > Since monitoring system checks metrics periodically it will know
> > > > > > > 
> > > > > > > the
> > > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <
> > > > > > > 
> > > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the metric name, I
> > > > > > > 
> > > > > > > suggest
> > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think it cleaner
> > > > > > > 
> > > > > > > represents
> > > > > > > > > > > > what
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > > > 
> > > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > > duration to have better correlation when cache operations were
> > > > > > > > > > > > 
> > > > > > > > > > > > blocked
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > > For instant view (like in JMX bean) a calculated value as you
> > > > > > > > > > > > 
> > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > > For metrics are exported to some backend (IEP-35) a counter
> > > > > > > 
> > > > > > > can be
> > > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > > The counter is incremented by blocking time after blocking has
> > > > > > > > > > > > 
> > > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <
> > > > > > > 
> > > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration
> > > > > > > 
> > > > > > > here?
> > > > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric
> > > > > > > 
> > > > > > > to a
> > > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > > value that represents time in milliseconds when operations
> > > > > > > 
> > > > > > > were
> > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like
> > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't come up with a
> > > > > > > 
> > > > > > > better
> > > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <
> > > > > > > 
> > > > > > > jokserfn@gmail.com
> > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful
> > > > > > > 
> > > > > > > information.
> > > > > > > > > > > > The
> > > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > > PME side effect for end-users is blocking cache operations.
> > > > > > > 
> > > > > > > Not
> > > > > > > > > > > > all
> > > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > > What information gives to an end-user timestamp of
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what analysis it can be
> > > > > > > 
> > > > > > > used and
> > > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <
> > > > > > > > > > > > 
> > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > This time already can be obtained from the
> > > > > > > > > > > > 
> > > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > As an alternative solution, I can rework recently added
> > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not released yet). Seems for
> > > > > > > > > > > > 
> > > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > > useless in case of non-blocking PME.
> > > > > > > > > > > > > > > > > > > Lets name it timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > > > 
> > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > blocking started (minimal value of cluster nodes) and 0 if
> > > > > > > > > > > > 
> > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <
> > > > > > > > > > > > 
> > > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we
> > > > > > > > > > > > 
> > > > > > > > > > > > change the
> > > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > > value of metric to a long value that represents time in
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > > Since we have not only JMX and now metrics are periodically
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > some backend it can give a more clear picture of how much
> > > > > > > > > > > > 
> > > > > > > > > > > > time we
> > > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > > resuming cache operations instead of instant boolean
> > > > > > > > > > > > 
> > > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric that
> > > > > > > > > > > > 
> > > > > > > > > > > > does
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > > influence on the cluster correctly. PME can be without
> > > > > > > > > > > > 
> > > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > > operations. For example, client node join/leave events.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > I suggest add new metric - isOperationsBlockedByPme().
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > > metrics will show influence of the PME on cluster and user
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa is green). [1] Can
> > > > > > > > > > > > 
> > > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > I think administrator of Ignite cluster should be able to
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > Ignite process, including non blocking PME.
> > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > Found PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time and not so useful
> > > > > > > > > > > > 
> > > > > > > > > > > > because
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > The goal is to show exactly the blocking period.
> > > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see
> > > > > > > > > > > > 
> > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone implementation of this
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > For now, implementation of new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > I think we can implement this metrics as a single
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is a 1 simple metric:
> > > > > > > > > > > > 
> > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now
> > > > > > > > > > > > 
> > > > > > > > > > > > and
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add some useful metrics about the
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the duration of PME stages
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be obtained using JMX or other
> > > > > > > > > > > > 
> > > > > > > > > > > > external
> > > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local node metrics that
> > > > > > > > > > > > 
> > > > > > > > > > > > help to
> > > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that
> > > > > > > > > > > > 
> > > > > > > > > > > > initiates
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > > updates and translations on a previous
> > > > > > > > > > > > 
> > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node
> > > > > > > > > > > > 
> > > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > received
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited for all updates was
> > > > > > > > > > > > 
> > > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Zhenya Stanilovsky
> > > > > > > 
> > > > > > > 
> > > > > > > --
> > > > > > > Best wishes,
> > > > > > > Amelchev Nikita
> > > > > > > 
> > > > > 
> > > > > 
> > > > > --
> > > > > Best wishes,
> > > > > Amelchev Nikita
> 
> 
>

Re: Partition map exchange metrics

Posted by Nikita Amelchev <ns...@gmail.com>.

Igniters,

Everyone wants to see the cacheOperationsBlockedDuration metric, which will
show the current blocking duration, or 0 if there is no blocking right now.

Do we need both of the following metrics? It seems one of them will be superfluous.
1. The totalCacheOperationsBlockedDuration metric, which will accumulate
all blocking durations since the node started.
2. A blocking duration histogram, based on the HistogramMetric class.
Users will be able to configure the bounds.
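
To make the semantics of the two long metrics concrete, here is a minimal sketch under my own assumptions. The class BlockedDurationTracker, the callbacks onBlockingStarted() and onBlockingFinished(), and the injected millisecond clock are all hypothetical and exist only for this example; they are not the Ignite implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical tracker for the two proposed metrics; names are illustrative only. */
final class BlockedDurationTracker {
    /** Millisecond clock, injected so the sketch can be exercised deterministically. */
    private final LongSupplier clockMs;

    /** Start of the current blocking period, or -1 when operations are not blocked. */
    private final AtomicLong blockStartMs = new AtomicLong(-1);

    /** Sum of all finished blocking periods since this tracker was created. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    BlockedDurationTracker(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    /** Assumed to be called when a blocking PME starts blocking cache operations. */
    void onBlockingStarted() {
        blockStartMs.compareAndSet(-1, clockMs.getAsLong());
    }

    /** Assumed to be called when cache operations resume. */
    void onBlockingFinished() {
        long start = blockStartMs.getAndSet(-1);

        if (start >= 0)
            totalBlockedMs.addAndGet(clockMs.getAsLong() - start);
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    long cacheOperationsBlockedDuration() {
        long start = blockStartMs.get();

        return start < 0 ? 0 : clockMs.getAsLong() - start;
    }

    /** Accumulated duration of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

In this sketch the total grows only when a blocking period finishes, so during a long block only the current-duration value changes.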

Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <ni...@apache.org>:
>
> Guys.
>
> I think we should go with the 2 metrics
>
>         * current PME duration (resets on finish)
>
>                 This metric required for alerting(or automatic actions) on long PME.
>
>         * PME duration histogram (value added to metrics on PME finish)
>                 This metric required for an:
>                         * Quick PME trend analysis
>                         * Quick PME history analysis
>
>
> On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > Nikita and Maxim,
> >
> > > What if we just update current metric getCurrentPmeDuration behaviour
> > > to show durations only for blocking PMEs?
> > > Remain it as a long value and rename it to getCacheOperationsBlockedDuration.
> > >
> > > No other changes will require.
> > >
> > > WDYT?
> >
> > I agree with these two metrics. I also think that current
> > getCurrentPmeDuration will become redundant.
> >
> > Anton,
> >
> > > It looks like we're trying to implement "extended debug" instead of
> > > "monitoring".
> > > It should not be interesting for real admin what phase of PME is in
> > > progress and so on.
> >
> > PME is a mission-critical cluster process. I agree that there's a fine
> > line between monitoring and debugging here. However, it's not good to add
> > monitoring capabilities only for the scenario when everything is alright.
> > If PME really hangs, a *real admin* will be extremely interested in how
> > to return the cluster to a working state. Metrics about stage completion
> > times may really help here: e.g. if one specific node hasn't completed
> > stage X while the rest of the cluster has, it can be a signal that this node
> > should be killed.
> >
> > Of course, it's possible to build a monitoring system that extracts this
> > information from logs, but:
> > - It's more resource-intensive, as it requires parsing logs all the time
> > - It's less reliable, as log messages may change
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > Folks,
> > >
> > > +1 with Anton post.
> > >
> > > > What if we just update the behaviour of the current getCurrentPmeDuration metric
> > > > to show durations only for blocking PMEs?
> > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > >
> > > > No other changes will be required.
> > >
> > > WDYT?
> > >
> > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <ns...@gmail.com> wrote:
> > > > Nikolay,
> > > >
> > > > > The cacheOperationsBlockedDuration metric will show current blocking
> > > > duration or 0 if there is no blocking right now.
> > > >
> > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > blocking durations that happen after node starts.
> > > >
> > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
> > > > > Nikita
> > > > >
> > > > > What is the difference between those two metrics?
> > > > >
> > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
> > > > >
> > > > > > Igniters, thanks for comments.
> > > > > >
> > > > > >  From the discussion it can be seen that we need only two metrics for now:
> > > > > > - сacheOperationsBlockedDuration (long)
> > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > >
> > > > > > I will prepare PR at the nearest time.
> > > > > >
> > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid
> > > > > > > :
> > > > > > >
> > > > > > > +1 with Anton decisions.
> > > > > > >
> > > > > > >
> > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
> > > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > "monitoring".
> > > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > > progress and so on.
> > > > > > > > Interested metrics are
> > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > Duration of the current blocking period can be easily presented using
> > > > > >
> > > > > > any
> > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > Initial true will means "period start", precision will be a result of
> > > > > > > > checks frequency.
> > > > > > > > Anyway, I'm ok to have current metric presented with long, where long
> > > > > >
> > > > > > is a
> > > > > > > > duration, see no reason, but ok :)
> > > > > > > >
> > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > deployment improving and can (should) be taken from logs at the analysis
> > > > > > > > phase.
> > > > > > > >
> > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com >
> > > > > >
> > > > > > wrote:
> > > > > > > > > Folks, let me step in.
> > > > > > > > >
> > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > >
> > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
> > > > > >
> > > > > > all
> > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > >
> > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > >
> > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > I think that it would be better to change format of metrics 4, 5, 6
> > > > > >
> > > > > > and
> > > > > > > > > 7 a bit: we can keep only difference between time of previous event
> > > > > >
> > > > > > and
> > > > > > > > > time of corresponding event. Such metrics would be easier to perceive:
> > > > > > > > > they answer to specific questions "how much time did partition release
> > > > > > > > > take?" or "how much time did awaiting of distributed phase end take?".
> > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > > > > > > > > graphs will show how different stages times change from one PME to
> > > > > >
> > > > > > another.
> > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to
> > > > > >
> > > > > > have
> > > > > > > > > > monitoring related to it
> > > > > > > > >
> > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > just complicate monitoring.
> > > > > > > > >
> > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > I still don't understand why instant value indicating that
> > > > > >
> > > > > > operations are
> > > > > > > > > > blocked should be boolean.
> > > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > >
> > > > > > > > > useful.
> > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > >
> > > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > > "current PME block time" metrics are useful. Growth of accumulated
> > > > > > > > > metric for specific period of time (should be easy to check via
> > > > > > > > > monitoring system graph) will show for how much business operations
> > > > > >
> > > > > > were
> > > > > > > > > blocked in total, and non-zero current metric will show that we are
> > > > > > > > > experiencing issues right now. Boolean metric "are we blocked right
> > > > > >
> > > > > > now"
> > > > > > > > > is not needed as it's obviously can be inferred from "current PME
> > > > > >
> > > > > > block
> > > > > > > > > time".
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Ivan Rakov
> > > > > > > > >
> > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > Nikita,
> > > > > > > > > >
> > > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > > I still don't understand why instant value indicating that
> > > > > >
> > > > > > operations are
> > > > > > > > > > blocked should be boolean.
> > > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > >
> > > > > > > > > useful.
> > > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com
> > > > > > >
> > > > > > > :
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > All previous suggestions have some disadvantages. It can be several
> > > > > > > > > > > exchanges between two metric updates and fast exchange can rewrite
> > > > > > > > > > > previous long exchange.
> > > > > > > > > > >
> > > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > > information about how long operations were blocked. Cluster metric
> > > > > > > > > > > will be a maximum of local nodes metrics. And we need a boolean
> > > > > >
> > > > > > metric
> > > > > > > > > > > that will indicate realtime status. It needs because of duration
> > > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > >
> > > > > > > > > > > So I propose to change the current metric that not released to the
> > > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > >
> > > > > > > > > > > WDYT?
> > > > > > > > > > >
> > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > > Nikolay,
> > > > > > > > > > > >
> > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
> > > > > >
> > > > > > nizhikov@apache.org >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > Anton.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Value exported based on SPI settings, not in the moment it
> > > > > >
> > > > > > changed.
> > > > > > > > > > > > > 2. Clock synchronisation - if we export start time, we should
> > > > > >
> > > > > > also
> > > > > > > > > > > export
> > > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > > > Since monitoring system checks metrics periodically it will know
> > > > > >
> > > > > > the
> > > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <
> > > > > >
> > > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the metric name, I
> > > > > >
> > > > > > suggest
> > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think it cleaner
> > > > > >
> > > > > > represents
> > > > > > > > > > > what
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > >
> > > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > > duration to have better correlation when cache operations were
> > > > > > > > > > >
> > > > > > > > > > > blocked
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > > For instant view (like in JMX bean) a calculated value as you
> > > > > > > > > > >
> > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > > For metrics are exported to some backend (IEP-35) a counter
> > > > > >
> > > > > > can be
> > > > > > > > > > > > > used.
> > > > > > > > > > > > > > > The counter is incremented by blocking time after blocking has
> > > > > > > > > > >
> > > > > > > > > > > ended.
> > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <
> > > > > >
> > > > > > nsamelchev@gmail.com
> > > > > > > > > > > > :
> > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration
> > > > > >
> > > > > > here?
> > > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric
> > > > > >
> > > > > > to a
> > > > > > > > > > > > > long
> > > > > > > > > > > > > > > > value that represents time in milliseconds when operations
> > > > > >
> > > > > > were
> > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like
> > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't come up with a
> > > > > >
> > > > > > better
> > > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <
> > > > > >
> > > > > > jokserfn@gmail.com
> > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful
> > > > > >
> > > > > > information.
> > > > > > > > > > > The
> > > > > > > > > > > > > > main
> > > > > > > > > > > > > > > > PME side effect for end-users is blocking cache operations.
> > > > > >
> > > > > > Not
> > > > > > > > > > > all
> > > > > > > > > > > > > PME
> > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > > What information gives to an end-user timestamp of
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what analysis it can be
> > > > > >
> > > > > > used and
> > > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <
> > > > > > > > > > >
> > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This time already can be obtained from the
> > > > > > > > > > >
> > > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > As an alternative solution, I can rework recently added
> > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not released yet). Seems for
> > > > > > > > > > >
> > > > > > > > > > > users it
> > > > > > > > > > > > > > > > > > useless in case of non-blocking PME.
> > > > > > > > > > > > > > > > > > Lets name it timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > >
> > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > blocking started (minimal value of cluster nodes) and 0 if
> > > > > > > > > > >
> > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <
> > > > > > > > > > >
> > > > > > > > > > >   jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we
> > > > > > > > > > >
> > > > > > > > > > > change the
> > > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > > value of metric to a long value that represents time in
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > > Since we have not only JMX and now metrics are periodically
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > some backend it can give a more clear picture of how much
> > > > > > > > > > >
> > > > > > > > > > > time we
> > > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > > resuming cache operations instead of instant boolean
> > > > > > > > > > >
> > > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <
> > > > > > > > > > > > >
> > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric that
> > > > > > > > > > >
> > > > > > > > > > > does
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > > influence on the cluster correctly. PME can be without
> > > > > > > > > > >
> > > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > > operations. For example, client node join/leave events.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I suggest add new metric - isOperationsBlockedByPme().
> > > > > > > > > > > > >
> > > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > > metrics will show influence of the PME on cluster and user
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa is green). [1] Can
> > > > > > > > > > >
> > > > > > > > > > > anyone
> > > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   nizhikov@apache.org
> > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > I think the administrator of an Ignite cluster should be able to
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > Ignite process, including non blocking PME.
> > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > Found PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time and not so useful
> > > > > > > > > > >
> > > > > > > > > > > because
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > > > The goal is to show exactly the blocking period.
> > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see
> > > > > > > > > > >
> > > > > > > > > > > no
> > > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone implementation of this
> > > > > > > > > > > > >
> > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > For now, implementation of new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > I think we can implement this metrics as a single
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is a 1 simple metric:
> > > > > > > > > > >
> > > > > > > > > > > are
> > > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now
> > > > > > > > > > >
> > > > > > > > > > > and
> > > > > > > > > > > > > can
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >   nsamelchev@gmail.com
> > > > > > > > > > > > > > > > > > > > > :
> > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add some useful metrics about the
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the duration of PME stages
> > > > > > > > > > > > >
> > > > > > > > > > > > > available
> > > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be obtained using JMX or other
> > > > > > > > > > >
> > > > > > > > > > > external
> > > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local node metrics that
> > > > > > > > > > >
> > > > > > > > > > > help to
> > > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > > actual status of current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that
> > > > > > > > > > >
> > > > > > > > > > > initiates
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > > updates and translations on a previous
> > > > > > > > > > >
> > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node
> > > > > > > > > > >
> > > > > > > > > > > sent a
> > > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> > > > > > > > > > > > >
> > > > > > > > > > > > > received
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited for all updates was
> > > > > > > > > > >
> > > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single
> > > > > > > > > > > > >
> > > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >   https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best wishes,
> > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Zhenya Stanilovsky
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best wishes,
> > > > > > Amelchev Nikita
> > > > > >
> > > >
> > > >
> > > > --
> > > > Best wishes,
> > > > Amelchev Nikita



-- 
Best wishes,
Amelchev Nikita

Re: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

Guys.

I think we should go with the 2 metrics

	* current PME duration (resets on finish)

		This metric is required for alerting (or automatic actions) on long PMEs.

	* PME duration histogram (value added to metrics on PME finish)
		This metric is required for:
			* Quick PME trend analysis
			* Quick PME history analysis
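
As a rough illustration of the alerting use case, a check over the current-duration values collected from each node could look like the sketch below. The LongPmeAlert class, the node ids, and the threshold are hypothetical; how the values are actually collected (JMX, an exporter, etc.) is outside the scope of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical alert rule over the "current PME duration" metric (illustration only). */
final class LongPmeAlert {
    /**
     * Returns the ids of nodes whose current duration exceeds the threshold,
     * in sorted order. A value of 0 means that no PME is in progress on the node.
     */
    static List<String> nodesOverThreshold(Map<String, Long> currentDurationMsByNode, long thresholdMs) {
        List<String> res = new ArrayList<>();

        // TreeMap gives a stable, sorted iteration order for the report.
        for (Map.Entry<String, Long> e : new TreeMap<>(currentDurationMsByNode).entrySet()) {
            if (e.getValue() > thresholdMs)
                res.add(e.getKey());
        }

        return res;
    }
}
```

Because the metric resets on finish, such a rule only fires while a long PME is still running. Past PMEs would have to be analysed through the histogram.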


On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> Nikita and Maxim,
> 
> > What if we just update the behaviour of the current getCurrentPmeDuration metric
> > to show durations only for blocking PMEs?
> > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > 
> > No other changes will be required.
> > 
> > WDYT?
> 
> I agree with these two metrics. I also think that current 
> getCurrentPmeDuration will become redundant.
> 
> Anton,
> 
> > It looks like we're trying to implement "extended debug" instead of
> > "monitoring".
> > It should not be interesting for real admin what phase of PME is in
> > progress and so on.
> 
> PME is a mission-critical cluster process. I agree that there's a fine
> line between monitoring and debugging here. However, it's not good to add
> monitoring capabilities only for the scenario when everything is alright.
> If PME really hangs, a *real admin* will be extremely interested in how
> to return the cluster to a working state. Metrics about stage completion
> times may really help here: e.g. if one specific node hasn't completed
> stage X while the rest of the cluster has, it can be a signal that this node
> should be killed.
> 
> Of course, it's possible to build a monitoring system that extracts this
> information from logs, but:
> - It's more resource-intensive, as it requires parsing logs all the time
> - It's less reliable, as log messages may change
> 
> Best Regards,
> Ivan Rakov
> 
> On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > Folks,
> > 
> > +1 with Anton post.
> > 
> > What if we just update the behaviour of the current getCurrentPmeDuration metric
> > to show durations only for blocking PMEs?
> > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > 
> > No other changes will be required.
> > 
> > WDYT?
> > 
> > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <ns...@gmail.com> wrote:
> > > Nikolay,
> > > 
> > > The cacheOperationsBlockedDuration metric will show current blocking
> > > duration or 0 if there is no blocking right now.
> > > 
> > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > blocking durations that happen after node starts.
> > > 
> > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
> > > > Nikita
> > > > 
> > > > What is the difference between those two metrics?
> > > > 
> > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
> > > > 
> > > > > Igniters, thanks for comments.
> > > > > 
> > > > >  From the discussion it can be seen that we need only two metrics for now:
> > > > > - cacheOperationsBlockedDuration (long)
> > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > 
> > > > > I will prepare PR at the nearest time.
> > > > > 
> > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid
> > > > > > :
> > > > > > 
> > > > > > +1 with Anton decisions.
> > > > > > 
> > > > > > 
> > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
> > > > > > > 
> > > > > > > Folks,
> > > > > > > 
> > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > "monitoring".
> > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > progress and so on.
> > > > > > > Interested metrics are
> > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > Duration of the current blocking period can be easily presented using
> > > > > 
> > > > > any
> > > > > > > modern monitoring tool by regular checks.
> > > > > > > Initial true will means "period start", precision will be a result of
> > > > > > > checks frequency.
> > > > > > > Anyway, I'm ok to have current metric presented with long, where long
> > > > > 
> > > > > is a
> > > > > > > duration, see no reason, but ok :)
> > > > > > > 
> > > > > > > All other features you mentioned are useful for code or
> > > > > > > deployment improving and can (should) be taken from logs at the analysis
> > > > > > > phase.
> > > > > > > 
> > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
> > > > > > > > Folks, let me step in.
> > > > > > > > 
> > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > 
> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
> > > > > 
> > > > > all
> > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > 
> > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > 
> > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > I think that it would be better to change format of metrics 4, 5, 6
> > > > > 
> > > > > and
> > > > > > > > 7 a bit: we can keep only difference between time of previous event
> > > > > 
> > > > > and
> > > > > > > > time of corresponding event. Such metrics would be easier to perceive:
> > > > > > > > they answer to specific questions "how much time did partition release
> > > > > > > > take?" or "how much time did awaiting of distributed phase end take?".
> > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > > > > > > > graphs will show how different stages times change from one PME to
> > > > > 
> > > > > another.
> > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to
> > > > > 
> > > > > have
> > > > > > > > > monitoring related to it
> > > > > > > > 
> > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > just complicate monitoring.
> > > > > > > > 
> > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > I still don't understand why instant value indicating that
> > > > > 
> > > > > operations are
> > > > > > > > > blocked should be boolean.
> > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > 
> > > > > > > > useful.
> > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > 
> > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > "current PME block time" metrics are useful. Growth of accumulated
> > > > > > > > metric for specific period of time (should be easy to check via
> > > > > > > > monitoring system graph) will show for how much business operations
> > > > > 
> > > > > were
> > > > > > > > blocked in total, and non-zero current metric will show that we are
> > > > > > > > experiencing issues right now. Boolean metric "are we blocked right
> > > > > 
> > > > > now"
> > > > > > > > is not needed as it's obviously can be inferred from "current PME
> > > > > 
> > > > > block
> > > > > > > > time".
> > > > > > > > 
> > > > > > > > Best Regards,
> > > > > > > > Ivan Rakov
> > > > > > > > 
> > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > Nikita,
> > > > > > > > > 
> > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > I still don't understand why instant value indicating that
> > > > > 
> > > > > operations are
> > > > > > > > > blocked should be boolean.
> > > > > > > > > Duration time since blocking has started looks more appropriate and
> > > > > > > > 
> > > > > > > > useful.
> > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com >:
> > > > > > > > > > Folks,
> > > > > > > > > > 
> > > > > > > > > > All previous suggestions have some disadvantages. It can be several
> > > > > > > > > > exchanges between two metric updates and fast exchange can rewrite
> > > > > > > > > > previous long exchange.
> > > > > > > > > > 
> > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > information about how long operations were blocked. Cluster metric
> > > > > > > > > > will be a maximum of local nodes metrics. And we need a boolean
> > > > > 
> > > > > metric
> > > > > > > > > > that will indicate realtime status. It needs because of duration
> > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > > 
> > > > > > > > > > So I propose to change the current metric that not released to the
> > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > 
> > > > > > > > > > WDYT?
> > > > > > > > > > 
> > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > Nikolay,
> > > > > > > > > > > 
> > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
> > > > > 
> > > > > nizhikov@apache.org >
> > > > > > > > > > wrote:
> > > > > > > > > > > > Anton.
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. Value exported based on SPI settings, not in the moment it
> > > > > 
> > > > > changed.
> > > > > > > > > > > > 2. Clock synchronisation - if we export start time, we should
> > > > > 
> > > > > also
> > > > > > > > > > export
> > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov < av@apache.org >:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > Sine monitoring system checks metrics periodically it will know
> > > > > 
> > > > > the
> > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <
> > > > > 
> > > > > jokserfn@gmail.com >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the metric name, I
> > > > > 
> > > > > suggest
> > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think it cleaner
> > > > > 
> > > > > represents
> > > > > > > > > > what
> > > > > > > > > > > > is
> > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > We can also combine both timestamp
> > > > > > > > > > 
> > > > > > > > > > "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > duration to have better correlation when cache operations were
> > > > > > > > > > 
> > > > > > > > > > blocked
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > For instant view (like in JMX bean) a calculated value as you
> > > > > > > > > > 
> > > > > > > > > > mentioned
> > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > For metrics are exported to some backend (IEP-35) a counter
> > > > > 
> > > > > can be
> > > > > > > > > > > > used.
> > > > > > > > > > > > > > The counter is incremented by blocking time after blocking has
> > > > > > > > > > 
> > > > > > > > > > ended.
> > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < nsamelchev@gmail.com >:
> > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration
> > > > > 
> > > > > here?
> > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric
> > > > > 
> > > > > to a
> > > > > > > > > > > > long
> > > > > > > > > > > > > > > value that represents time in milliseconds when operations
> > > > > 
> > > > > were
> > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like
> > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't come up with a
> > > > > 
> > > > > better
> > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful
> > > > > 
> > > > > information.
> > > > > > > > > > The
> > > > > > > > > > > > > main
> > > > > > > > > > > > > > > PME side effect for end-users is blocking cache operations.
> > > > > 
> > > > > Not
> > > > > > > > > > all
> > > > > > > > > > > > PME
> > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > What information gives to an end-user timestamp of
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what analysis it can be
> > > > > 
> > > > > used and
> > > > > > > > > > > > how?
> > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev < nsamelchev@gmail.com >:
> > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > This time already can be obtained from the
> > > > > > > > > > 
> > > > > > > > > > getCurrentPmeDuration
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > As an alternative solution, I can rework recently added
> > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not released yet). Seems for
> > > > > > > > > > 
> > > > > > > > > > users it
> > > > > > > > > > > > > > > > > useless in case of non-blocking PME.
> > > > > > > > > > > > > > > > > Lets name it timeSinceOperationsBlocked. It'll be timestamp
> > > > > > > > > > 
> > > > > > > > > > when
> > > > > > > > > > > > > > > > > blocking started (minimal value of cluster nodes) and 0 if
> > > > > > > > > > 
> > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko < jokserfn@gmail.com >:
> > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we
> > > > > > > > > > 
> > > > > > > > > > change the
> > > > > > > > > > > > > > > boolean
> > > > > > > > > > > > > > > > > > value of metric to a long value that represents time in
> > > > > > > > > > > > > 
> > > > > > > > > > > > > milliseconds
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > Since we have not only JMX and now metrics are periodically
> > > > > > > > > > > > > 
> > > > > > > > > > > > > exported
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > some backend it can give a more clear picture of how much
> > > > > > > > > > 
> > > > > > > > > > time we
> > > > > > > > > > > > > > > wait for
> > > > > > > > > > > > > > > > > > resuming cache operations instead of instant boolean
> > > > > > > > > > 
> > > > > > > > > > indicator.
> > > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev < nsamelchev@gmail.com >:
> > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric that
> > > > > > > > > > 
> > > > > > > > > > does
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > > influence on the cluster correctly. PME can be without
> > > > > > > > > > 
> > > > > > > > > > blocking
> > > > > > > > > > > > > > > > > > > operations. For example, client node join/leave events.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > I suggest add new metric - isOperationsBlockedByPme().
> > > > > > > > > > > > 
> > > > > > > > > > > > Together,
> > > > > > > > > > > > > > > these
> > > > > > > > > > > > > > > > > > > metrics will show influence of the PME on cluster and user
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > operations.
> > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa is green). [1] Can
> > > > > > > > > > 
> > > > > > > > > > anyone
> > > > > > > > > > > > > > > take a
> > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov < nizhikov@apache.org >:
> > > > > > > > > > > > > > > > > > > > I think administator of Ignite cluster should be able to
> > > > > > > > > > > > > 
> > > > > > > > > > > > > monitor
> > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > Ignite process, including non blocking PME.
> > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > Found PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time and not so useful
> > > > > > > > > > 
> > > > > > > > > > because
> > > > > > > > > > > > of
> > > > > > > > > > > > > > > this.
> > > > > > > > > > > > > > > > > > > > > The goal it so show exactly blocking period.
> > > > > > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see
> > > > > > > > > > 
> > > > > > > > > > no
> > > > > > > > > > > > > > > reason to have
> > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone implementation of this
> > > > > > > > > > > > 
> > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > For now, implementation of new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > I think we can implement this metrics as a single
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is a 1 simple metric:
> > > > > > > > > > 
> > > > > > > > > > are
> > > > > > > > > > > > > > > operations
> > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > Lest start from this.
> > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now
> > > > > > > > > > 
> > > > > > > > > > and
> > > > > > > > > > > > can
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > implemented
> > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > >   nizhikov@apache.org >
> > > > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev < nsamelchev@gmail.com >:
> > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add some useful metrics about the
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > partition map
> > > > > > > > > > > > > > > > > > > exchange
> > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the duration of PME stages
> > > > > > > > > > > > 
> > > > > > > > > > > > available
> > > > > > > > > > > > > > > only in
> > > > > > > > > > > > > > > > > > > log
> > > > > > > > > > > > > > > > > > > > > > files
> > > > > > > > > > > > > > > > > > > > > > > > > and cannot be obtained using JMX or other
> > > > > > > > > > 
> > > > > > > > > > external
> > > > > > > > > > > > > > > tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local node metrics that
> > > > > > > > > > 
> > > > > > > > > > help to
> > > > > > > > > > > > > > > understand
> > > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > > > > > actual status of current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that
> > > > > > > > > > 
> > > > > > > > > > initiates
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has
> > > > > > > > > > > > > 
> > > > > > > > > > > > > finished
> > > > > > > > > > > > > > > waiting
> > > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > > > > > > > > updates and translations on a previous
> > > > > > > > > > 
> > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node
> > > > > > > > > > 
> > > > > > > > > > sent a
> > > > > > > > > > > > > > > single
> > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node
> > > > > > > > > > > > 
> > > > > > > > > > > > received
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > full
> > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited for all updates was
> > > > > > > > > > 
> > > > > > > > > > completed.
> > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single
> > > > > > > > > > > > 
> > > > > > > > > > > > message)
> > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Zhenya Stanilovsky
> > > > > 
> > > > > 
> > > > > --
> > > > > Best wishes,
> > > > > Amelchev Nikita
> > > > > 
> > > 
> > > 
> > > --
> > > Best wishes,
> > > Amelchev Nikita

Re: Partition map exchange metrics

Posted by Ivan Rakov <iv...@gmail.com>.

Nikita and Maxim,

> What if we just update current metric getCurrentPmeDuration behaviour
> to show durations only for blocking PMEs?
> Remain it as a long value and rename it to getCacheOperationsBlockedDuration.
>
> No other changes will require.
>
> WDYT?
I agree with these two metrics. I also think that the current
getCurrentPmeDuration will become redundant.
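
To make the intended semantics concrete, here is a minimal Java sketch. It
assumes the current metric reads 0 when nothing is blocked and that the total
is accumulated when a blocking period ends. The class BlockedDurationMetrics,
its method names, and the injected clock are hypothetical and are not Ignite's
actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical holder for the two proposed metrics (illustration only). */
final class BlockedDurationMetrics {
    /** Millisecond clock; injected so the behaviour can be checked deterministically. */
    private final LongSupplier clock;

    /** Start of the current blocking period, or -1 when operations are not blocked. */
    private final AtomicLong blockStart = new AtomicLong(-1);

    /** Sum of all finished blocking periods since node start. */
    private final AtomicLong total = new AtomicLong();

    BlockedDurationMetrics(LongSupplier clock) {
        this.clock = clock;
    }

    /** Called when a blocking PME starts blocking cache operations. */
    void onBlockingStarted() {
        // Only the first call of a period records the start time.
        blockStart.compareAndSet(-1, clock.getAsLong());
    }

    /** Called when cache operations are resumed. */
    void onBlockingFinished() {
        long start = blockStart.getAndSet(-1);
        if (start >= 0)
            total.addAndGet(clock.getAsLong() - start);
    }

    /** Current blocking duration in milliseconds, or 0 if nothing is blocked right now. */
    long cacheOperationsBlockedDuration() {
        long start = blockStart.get();
        return start < 0 ? 0 : clock.getAsLong() - start;
    }

    /** Accumulated duration of all finished blocking periods, in milliseconds. */
    long totalCacheOperationsBlockedDuration() {
        return total.get();
    }
}
```

With this shape, a non-zero current value answers "are we blocked right now",
so a separate boolean is not needed.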

Anton,

> It looks like we're trying to implement "extended debug" instead of
> "monitoring".
> It should not be interesting for real admin what phase of PME is in
> progress and so on.

PME is a mission-critical cluster process. I agree that there's a fine
line between monitoring and debugging here. However, it's not good to add
monitoring capabilities only for the scenario when everything is all right.
If PME really hangs, a *real admin* will be extremely interested in how
to return the cluster to a working state. Metrics about stage completion
time may really help here: e.g. if one specific node hasn't completed
stage X while the rest of the cluster has, it can be a signal that this node
should be killed.
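
As an illustration only, such a check could look roughly like this in Java.
The class and method names are hypothetical, and the per-node map of stage
completion times is an assumed input collected by the monitoring system, not
an existing Ignite API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical monitoring-side helper (illustration only). */
final class PmeStageCheck {
    private PmeStageCheck() {
    }

    /**
     * Returns the IDs of nodes that have not yet reported completion of a PME stage.
     *
     * @param clusterNodes IDs of all nodes expected to take part in the exchange.
     * @param stageDurationMs Reported stage duration per node; a missing key means
     *      the node has not completed the stage yet.
     * @return Sorted list of node IDs that are still in the stage.
     */
    static List<String> laggingNodes(Set<String> clusterNodes, Map<String, Long> stageDurationMs) {
        List<String> lagging = new ArrayList<>();

        for (String node : clusterNodes) {
            if (!stageDurationMs.containsKey(node))
                lagging.add(node);
        }

        // Sort for a stable, readable result in alerts.
        Collections.sort(lagging);

        return lagging;
    }
}
```

A single ID returned while every other node has reported is the "one node
holds the whole exchange" case described above.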

Of course, it's possible to build a monitoring system that extracts this
information from logs, but:
- It's more resource-intensive, as it requires parsing logs all the time
- It's less reliable, as log messages may change

Best Regards,
Ivan Rakov

On 24.07.2019 14:57, Maxim Muzafarov wrote:
> Folks,
>
> +1 with Anton post.
>
> What if we just update current metric getCurrentPmeDuration behaviour
> to show durations only for blocking PMEs?
> Remain it as a long value and rename it to getCacheOperationsBlockedDuration.
>
> No other changes will require.
>
> WDYT?
>
> On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <ns...@gmail.com> wrote:
>> Nikolay,
>>
>> The cacheOperationsBlockedDuration metric will show current blocking
>> duration or 0 if there is no blocking right now.
>>
>> The totalCacheOperationsBlockedDuration metric will accumulate all
>> blocking durations that happen after node starts.
>>
>> Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
>>> Nikita
>>>
>>> What is the difference between those two metrics?
>>>
>>> Wed, 24 Jul 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
>>>
>>>> Igniters, thanks for comments.
>>>>
>>>>  From the discussion it can be seen that we need only two metrics for now:
>>>> - cacheOperationsBlockedDuration (long)
>>>> - totalCacheOperationsBlockedDuration (long)
>>>>
>>>> I will prepare PR at the nearest time.
>>>>
>>>> Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
>>>>>
>>>>> +1 with Anton decisions.
>>>>>
>>>>>
>>>>>> Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
>>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> It looks like we're trying to implement "extended debug" instead of
>>>>>> "monitoring".
>>>>>> It should not be interesting for real admin what phase of PME is in
>>>>>> progress and so on.
>>>>>> Interested metrics are
>>>>>> - total blocked time (will be used for real SLA counting)
>>>>>> - are we blocked right now (shows we have an SLA degradation right now)
>>>>>> Duration of the current blocking period can be easily presented using any
>>>>>> modern monitoring tool by regular checks.
>>>>>> Initial true will means "period start", precision will be a result of
>>>>>> checks frequency.
>>>>>> Anyway, I'm ok to have current metric presented with long, where long is a
>>>>>> duration, see no reason, but ok :)
>>>>>>
>>>>>> All other features you mentioned are useful for code or
>>>>>> deployment improving and can (should) be taken from logs at the analysis
>>>>>> phase.
>>>>>>
>>>>>> On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
>>>>>>> Folks, let me step in.
>>>>>>>
>>>>>>> Nikita, thanks for your suggestions!
>>>>>>>
>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>>>>>>>> 2. initTime. Time PME was started.
>>>>>>>> 3. initEvent. Event that triggered PME.
>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for
>>>> all
>>>>>>>> updates and translations on a previous topology.
>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>>>>>>>> 7. finishTime. Time PME was ended.
>>>>>>>>
>>>>>>>> When new PME started all these metrics resets.
>>>>>>> Every metric from Nikita's list looks useful and simple to implement.
>>>>>>> I think that it would be better to change format of metrics 4, 5, 6
>>>> and
>>>>>>> 7 a bit: we can keep only difference between time of previous event
>>>> and
>>>>>>> time of corresponding event. Such metrics would be easier to perceive:
>>>>>>> they answer to specific questions "how much time did partition release
>>>>>>> take?" or "how much time did awaiting of distributed phase end take?".
>>>>>>> Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
>>>>>>> graphs will show how different stages times change from one PME to
>>>> another.
>>>>>>>> When PME cause no blocking, it's a good PME and I see no reason to
>>>> have
>>>>>>>> monitoring related to it
>>>>>>> Agree with Anton here. These metrics should be measured only for true
>>>>>>> distributed exchange. Saving results for client leave/join PMEs will
>>>>>>> just complicate monitoring.
>>>>>>>
>>>>>>>> I agree with total blocking duration metric but
>>>>>>>> I still don't understand why instant value indicating that
>>>> operations are
>>>>>>>> blocked should be boolean.
>>>>>>>> Duration time since blocking has started looks more appropriate and
>>>>>>> useful.
>>>>>>>> It gives more information while semantic is left the same.
>>>>>>> Totally agree with Pavel here. Both "accumulated block time" and
>>>>>>> "current PME block time" metrics are useful. Growth of accumulated
>>>>>>> metric for specific period of time (should be easy to check via
>>>>>>> monitoring system graph) will show for how much business operations
>>>> were
>>>>>>> blocked in total, and non-zero current metric will show that we are
>>>>>>> experiencing issues right now. Boolean metric "are we blocked right
>>>> now"
>>>>>>> is not needed as it's obviously can be inferred from "current PME
>>>> block
>>>>>>> time".
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Ivan Rakov
>>>>>>>
>>>>>>> On 23.07.2019 16:02, Pavel Kovalenko wrote:
>>>>>>>> Nikita,
>>>>>>>>
>>>>>>>> I agree with total blocking duration metric but
>>>>>>>> I still don't understand why instant value indicating that
>>>> operations are
>>>>>>>> blocked should be boolean.
>>>>>>>> Duration time since blocking has started looks more appropriate and
>>>>>>> useful.
>>>>>>>> It gives more information while semantic is left the same.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com >:
>>>>>>>>> Folks,
>>>>>>>>>
>>>>>>>>> All previous suggestions have some disadvantages. It can be several
>>>>>>>>> exchanges between two metric updates and fast exchange can rewrite
>>>>>>>>> previous long exchange.
>>>>>>>>>
>>>>>>>>> We can introduce a metric of total blocking duration that will
>>>>>>>>> accumulate at the end of the exchange. So, users will get actual
>>>>>>>>> information about how long operations were blocked. Cluster metric
>>>>>>>>> will be a maximum of local nodes metrics. And we need a boolean
>>>> metric
>>>>>>>>> that will indicate realtime status. It needs because of duration
>>>>>>>>> metric updates at the end of the exchange.
>>>>>>>>>
>>>>>>>>> So I propose to change the current metric that not released to the
>>>>>>>>> totalCacheOperationsBlockingDuration metric and to add the
>>>>>>>>> isCacheOperationsBlocked metric.
>>>>>>>>>
>>>>>>>>> WDYT?
>>>>>>>>>
>>>>>>>>> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < av@apache.org >:
>>>>>>>>>> Nikolay,
>>>>>>>>>>
>>>>>>>>>> Still see no reason to replace boolean with long.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
>>>> nizhikov@apache.org >
>>>>>>>>> wrote:
>>>>>>>>>>> Anton.
>>>>>>>>>>>
>>>>>>>>>>> 1. Value exported based on SPI settings, not in the moment it
>>>> changed.
>>>>>>>>>>> 2. Clock synchronisation - if we export start time, we should
>>>> also
>>>>>>>>> export
>>>>>>>>>>> node local timestamp.
>>>>>>>>>>>
>>>>>>>>>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov < av@apache.org >:
>>>>>>>>>>>
>>>>>>>>>>>> Folks,
>>>>>>>>>>>>
>>>>>>>>>>>> What's the reason for duration counting?
>>>>>>>>>>>> AFAIU, it's a monitoring system feature to count the durations.
>>>>>>>>>>>> Sine monitoring system checks metrics periodically it will know
>>>> the
>>>>>>>>>>>> duration by its own log.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <
>>>> jokserfn@gmail.com >
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Nikita,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest
>>>>>>>>>>>>> "cacheOperationsBlockingDuration"; I think it more clearly represents
>>>>>>>>>>>>> what is blocked during PME.
>>>>>>>>>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs"
>>>>>>>>>>>>> and the duration to better correlate when cache operations were blocked
>>>>>>>>>>>>> and how much time it took.
>>>>>>>>>>>>> For an instant view (like in a JMX bean), a calculated value as you
>>>>>>>>>>>>> mentioned can be used.
>>>>>>>>>>>>> For metrics that are exported to some backend (IEP-35), a counter can be
>>>>>>>>>>>>> used. The counter is incremented by the blocking time after blocking has
>>>>>>>>>>>>> ended.
>>>>>>>>>>>>> Fri, Jul 19, 2019 at 19:10, Nikita Amelchev < nsamelchev@gmail.com >:
>>>>>>>>>>>>>> Pavel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main purpose of this metric is
>>>>>>>>>>>>>>>> how much time we wait for resuming cache operations
>>>>>>>>>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
>>>>>>>>>>>>>>>> What do you think if we change the boolean value of metric to a long
>>>>>>>>>>>>>>>> value that represents time in milliseconds when operations were blocked?
>>>>>>>>>>>>>> This time can be calculated as (currentTime -
>>>>>>>>>>>>>> timeSinceOperationsBlocked) in case of a timestamp.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A duration will be more understandable. It'll be something like
>>>>>>>>>>>>>> getCurrentBlockingPmeDuration, but I haven't come up with a better
>>>>>>>>>>>>>> name yet.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko < jokserfn@gmail.com >:
>>>>>>>>>>>>>>> Nikita,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The main
>>>>>>>>>>>>>>> PME side effect for end users is blocking cache operations. Not all PME
>>>>>>>>>>>>>>> time blocks them.
>>>>>>>>>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give to
>>>>>>>>>>>>>>> an end user? For what analysis can it be used, and how?
>>>>>>>>>>>>>>> Fri, Jul 19, 2019 at 17:48, Nikita Amelchev < nsamelchev@gmail.com >:
>>>>>>>>>>>>>>>> Hi Pavel,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
>>>>>>>>>>>>>>>> new isOperationsBlockedByPme metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As an alternative solution, I can rework the recently added
>>>>>>>>>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless for
>>>>>>>>>>>>>>>> users in the case of a non-blocking PME.
>>>>>>>>>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
>>>>>>>>>>>>>>>> blocking started (the minimal value across cluster nodes), or 0 if
>>>>>>>>>>>>>>>> blocking has ended (there is no running PME).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WDYT?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko < jokserfn@gmail.com >:
>>>>>>>>>>>>>>>>> Hi Nikita,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you for working on this. What do you think if we change the boolean
>>>>>>>>>>>>>>>>> value of metric to a long value that represents time in milliseconds when
>>>>>>>>>>>>>>>>> operations were blocked?
>>>>>>>>>>>>>>>>> Since we have not only JMX and metrics are now periodically exported to
>>>>>>>>>>>>>>>>> some backend, it can give a clearer picture of how much time we wait for
>>>>>>>>>>>>>>>>> resuming cache operations than an instant boolean indicator.
>>>>>>>>>>>>>>>>> Fri, Jul 19, 2019 at 14:41, Nikita Amelchev < nsamelchev@gmail.com >:
>>>>>>>>>>>>>>>>>> Anton, Nikolay,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the support.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not show
>>>>>>>>>>>>>>>>>> the influence on the cluster correctly. A PME can happen without blocking
>>>>>>>>>>>>>>>>>> operations, for example, on client node join/leave events.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I suggest adding a new metric: isOperationsBlockedByPme(). Together, these
>>>>>>>>>>>>>>>>>> metrics will show the influence of the PME on the cluster and user
>>>>>>>>>>>>>>>>>> operations.
>>>>>>>>>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone take a
>>>>>>>>>>>>>>>>>> look?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov < nizhikov@apache.org >:
>>>>>>>>>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to monitor
>>>>>>>>>>>>>>>>>>> all Ignite processes, including non-blocking PME.
>>>>>>>>>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
>>>>>>>>>>>>>>>>>>>> BTW,
>>>>>>>>>>>>>>>>>>>> Found a PME metric: getCurrentPmeDuration().
>>>>>>>>>>>>>>>>>>>> It seems to show exactly the PME time and is not so useful because of
>>>>>>>>>>>>>>>>>>>> this.
>>>>>>>>>>>>>>>>>>>> The goal is to show exactly the blocking period.
>>>>>>>>>>>>>>>>>>>> When a PME causes no blocking, it's a good PME and I see no reason to have
>>>>>>>>>>>>>>>>>>>> monitoring related to it :)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov < nizhikov@apache.org > wrote:
>>>>>>>>>>>>>>>>>>>>> Anton.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
>>>>>>>>>>>>>>>>>>>>> For now, implementing a new metric is very simple.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
>>>>>>>>>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
>>>>>>>>>>>>>>>>>>>>>> Nikita,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations blocked?
>>>>>>>>>>>>>>>>>>>>>> Just a true or false.
>>>>>>>>>>>>>>>>>>>>>> Let's start from this.
>>>>>>>>>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be implemented
>>>>>>>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov < nizhikov@apache.org > wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> +1.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Nikita, please, go ahead.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Tue, Jul 16, 2019, 11:45 Nikita Amelchev < nsamelchev@gmail.com >:
>>>>>>>>>>>>>>>>>>>>>>>> Hello, Igniters.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map exchange
>>>>>>>>>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages is available only in log files
>>>>>>>>>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other external tools. [1]
>>>>>>>>>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the
>>>>>>>>>>>>>>>>>>>>>>>> actual status of the current PME:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
>>>>>>>>>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
>>>>>>>>>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
>>>>>>>>>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
>>>>>>>>>>>>>>>>>>>>>>>> updates and translations on a previous topology.
>>>>>>>>>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
>>>>>>>>>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
>>>>>>>>>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> These metrics help to understand:
>>>>>>>>>>>>>>>>>>>>>>>> - how long the PME was (current or previous).
>>>>>>>>>>>>>>>>>>>>>>>> - how long waiting for all updates to complete took.
>>>>>>>>>>>>>>>>>>>>>>>> - which node blocks the PME (didn't send a single message).
>>>>>>>>>>>>>>>>>>>>>>>> - what triggered the PME.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best wishes,
>>>>>>>>>>>>>> Amelchev Nikita
>>>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best wishes,
>>>>>>>>> Amelchev Nikita
>>>>>>>>>
>>>>>
>>>>> --
>>>>> Zhenya Stanilovsky
>>>>
>>>>
>>>> --
>>>> Best wishes,
>>>> Amelchev Nikita
>>>>
>>
>>
>> --
>> Best wishes,
>> Amelchev Nikita

Re: Re[2]: Partition map exchange metrics

Posted by Maxim Muzafarov <ma...@gmail.com>.

Folks,

+1 to Anton's post.

What if we just update the behaviour of the current getCurrentPmeDuration
metric to show durations only for blocking PMEs?
Keep it as a long value and rename it to getCacheOperationsBlockedDuration.

No other changes will be required.

WDYT?
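
A minimal sketch of the proposal above, assuming a hypothetical helper class
(the class name and the onExchangeStarted/onExchangeFinished hooks are
illustrative and are not taken from Ignite). It only shows the intended
behaviour: the metric reports elapsed time while a blocking PME is running and
stays at 0 for non-blocking exchanges such as client node join/leave.

```java
/** Hypothetical sketch; not Ignite's actual implementation. */
final class BlockingPmeDurationSketch {
    /** Start timestamp (ms) of the running blocking PME, or -1 if none is running. */
    private long blockingStartMs = -1;

    /**
     * Called when an exchange starts.
     *
     * @param nowMs Current time in milliseconds (passed in to keep the sketch deterministic).
     * @param blocksCacheOps Whether this exchange blocks cache operations.
     */
    synchronized void onExchangeStarted(long nowMs, boolean blocksCacheOps) {
        // Non-blocking exchanges (e.g. client node join/leave) are ignored.
        if (blocksCacheOps)
            blockingStartMs = nowMs;
    }

    /** Called when the exchange finishes and cache operations resume. */
    synchronized void onExchangeFinished() {
        blockingStartMs = -1;
    }

    /** Returns the duration of the current blocking PME in ms, or 0 if nothing is blocked. */
    synchronized long getCacheOperationsBlockedDuration(long nowMs) {
        return blockingStartMs < 0 ? 0 : nowMs - blockingStartMs;
    }
}
```

Passing the clock value in explicitly is only a convenience for this sketch; a
real metric would presumably read the node's own clock.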

On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <ns...@gmail.com> wrote:
>
> Nikolay,
>
> The cacheOperationsBlockedDuration metric will show the current blocking
> duration or 0 if there is no blocking right now.
>
> The totalCacheOperationsBlockedDuration metric will accumulate all
> blocking durations that happen after the node starts.
>
> Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
> >
> > Nikita
> >
> > What is the difference between those two metrics?
> >
> > Wed, Jul 24, 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
> >
> > > Igniters, thanks for the comments.
> > >
> > > From the discussion, it can be seen that we need only two metrics for now:
> > > - cacheOperationsBlockedDuration (long)
> > > - totalCacheOperationsBlockedDuration (long)
> > >
> > > I will prepare a PR shortly.
> > >
> > > --
> > > Best wishes,
> > > Amelchev Nikita
> > >
>
>
>
> --
> Best wishes,
> Amelchev Nikita

Re: Re[2]: Partition map exchange metrics

Posted by Nikita Amelchev <ns...@gmail.com>.

Nikolay,

The cacheOperationsBlockedDuration metric will show the current blocking
duration or 0 if there is no blocking right now.

The totalCacheOperationsBlockedDuration metric will accumulate all
blocking durations that happen after the node starts.
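
The difference between the two metrics can be sketched as follows. This is an
illustration only: the class name and the onBlockingStarted/onBlockingFinished
hooks are hypothetical, and it assumes that the total is updated when a
blocking period ends, as suggested earlier in the thread.

```java
/** Hypothetical sketch; not Ignite's actual implementation. */
final class CacheOpsBlockingTracker {
    /** Timestamp (ms) when the current blocking period started, or -1 if not blocked. */
    private long blockStartMs = -1;

    /** Sum of all finished blocking periods since node start, in ms. */
    private long totalBlockedMs;

    /** Called when a PME starts blocking cache operations. */
    synchronized void onBlockingStarted(long nowMs) {
        if (blockStartMs < 0)
            blockStartMs = nowMs;
    }

    /** Called when cache operations are unblocked; adds the finished period to the total. */
    synchronized void onBlockingFinished(long nowMs) {
        if (blockStartMs >= 0) {
            totalBlockedMs += nowMs - blockStartMs;
            blockStartMs = -1;
        }
    }

    /** Current blocking duration in ms, or 0 if there is no blocking right now. */
    synchronized long cacheOperationsBlockedDuration(long nowMs) {
        return blockStartMs < 0 ? 0 : nowMs - blockStartMs;
    }

    /** Accumulated duration of all finished blocking periods, in ms. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs;
    }
}
```

With this shape, a monitoring system can alert on a non-zero current value and
graph the growth of the total over any period.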

Wed, Jul 24, 2019 at 13:35, Nikolay Izhikov <ni...@apache.org>:
>
> Nikita
>
> What is the difference between those two metrics?
>
> Wed, Jul 24, 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:
>
> > Igniters, thanks for the comments.
> >
> > From the discussion, it can be seen that we need only two metrics for now:
> > - cacheOperationsBlockedDuration (long)
> > - totalCacheOperationsBlockedDuration (long)
> >
> > I will prepare a PR shortly.
> >
> > Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas123@mail.ru.invalid>:
> > >
> > > +1 to Anton's decisions.
> > >
> > >
> > > >Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
> > > >
> > > >Folks,
> > > >
> > > >It looks like we're trying to implement "extended debug" instead of
> > > >"monitoring".
> > > >A real admin should not be interested in which phase of PME is in
> > > >progress and so on.
> > > >The metrics of interest are
> > > >- total blocked time (will be used for real SLA counting)
> > > >- are we blocked right now (shows we have an SLA degradation right now)
> > > >The duration of the current blocking period can easily be derived by any
> > > >modern monitoring tool through regular checks.
> > > >The first true will mean "period start"; precision will depend on the
> > > >check frequency.
> > > >Anyway, I'm ok with having the current metric presented as a long, where
> > > >the long is a duration. I see no reason for it, but ok :)
> > > >
> > > >All the other features you mentioned are useful for improving code or
> > > >deployment and can (should) be taken from logs at the analysis
> > > >phase.
> > > >
> > > >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
> > > >
> > > >> Folks, let me step in.
> > > >>
> > > >> Nikita, thanks for your suggestions!
> > > >>
> > > >> > 1. initialVersion. Topology version that initiates the exchange.
> > > >> > 2. initTime. Time PME was started.
> > > >> > 3. initEvent. Event that triggered PME.
> > > >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > >> > updates and translations on a previous topology.
> > > >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > >> > 6. recieveFullMessageTime. Time when a node received a full message.
> > > >> > 7. finishTime. Time PME was ended.
> > > >> >
> > > >> > When a new PME starts, all these metrics are reset.
> > > >> Every metric from Nikita's list looks useful and simple to implement.
> > > >> I think that it would be better to change the format of metrics 4, 5, 6
> > > >> and 7 a bit: we can keep only the difference between the time of the
> > > >> previous event and the time of the corresponding event. Such metrics
> > > >> would be easier to perceive: they answer specific questions such as "how
> > > >> much time did partition release take?" or "how much time did awaiting of
> > > >> distributed phase end take?".
> > > >> Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > >> graphs will show how the times of different stages change from one PME
> > > >> to another.
> > > >>
> > > >> > When a PME causes no blocking, it's a good PME and I see no reason to have
> > > >> > monitoring related to it
> > > >> Agree with Anton here. These metrics should be measured only for a true
> > > >> distributed exchange. Saving results for client leave/join PMEs will
> > > >> just complicate monitoring.
> > > >>
> > > >> > I agree with the total blocking duration metric, but
> > > >> > I still don't understand why the instant value indicating that
> > > >> > operations are blocked should be boolean.
> > > >> > The duration since blocking started looks more appropriate and useful.
> > > >> > It gives more information while the semantics stay the same.
> > > >> Totally agree with Pavel here. Both the "accumulated block time" and
> > > >> "current PME block time" metrics are useful. Growth of the accumulated
> > > >> metric over a specific period of time (should be easy to check via a
> > > >> monitoring system graph) will show for how long business operations were
> > > >> blocked in total, and a non-zero current metric will show that we are
> > > >> experiencing issues right now. The boolean metric "are we blocked right
> > > >> now" is not needed, as it can obviously be inferred from "current PME
> > > >> block time".
> > > >>
> > > >> Best Regards,
> > > >> Ivan Rakov
> > > >>
> > > >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > >> > Nikita,
> > > >> >
> > > >> > I agree with total blocking duration metric but
> > > >> > I still don't understand why instant value indicating that
> > operations are
> > > >> > blocked should be boolean.
> > > >> > Duration time since blocking has started looks more appropriate and
> > > >> useful.
> > > >> > It gives more information while semantic is left the same.
> > > >> >
> > > >> >
> > > >> >
> > > >> > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev < nsamelchev@gmail.com
> > >:
> > > >> >
> > > >> >> Folks,
> > > >> >>
> > > >> >> All previous suggestions have some disadvantages. It can be several
> > > >> >> exchanges between two metric updates and fast exchange can rewrite
> > > >> >> previous long exchange.
> > > >> >>
> > > >> >> We can introduce a metric of total blocking duration that will
> > > >> >> accumulate at the end of the exchange. So, users will get actual
> > > >> >> information about how long operations were blocked. Cluster metric
> > > >> >> will be a maximum of local nodes metrics. And we need a boolean
> > metric
> > > >> >> that will indicate realtime status. It needs because of duration
> > > >> >> metric updates at the end of the exchange.
> > > >> >>
> > > >> >> So I propose to change the current metric that not released to the
> > > >> >> totalCacheOperationsBlockingDuration metric and to add the
> > > >> >> isCacheOperationsBlocked metric.
> > > >> >>
> > > >> >> WDYT?
> > > >> >>
> > > >> >> пн, 22 июл. 2019 г. в 09:27, Anton Vinogradov < av@apache.org >:
> > > >> >>> Nikolay,
> > > >> >>>
> > > >> >>> Still see no reason to replace boolean with long.
> > > >> >>>
> > > >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <
> > nizhikov@apache.org >
> > > >> >> wrote:
> > > >> >>>> Anton.
> > > >> >>>>
> > > >> >>>> 1. Value exported based on SPI settings, not in the moment it
> > changed.
> > > >> >>>>
> > > >> >>>> 2. Clock synchronisation - if we export start time, we should
> > also
> > > >> >> export
> > > >> >>>> node local timestamp.
> > > >> >>>>
> > > >> >>>> пн, 22 июля 2019 г., 8:33 Anton Vinogradov < av@apache.org >:
> > > >> >>>>
> > > >> >>>>> Folks,
> > > >> >>>>>
> > > >> >>>>> What's the reason for duration counting?
> > > >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> > > >> >>>>> Sine monitoring system checks metrics periodically it will know
> > the
> > > >> >>>>> duration by its own log.
> > > >> >>>>>
> > > >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <
> > jokserfn@gmail.com >
> > > >> >>>>> wrote:
> > > >> >>>>>
> > > >> >>>>>> Nikita,
> > > >> >>>>>>
> > > >> >>>>>> Yes, I mean duration not timestamp. For the metric name, I
> > suggest
> > > >> >>>>>> "cacheOperationsBlockingDuration", I think it cleaner
> > represents
> > > >> >> what
> > > >> >>>> is
> > > >> >>>>>> blocked during PME.
> > > >> >>>>>> We can also combine both timestamp
> > > >> >> "cacheOperationsBlockingStartTs" and
> > > >> >>>>>> duration to have better correlation when cache operations were
> > > >> >> blocked
> > > >> >>>>> and
> > > >> >>>>>> how much time it's taken.
> > > >> >>>>>> For instant view (like in JMX bean) a calculated value as you
> > > >> >> mentioned
> > > >> >>>>>> can be used.
> > > >> >>>>>> For metrics are exported to some backend (IEP-35) a counter
> > can be
> > > >> >>>> used.
> > > >> >>>>>> The counter is incremented by blocking time after blocking has
> > > >> >> ended.
> > > >> >>>>>> пт, 19 июл. 2019 г. в 19:10, Nikita Amelchev <
> > nsamelchev@gmail.com
> > > >> >>> :
> > > >> >>>>>>> Pavel,
> > > >> >>>>>>>
> > > >> >>>>>>> The main purpose of this metric is
> > > >> >>>>>>>>> how much time we wait for resuming cache operations
> > > >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration
> > here?
> > > >> >>>>>>>>> What do you think if we change the boolean value of metric
> > to a
> > > >> >>>> long
> > > >> >>>>>>> value that represents time in milliseconds when operations
> > were
> > > >> >>>> blocked?
> > > >> >>>>>>> This time can be calculated as (currentTime -
> > > >> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
> > > >> >>>>>>>
> > > >> >>>>>>> Duration will be more understandable. It'll be something like
> > > >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a
> > better
> > > >> >>>>>>> name yet.
> > > >> >>>>>>>
> > > >> >>>>>>> пт, 19 июл. 2019 г. в 18:30, Pavel Kovalenko <
> > jokserfn@gmail.com
> > > >> >>> :
> > > >> >>>>>>>> Nikita,
> > > >> >>>>>>>>
> > > >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful
> > information.
> > > >> >> The
> > > >> >>>>> main
> > > >> >>>>>>> PME side effect for end-users is blocking cache operations.
> > Not
> > > >> >> all
> > > >> >>>> PME
> > > >> >>>>>>> time blocks it.
> > > >> >>>>>>>> What information gives to an end-user timestamp of
> > > >> >>>>>>> "timeSinceOperationsBlocked"? For what analysis it can be
> > used and
> > > >> >>>> how?
> > > >> >>>>>>>> пт, 19 июл. 2019 г. в 17:48, Nikita Amelchev <
> > > >> >>  nsamelchev@gmail.com
> > > >> >>>>> :
> > > >> >>>>>>>>> Hi Pavel,
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> This time already can be obtained from the
> > > >> >> getCurrentPmeDuration
> > > >> >>>> and
> > > >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> As an alternative solution, I can rework recently added
> > > >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). Seems for
> > > >> >> users it
> > > >> >>>>>>>>> useless in case of non-blocking PME.
> > > >> >>>>>>>>> Lets name it timeSinceOperationsBlocked. It'll be timestamp
> > > >> >> when
> > > >> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if
> > > >> >> blocking
> > > >> >>>>>>>>> ends (there is no running PME).
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> WDYT?
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> пт, 19 июл. 2019 г. в 15:56, Pavel Kovalenko <
> > > >> >>  jokserfn@gmail.com >:
> > > >> >>>>>>>>>> Hi Nikita,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Thank you for working on this. What do you think if we
> > > >> >> change the
> > > >> >>>>>>> boolean
> > > >> >>>>>>>>>> value of metric to a long value that represents time in
> > > >> >>>>> milliseconds
> > > >> >>>>>>> when
> > > >> >>>>>>>>>> operations were blocked?
> > > >> >>>>>>>>>> Since we have not only JMX and now metrics are periodically
> > > >> >>>>> exported
> > > >> >>>>>>> to
> > > >> >>>>>>>>>> some backend it can give a more clear picture of how much
> > > >> >> time we
> > > >> >>>>>>> wait for
> > > >> >>>>>>>>>> resuming cache operations instead of instant boolean
> > > >> >> indicator.
> > > >> >>>>>>>>>> пт, 19 июл. 2019 г. в 14:41, Nikita Amelchev <
> > > >> >>>>  nsamelchev@gmail.com
> > > >> >>>>>> :
> > > >> >>>>>>>>>>> Anton, Nikolay,
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Thanks for the support.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that
> > > >> >> does
> > > >> >>>> not
> > > >> >>>>>>> show
> > > >> >>>>>>>>>>> influence on the cluster correctly. PME can be without
> > > >> >> blocking
> > > >> >>>>>>>>>>> operations. For example, client node join/leave events.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> I suggest add new metric - isOperationsBlockedByPme().
> > > >> >>>> Together,
> > > >> >>>>>>> these
> > > >> >>>>>>>>>>> metrics will show influence of the PME on cluster and user
> > > >> >>>>>>> operations.
> > > >> >>>>>>>>>>> I have prepared PR for this (Bot visa is green). [1] Can
> > > >> >> anyone
> > > >> >>>>>>> take a
> > > >> >>>>>>>>>>> look?
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> вт, 16 июл. 2019 г. в 14:58, Nikolay Izhikov <
> > > >> >>>>>  nizhikov@apache.org
> > > >> >>>>>>>> :
> > > >> >>>>>>>>>>>> I think administator of Ignite cluster should be able to
> > > >> >>>>> monitor
> > > >> >>>>>>> all
> > > >> >>>>>>>>>>> Ignite process, including non blocking PME.
> > > >> >>>>>>>>>>>> В Вт, 16/07/2019 в 14:57 +0300, Anton Vinogradov пишет:
> > > >> >>>>>>>>>>>>> BTW,
> > > >> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
> > > >> >>>>>>>>>>>>> Seems, it shows exactly PME time and not so useful
> > > >> >> because
> > > >> >>>> of
> > > >> >>>>>>> this.
> > > >> >>>>>>>>>>>>> The goal it so show exactly blocking period.
> > > >> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see
> > > >> >> no
> > > >> >>>>>>> reason to have
> > > >> >>>>>>>>>>>>> monitoring related to it :)
> > > >> >>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
> > > >> >>>>>>>  nizhikov@apache.org >
> > > >> >>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>> Anton.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> Why do we need to postpone implementation of this
> > > >> >>>> metrics?
> > > >> >>>>>>>>>>>>>> For now, implementation of new metric is very simple.
> > > >> >>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>> I think we can implement this metrics as a single
> > > >> >>>>>>> contribution.
> > > >> >>>>>>>>>>>>>> В Вт, 16/07/2019 в 13:47 +0300, Anton Vinogradov
> > > >> >> пишет:
> > > >> >>>>>>>>>>>>>>> Nikita,
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> Looks like all we need now is a 1 simple metric:
> > > >> >> are
> > > >> >>>>>>> operations
> > > >> >>>>>>>>>>> blocked?
> > > >> >>>>>>>>>>>>>>> Just a true or false.
> > > >> >>>>>>>>>>>>>>> Lest start from this.
> > > >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now
> > > >> >> and
> > > >> >>>> can
> > > >> >>>>> be
> > > >> >>>>>>>>>>> implemented
> > > >> >>>>>>>>>>>>>>> later.
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
> > > >> >>>>>>>>>>>  nizhikov@apache.org >
> > > >> >>>>>>>>>>>>>>> wrote:
> > > >> >>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> +1.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>> вт, 16 июля 2019 г., 11:45 Nikita Amelchev <
> > > >> >>>>>>>  nsamelchev@gmail.com
> > > >> >>>>>>>>>>>> :
> > > >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> I suggest to add some useful metrics about the
> > > >> >>>>>>> partition map
> > > >> >>>>>>>>>>> exchange
> > > >> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages
> > > >> >>>> available
> > > >> >>>>>>> only in
> > > >> >>>>>>>>>>> log
> > > >> >>>>>>>>>>>>>> files
> > > >> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other
> > > >> >> external
> > > >> >>>>>>> tools. [1]
> > > >> >>>>>>>>>>>>>>>>> I made the list of local node metrics that
> > > >> >> help to
> > > >> >>>>>>> understand
> > > >> >>>>>>>>>>> the
> > > >> >>>>>>>>>>>>>>>>> actual status of current PME:
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that
> > > >> >> initiates
> > > >> >>>>> the
> > > >> >>>>>>>>>>> exchange.
> > > >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> > > >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> > > >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has
> > > >> >>>>> finished
> > > >> >>>>>>> waiting
> > > >> >>>>>>>>>>> for
> > > >> >>>>>>>>>>>>>> all
> > > >> >>>>>>>>>>>>>>>>> updates and translations on a previous
> > > >> >> topology.
> > > >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node
> > > >> >> sent a
> > > >> >>>>>>> single
> > > >> >>>>>>>>>>> message.
> > > >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node
> > > >> >>>> received
> > > >> >>>>> a
> > > >> >>>>>>> full
> > > >> >>>>>>>>>>> message.
> > > >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> When new PME started all these metrics resets.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> > > >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> > > >> >>>>>>>>>>>>>>>>> - how long awaited for all updates was
> > > >> >> completed.
> > > >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single
> > > >> >>>> message)
> > > >> >>>>>>>>>>>>>>>>> - what triggered PME.
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> Thoughts?
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>>>>>>> [1]
> > > >> >>>>>  https://issues.apache.org/jira/browse/IGNITE-11961
> > > >> >>>>>>>>>>>>>>>>> --
> > > >> >>>>>>>>>>>>>>>>> Best wishes,
> > > >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> --
> > > >> >>>>>>>>>>> Best wishes,
> > > >> >>>>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> --
> > > >> >>>>>>>>> Best wishes,
> > > >> >>>>>>>>> Amelchev Nikita
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> --
> > > >> >>>>>>> Best wishes,
> > > >> >>>>>>> Amelchev Nikita
> > > >> >>>>>>>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Best wishes,
> > > >> >> Amelchev Nikita
> > > >> >>
> > > >>
> > >
> > >
> > > --
> > > Zhenya Stanilovsky
> >
> >
> >
> > --
> > Best wishes,
> > Amelchev Nikita
> >



-- 
Best wishes,
Amelchev Nikita

Re: Re[2]: Partition map exchange metrics

Posted by Nikolay Izhikov <ni...@apache.org>.

Nikita

What is the difference between those two metrics?

Wed, 24 Jul 2019, 12:45 Nikita Amelchev <ns...@gmail.com>:

> Igniters, thanks for the comments.
>
> From the discussion, it looks like we need only two metrics for now:
> - cacheOperationsBlockedDuration (long)
> - totalCacheOperationsBlockedDuration (long)
>
> I will prepare a PR shortly.

Re: Re[2]: Partition map exchange metrics

Posted by Nikita Amelchev <ns...@gmail.com>.

Igniters, thanks for the comments.

From the discussion, it looks like we need only two metrics for now:
- cacheOperationsBlockedDuration (long)
- totalCacheOperationsBlockedDuration (long)
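
To make the difference between the two metrics concrete, here is a minimal
sketch in Java. It is not the Ignite implementation: the class name, the
onBlockingStarted/onBlockingFinished hooks and the injected clock are all
assumptions made for illustration; only the two metric names come from this
thread. The current metric is non-zero only while operations are blocked,
and the total metric grows when a blocking period ends.

```java
import java.util.function.LongSupplier;

/** Hypothetical holder for the two proposed metrics; not Ignite's actual code. */
final class PmeBlockingMetrics {
    /** Millisecond clock, injected so the sketch can be exercised deterministically. */
    private final LongSupplier clockMillis;

    /** True while cache operations are blocked by an exchange. */
    private boolean blocked;

    /** Clock value at the moment the current blocking period started. */
    private long blockedSince;

    /** Sum of all finished blocking periods, in milliseconds. */
    private long totalBlocked;

    PmeBlockingMetrics(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Assumed hook: called when an exchange starts blocking cache operations. */
    synchronized void onBlockingStarted() {
        if (!blocked) {
            blocked = true;
            blockedSince = clockMillis.getAsLong();
        }
    }

    /** Assumed hook: called when cache operations resume. */
    synchronized void onBlockingFinished() {
        if (blocked) {
            totalBlocked += clockMillis.getAsLong() - blockedSince;
            blocked = false;
        }
    }

    /** Duration of the current blocking period, or 0 if nothing is blocked. */
    synchronized long cacheOperationsBlockedDuration() {
        return blocked ? clockMillis.getAsLong() - blockedSince : 0;
    }

    /** Accumulated duration of all finished blocking periods. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return totalBlocked;
    }

    public static void main(String[] args) throws InterruptedException {
        PmeBlockingMetrics metrics = new PmeBlockingMetrics(System::currentTimeMillis);

        metrics.onBlockingStarted();
        Thread.sleep(50); // stands in for a blocking exchange
        System.out.println("current: " + metrics.cacheOperationsBlockedDuration());

        metrics.onBlockingFinished();
        System.out.println("total:   " + metrics.totalCacheOperationsBlockedDuration());
    }
}
```

With this shape, a monitoring system can alert on a non-zero current value
and graph the growth of the total value over any period, which also covers
the boolean "are we blocked right now" case discussed earlier.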

I will prepare a PR shortly.
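
To make the semantics of the two metrics concrete, here is a minimal sketch of how they could be tracked. It is not the actual Ignite implementation: the class name BlockedOpsTracker and the onBlockStart/onBlockEnd hooks are hypothetical, timestamps are passed in explicitly so the logic is easy to check, and the total is assumed to grow only when a blocking period ends, as suggested earlier in the thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical tracker for the two proposed metrics (illustration only, not Ignite code).
 * All timestamps and durations are in milliseconds and are supplied by the caller.
 */
final class BlockedOpsTracker {
    /** Sentinel meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Start of the current blocking period, or NOT_BLOCKED. */
    private final AtomicLong blockStartMs = new AtomicLong(NOT_BLOCKED);

    /** Sum of all completed blocking periods. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    /** Call when cache operations become blocked; repeated calls keep the first start time. */
    void onBlockStart(long nowMs) {
        blockStartMs.compareAndSet(NOT_BLOCKED, nowMs);
    }

    /** Call when cache operations resume; adds the finished period to the total. */
    void onBlockEnd(long nowMs) {
        long start = blockStartMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMs.addAndGet(Math.max(0L, nowMs - start));
    }

    /** cacheOperationsBlockedDuration: time since blocking started, 0 if not blocked. */
    long cacheOperationsBlockedDuration(long nowMs) {
        long start = blockStartMs.get();

        return start == NOT_BLOCKED ? 0L : Math.max(0L, nowMs - start);
    }

    /** totalCacheOperationsBlockedDuration: accumulated time of completed blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

With this shape, a non-zero current value answers "are we blocked right now", so a separate boolean metric is not needed, and the growth of the total over a time window gives the blocked time for SLA counting.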

Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <ar...@mail.ru.invalid>:
>
> +1 with Anton decisions.
>
>
> >Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <av...@apache.org>:
> >
> >Folks,
> >
> >It looks like we're trying to implement "extended debug" instead of
> >"monitoring".
> >It should not be interesting for real admin what phase of PME is in
> >progress and so on.
> >Interested metrics are
> >- total blocked time (will be used for real SLA counting)
> >- are we blocked right now (shows we have an SLA degradation right now)
> >Duration of the current blocking period can be easily presented using any
> >modern monitoring tool by regular checks.
> >The initial true will mean "period start"; precision will be a result of
> >checks frequency.
> >Anyway, I'm ok to have current metric presented with long, where long is a
> >duration, see no reason, but ok :)
> >
> >All other features you mentioned are useful for code or
> >deployment improving and can (should) be taken from logs at the analysis
> >phase.
> >
> >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glukos@gmail.com > wrote:
> >
> >> Folks, let me step in.
> >>
> >> Nikita, thanks for your suggestions!
> >>
> >> > 1. initialVersion. Topology version that initiates the exchange.
> >> > 2. initTime. Time PME was started.
> >> > 3. initEvent. Event that triggered PME.
> >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> >> > updates and translations on a previous topology.
> >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> >> > 6. recieveFullMessageTime. Time when a node received a full message.
> >> > 7. finishTime. Time PME was ended.
> >> >
> >> > When new PME started all these metrics resets.
> >> Every metric from Nikita's list looks useful and simple to implement.
> >> I think that it would be better to change format of metrics 4, 5, 6 and
> >> 7 a bit: we can keep only difference between time of previous event and
> >> time of corresponding event. Such metrics would be easier to perceive:
> >> they answer to specific questions "how much time did partition release
> >> take?" or "how much time did awaiting of distributed phase end take?".
> >> Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> >> graphs will show how different stages times change from one PME to another.
> >>
> >> > When PME cause no blocking, it's a good PME and I see no reason to have
> >> > monitoring related to it
> >> Agree with Anton here. These metrics should be measured only for true
> >> distributed exchange. Saving results for client leave/join PMEs will
> >> just complicate monitoring.
> >>
> >> > I agree with total blocking duration metric but
> >> > I still don't understand why instant value indicating that operations are
> >> > blocked should be boolean.
> >> > Duration time since blocking has started looks more appropriate and
> >> useful.
> >> > It gives more information while semantic is left the same.
> >> Totally agree with Pavel here. Both "accumulated block time" and
> >> "current PME block time" metrics are useful. Growth of accumulated
> >> metric for specific period of time (should be easy to check via
> >> monitoring system graph) will show for how much business operations were
> >> blocked in total, and non-zero current metric will show that we are
> >> experiencing issues right now. Boolean metric "are we blocked right now"
> >> is not needed as it's obviously can be inferred from "current PME block
> >> time".
> >>
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> >> > Nikita,
> >> >
> >> > I agree with total blocking duration metric but
> >> > I still don't understand why instant value indicating that operations are
> >> > blocked should be boolean.
> >> > Duration time since blocking has started looks more appropriate and
> >> useful.
> >> > It gives more information while semantic is left the same.
> >> >
> >> >
> >> >
> >> > Tue, Jul 23, 2019 at 11:42, Nikita Amelchev < nsamelchev@gmail.com >:
> >> >
> >> >> Folks,
> >> >>
> >> >> All previous suggestions have some disadvantages. It can be several
> >> >> exchanges between two metric updates and fast exchange can rewrite
> >> >> previous long exchange.
> >> >>
> >> >> We can introduce a metric of total blocking duration that will
> >> >> accumulate at the end of the exchange. So, users will get actual
> >> >> information about how long operations were blocked. Cluster metric
> >> >> will be a maximum of local nodes metrics. And we need a boolean metric
> >> >> that will indicate realtime status. It needs because of duration
> >> >> metric updates at the end of the exchange.
> >> >>
> >> >> So I propose to change the current metric that not released to the
> >> >> totalCacheOperationsBlockingDuration metric and to add the
> >> >> isCacheOperationsBlocked metric.
> >> >>
> >> >> WDYT?
> >> >>
> >> >> Mon, Jul 22, 2019 at 09:27, Anton Vinogradov < av@apache.org >:
> >> >>> Nikolay,
> >> >>>
> >> >>> Still see no reason to replace boolean with long.
> >> >>>
> >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < nizhikov@apache.org >
> >> >> wrote:
> >> >>>> Anton.
> >> >>>>
> >> >>>> 1. Value exported based on SPI settings, not in the moment it changed.
> >> >>>>
> >> >>>> 2. Clock synchronisation - if we export start time, we should also
> >> >> export
> >> >>>> node local timestamp.
> >> >>>>
> >> >>>> Mon, Jul 22, 2019, 8:33 Anton Vinogradov < av@apache.org >:
> >> >>>>
> >> >>>>> Folks,
> >> >>>>>
> >> >>>>> What's the reason for duration counting?
> >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> >> >>>>> Since the monitoring system checks metrics periodically, it will know the
> >> >>>>> duration by its own log.
> >> >>>>>
> >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < jokserfn@gmail.com >
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Nikita,
> >> >>>>>>
> >> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
> >> >>>>>> "cacheOperationsBlockingDuration", I think it cleaner represents
> >> >> what
> >> >>>> is
> >> >>>>>> blocked during PME.
> >> >>>>>> We can also combine both timestamp
> >> >> "cacheOperationsBlockingStartTs" and
> >> >>>>>> duration to have better correlation when cache operations were
> >> >> blocked
> >> >>>>> and
> >> >>>>>> how much time it's taken.
> >> >>>>>> For instant view (like in JMX bean) a calculated value as you
> >> >> mentioned
> >> >>>>>> can be used.
> >> >>>>>> For metrics are exported to some backend (IEP-35) a counter can be
> >> >>>> used.
> >> >>>>>> The counter is incremented by blocking time after blocking has
> >> >> ended.
> >> >>>>>> Fri, Jul 19, 2019 at 19:10, Nikita Amelchev < nsamelchev@gmail.com
> >> >>> :
> >> >>>>>>> Pavel,
> >> >>>>>>>
> >> >>>>>>> The main purpose of this metric is
> >> >>>>>>>>> how much time we wait for resuming cache operations
> >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> >>>>>>>>> What do you think if we change the boolean value of metric to a
> >> >>>> long
> >> >>>>>>> value that represents time in milliseconds when operations were
> >> >>>> blocked?
> >> >>>>>>> This time can be calculated as (currentTime -
> >> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
> >> >>>>>>>
> >> >>>>>>> Duration will be more understandable. It'll be something like
> >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> >> >>>>>>> name yet.
> >> >>>>>>>
> >> >>>>>>> Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko < jokserfn@gmail.com
> >> >>> :
> >> >>>>>>>> Nikita,
> >> >>>>>>>>
> >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information.
> >> >> The
> >> >>>>> main
> >> >>>>>>> PME side effect for end-users is blocking cache operations. Not
> >> >> all
> >> >>>> PME
> >> >>>>>>> time blocks it.
> >> >>>>>>>> What information gives to an end-user timestamp of
> >> >>>>>>> "timeSinceOperationsBlocked"? For what analysis it can be used and
> >> >>>> how?
> >> >>>>>>>> Fri, Jul 19, 2019 at 17:48, Nikita Amelchev <
> >> >>  nsamelchev@gmail.com
> >> >>>>> :
> >> >>>>>>>>> Hi Pavel,
> >> >>>>>>>>>
> >> >>>>>>>>> This time already can be obtained from the
> >> >> getCurrentPmeDuration
> >> >>>> and
> >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> >> >>>>>>>>>
> >> >>>>>>>>> As an alternative solution, I can rework recently added
> >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). Seems for
> >> >> users it
> >> >>>>>>>>> useless in case of non-blocking PME.
> >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp
> >> >> when
> >> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if
> >> >> blocking
> >> >>>>>>>>> ends (there is no running PME).
> >> >>>>>>>>>
> >> >>>>>>>>> WDYT?
> >> >>>>>>>>>
> >> >>>>>>>>> Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko <
> >> >>  jokserfn@gmail.com >:
> >> >>>>>>>>>> Hi Nikita,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thank you for working on this. What do you think if we
> >> >> change the
> >> >>>>>>> boolean
> >> >>>>>>>>>> value of metric to a long value that represents time in
> >> >>>>> milliseconds
> >> >>>>>>> when
> >> >>>>>>>>>> operations were blocked?
> >> >>>>>>>>>> Since we have not only JMX and now metrics are periodically
> >> >>>>> exported
> >> >>>>>>> to
> >> >>>>>>>>>> some backend it can give a more clear picture of how much
> >> >> time we
> >> >>>>>>> wait for
> >> >>>>>>>>>> resuming cache operations instead of instant boolean
> >> >> indicator.
> >> >>>>>>>>>> Fri, Jul 19, 2019 at 14:41, Nikita Amelchev <
> >> >>>>  nsamelchev@gmail.com
> >> >>>>>> :
> >> >>>>>>>>>>> Anton, Nikolay,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for the support.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that
> >> >> does
> >> >>>> not
> >> >>>>>>> show
> >> >>>>>>>>>>> influence on the cluster correctly. PME can be without
> >> >> blocking
> >> >>>>>>>>>>> operations. For example, client node join/leave events.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I suggest add new metric - isOperationsBlockedByPme().
> >> >>>> Together,
> >> >>>>>>> these
> >> >>>>>>>>>>> metrics will show influence of the PME on cluster and user
> >> >>>>>>> operations.
> >> >>>>>>>>>>> I have prepared PR for this (Bot visa is green). [1] Can
> >> >> anyone
> >> >>>>>>> take a
> >> >>>>>>>>>>> look?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov <
> >> >>>>>  nizhikov@apache.org
> >> >>>>>>>> :
> >> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
> >> >>>>> monitor
> >> >>>>>>> all
> >> >>>>>>>>>>> Ignite process, including non blocking PME.
> >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> >> >>>>>>>>>>>>> BTW,
> >> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
> >> >>>>>>>>>>>>> Seems, it shows exactly PME time and not so useful
> >> >> because
> >> >>>> of
> >> >>>>>>> this.
> >> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
> >> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see
> >> >> no
> >> >>>>>>> reason to have
> >> >>>>>>>>>>>>> monitoring related to it :)
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <
> >> >>>>>>>  nizhikov@apache.org >
> >> >>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>> Anton.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Why do we need to postpone implementation of this
> >> >>>> metrics?
> >> >>>>>>>>>>>>>> For now, implementation of new metric is very simple.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I think we can implement this metrics as a single
> >> >>>>>>> contribution.
> >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov
> >> >> wrote:
> >> >>>>>>>>>>>>>>> Nikita,
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Looks like all we need now is a 1 simple metric:
> >> >> are
> >> >>>>>>> operations
> >> >>>>>>>>>>> blocked?
> >> >>>>>>>>>>>>>>> Just a true or false.
> >> >>>>>>>>>>>>>>> Let's start from this.
> >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now
> >> >> and
> >> >>>> can
> >> >>>>> be
> >> >>>>>>>>>>> implemented
> >> >>>>>>>>>>>>>>> later.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <
> >> >>>>>>>>>>>  nizhikov@apache.org >
> >> >>>>>>>>>>>>>>> wrote:
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> +1.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Tue, Jul 16, 2019, 11:45 Nikita Amelchev <
> >> >>>>>>>  nsamelchev@gmail.com
> >> >>>>>>>>>>>> :
> >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> I suggest to add some useful metrics about the
> >> >>>>>>> partition map
> >> >>>>>>>>>>> exchange
> >> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages
> >> >>>> available
> >> >>>>>>> only in
> >> >>>>>>>>>>> log
> >> >>>>>>>>>>>>>> files
> >> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other
> >> >> external
> >> >>>>>>> tools. [1]
> >> >>>>>>>>>>>>>>>>> I made the list of local node metrics that
> >> >> help to
> >> >>>>>>> understand
> >> >>>>>>>>>>> the
> >> >>>>>>>>>>>>>>>>> actual status of current PME:
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that
> >> >> initiates
> >> >>>>> the
> >> >>>>>>>>>>> exchange.
> >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has
> >> >>>>> finished
> >> >>>>>>> waiting
> >> >>>>>>>>>>> for
> >> >>>>>>>>>>>>>> all
> >> >>>>>>>>>>>>>>>>> updates and translations on a previous
> >> >> topology.
> >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node
> >> >> sent a
> >> >>>>>>> single
> >> >>>>>>>>>>> message.
> >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node
> >> >>>> received
> >> >>>>> a
> >> >>>>>>> full
> >> >>>>>>>>>>> message.
> >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> When new PME started all these metrics resets.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> >> >>>>>>>>>>>>>>>>> - how long awaited for all updates was
> >> >> completed.
> >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single
> >> >>>> message)
> >> >>>>>>>>>>>>>>>>> - what triggered PME.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Thoughts?
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> [1]
> >> >>>>>  https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>>>>>>> --
> >> >>>>>>>>>>>>>>>>> Best wishes,
> >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Best wishes,
> >> >>>>>>>>>>> Amelchev Nikita
> >> >>>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Best wishes,
> >> >>>>>>>>> Amelchev Nikita
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Best wishes,
> >> >>>>>>> Amelchev Nikita
> >> >>>>>>>
> >> >>
> >> >>
> >> >> --
> >> >> Best wishes,
> >> >> Amelchev Nikita
> >> >>
> >>
>
>
> --
> Zhenya Stanilovsky



-- 
Best wishes,
Amelchev Nikita