Posted to dev@ignite.apache.org by Николай Ижиков <ni...@apache.org> on 2019/12/16 07:12:12 UTC

Cache operations performance metrics

Hello, Igniters.

I want to give users an answer to the following question: "How do cache API operations perform?"
It seems that, for this, we need to implement metrics for basic cache API operations such as get, put, and remove.

I think we should provide the following metrics:

* `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
    Implemented in [1], commit [2].

* `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].

* Histograms that measure the processing time of `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups).
    No ticket exists for this yet.
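
For readers of the archive, here is a minimal sketch of what a caller-side time histogram of this kind could look like. It is an illustration only and is not the code from [1] or [2]: the class name `OpTimeHistogram`, its methods, and the bucket bounds are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

/** Hypothetical bucketed time histogram; NOT Ignite's actual metric class. */
final class OpTimeHistogram {
    /** Upper bounds of the buckets in nanoseconds, in ascending order. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray counts;

    OpTimeHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Adds one duration to the first bucket whose bound is >= the duration. */
    void record(long durationNanos) {
        int i = 0;

        while (i < bounds.length && durationNanos > bounds[i])
            i++;

        counts.incrementAndGet(i);
    }

    /** Returns a copy of the current bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }

    /** Times an operation on the caller side and records the elapsed time. */
    <T> T timed(Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            record(System.nanoTime() - start);
        }
    }
}
```

For example, `new OpTimeHistogram(1_000_000L, 10_000_000L)` would give three buckets: up to 1 ms, up to 10 ms, and everything slower. A call such as `hist.timed(() -> cache.get(key))` would then count one `get` in the matching bucket, where `cache` and `key` stand for whatever the caller already has.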

What do you think?

[1] https://issues.apache.org/jira/browse/IGNITE-12219
[2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
[3] https://issues.apache.org/jira/browse/IGNITE-12450

Re: Re[2]: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
> but between having something and having nothing, I choose something

We already have "something": the put, get, etc. metrics. As I said earlier,
they are relatively useless. And the same metrics with histograms don't
add any value.

> I found one grid machine with very different I/O usage from the others; «dig deeper» highlighted a cache whose put operations differed greatly from those on other nodes, and a final «dig deeper» helped to find a code bug

I believe the same could be noticed using PK index stats.

> if the new one would be more useful, why not?

If some particular value is relatively useless, then a histogram of the
same value will still be relatively useless :) That's my point. Stop adding
dozens of metrics and start thinking about benefits and meaning. Discuss it
with the community.
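
As an illustration of the alternative raised earlier in this thread (measuring business operations in the user's context rather than cache API calls), a user-side timer might look roughly like the sketch below. Everything in it is hypothetical: `BusinessOpTimer`, `measure`, and the operation names are illustrative and are not part of Ignite.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical user-side timer keyed by business operation name; NOT an Ignite API. */
final class BusinessOpTimer {
    /** Total elapsed nanoseconds per operation name. */
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    /** Number of completed calls per operation name. */
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs the whole business operation and attributes its time to the given name. */
    <T> T measure(String opName, Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            totalNanos.computeIfAbsent(opName, k -> new LongAdder()).add(System.nanoTime() - start);
            calls.computeIfAbsent(opName, k -> new LongAdder()).increment();
        }
    }

    /** Returns how many times the operation has been measured (0 if never). */
    long calls(String opName) {
        LongAdder c = calls.get(opName);

        return c == null ? 0L : c.sum();
    }

    /** Returns the total time spent in the operation in nanoseconds (0 if never measured). */
    long totalNanos(String opName) {
        LongAdder t = totalNanos.get(opName);

        return t == null ? 0L : t.sum();
    }
}
```

The supplier passed to `measure` would contain the whole business transaction, so the elapsed time is attributed once to a named operation instead of being added to every cache involved.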


On Fri, Dec 20, 2019 at 4:59 PM Zhenya Stanilovsky
<ar...@mail.ru.invalid> wrote:
>
>
> >> Has it become slower or faster?
> >
> >A very good question! The user sees that the "cache put time" metric has
> >become 2x bigger. Has it become slower or faster? Or are we just putting
> >values into the cache that are 4x bigger in size? Or did we previously put
> >values locally, and now we put them on remote nodes? Or do our operations
> >execute in a transaction, so that the time depends on the transaction type,
> >the actions in the transaction and other transactions, and actually says
> >nothing about the real cache operation? We have more questions than answers.
>
> Andrey, I hope I understand your point of view here, but between having something and having nothing, I choose something; it is sometimes really helpful. A real-life case: I found one grid machine with very different I/O usage from the others; «dig deeper» highlighted a cache whose put operations differed greatly from those on other nodes, and a final «dig deeper» helped to find a code bug. To be clear, the old mechanism works OK for me here, but if the new one would be more useful, why not?
>
> >> On the other hand, if `PutTime` has increased, then we know for sure that all operations executing `put` have become slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков < nizhikov@apache.org > wrote:
> >>
> >> > It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user sees that the «checkpoint time» metric has become 2x bigger.
> >> How does it relate to business operations? Have they become slower or faster?
> >> What does it mean for application performance?
> >>
> >> On the other hand, if `PutTime` has increased, then we know for sure that all operations executing `put` have become slower.
> >>
> >> *Why* it has become slower is the essence of the «go deeper» investigation.
> >>
> >> > On Dec 20, 2019, at 15:07, Andrey Gura < agura@apache.org > wrote:
> >> >
> >> >> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
> >> >
> >> > It will also be visible in other metrics. So cache operation metrics
> >> > are still useless because they are transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can point to real problems.
> >> >
> >> >> We have already implemented it.
> >> >
> >> > I'm not saying that it isn't implemented. It is just an example of things
> >> > that should be measured. All other metrics depend on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков < nizhikov@apache.org > wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where is the sense in this value? I explained why these metrics are relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I am trying to explain why I think so.
> >> >>
> >> >>> But the user can't distinguish one transaction from another, so his knowledge is definitely of no use here.
> >> >>
> >> >> Users shouldn’t need to distinguish them.
> >> >> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can point to real problems.
> >> >>
> >> >> We have already implemented it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure the performance of the whole business transaction.
> >> >>
> >> >> I think we should provide a way to measure the part of the business transaction that relates to Ignite.
> >> >>
> >> >>
> >> >>> On Dec 20, 2019, at 13:02, Andrey Gura < agura@apache.org > wrote:
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
> >> >>>> They provide some kind of statistics (histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make the metrics more meaningful. Seriously :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> And what? Where is the sense in this value? I explained why these metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>
> >> >>> Maybe. But the user can't distinguish one transaction from another, so
> >> >>> his knowledge is definitely of no use here.
> >> >>>
> >> >>>> From these metrics, he can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
> >> >>>
> >> >>> Actually not. The same caches can be involved in dozens of
> >> >>> transactions, and there is no way to understand which transactions are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measuring cache operation performance, please share your vision.
> >> >>>
> >> >>> I already wrote about a better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >> >>> etc.) because they can point to real problems.
> >> >>> 2. Measure business operations in user context, not cache API operations.
> >> >>>
> >> >>> So what do we have? We have useless metrics that are duplicated by useless
> >> >>> histograms.
> >> >>>
> >> >>> We should reconsider our approach to metrics and performance measurement.
> >> >>> It is a hard and long task. There is no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm, but I really believe in my opinion. The metrics
> >> >>> problem has existed for a very long time, and the existing metrics have
> >> >>> been discussed many times. No one can explain these metrics to users
> >> >>> because that requires too much additional knowledge about internals. And
> >> >>> the metric value itself depends on many aspects of the internals. This
> >> >>> makes interpretation impossible. And it is a good time to remove them
> >> >>> (in AI 3.0, because of backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков < nizhikov.dev@gmail.com > wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
> >> >>>> They provide some kind of statistics (histograms) for it.
> >> >>>> For more fine-grained analysis, one would use tracing or other «go deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will be the same only for cases when the node is remote relative to the data
> >> >>>>
> >> >>>> Yes, the metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point of view.
> >> >>>>
> >> >>>> A regular user wants to know how fast his cache operations perform.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For a regular data node (server node), timing will depend on the answers to these questions:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I can barely imagine a scenario where someone monitors a «black box» cluster and doesn’t know them.
> >> >>>> Even so, all the answers are provided through the system views we brought to Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user knows what his transactions are.
> >> >>>> From these metrics, he can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measuring cache operation performance, please share your vision.
> >> >>>>
> >> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura < agura@apache.org > wrote:
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >> >>>>> internal components that could be useful for monitoring and analysis.
> >> >>>>> All the suggested options are meaningless in a sense. Below I'll try
> >> >>>>> to explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value measured
> >> >>>>> for a cache operation will vary depending on where the operation is
> >> >>>>> actually performed. Values will be the same only for cases when the node
> >> >>>>> is remote relative to the data (e.g. a client node).
> >> >>>>>
> >> >>>>> For a regular data node (server node), timing will depend on the answers to these questions:
> >> >>>>>
> >> >>>>> - is the node primary for a particular key or not? (for all operations)
> >> >>>>> - how many backups are configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for the particular cache?
> >> >>>>> (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Neither Ignite users nor Ignite developers can make any decision based
> >> >>>>> on these metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How is it calculated in
> >> >>>>> Ignite now? What actions are included in a transaction? What actions not
> >> >>>>> related to the cache are executed during transactions?
> >> >>>>>
> >> >>>>> There is no sense in the time of a transaction commit or rollback
> >> >>>>> because there is no way to understand which transaction was
> >> >>>>> performed in a particular period of time. Usually there are a lot of
> >> >>>>> transactions, and we can't distinguish them from each other.
> >> >>>>>
> >> >>>>> Moreover, a transaction is usually treated as a business operation. So
> >> >>>>> the only way to measure performance properly is to measure business
> >> >>>>> operation time. That is, the user should create their own set of
> >> >>>>> metrics for their business API.
> >> >>>>>
> >> >>>>> Further, what about cross-cache transactions? At the moment, tx
> >> >>>>> commit/rollback time will be added to the corresponding metrics of each
> >> >>>>> cache involved in the transaction. The *same time* for *each cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, neither Ignite users nor Ignite developers can make any decision
> >> >>>>> based on these metrics. But users can create their own set of metrics.
> >> >>>>>
> >> >>>>>> * Histograms that measure the processing time of `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups).
> >> >>>>>> No ticket exists for this yet.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis and measurement
> >> >>>>> are a little harder than they sound. Therefore, we must approach this
> >> >>>>> issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only fail to improve the
> >> >>>>> situation, but will also worsen the overall performance of the system
> >> >>>>> because metric calculation is always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev < nsamelchev@gmail.com > wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add the affinity-node cache metrics:
> >> >>>>>>>> * Histograms that measure the processing time of `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes (primary and backups). No ticket exists for this yet.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov < alexey.scherbakoff@gmail.com >:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков < nizhikov@apache.org >:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link to the ticket; I labeled it with the IEP-35 label.
> >> >>>>>>>> What do you think about the proposed set of metrics?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov < alexey.scherbakoff@gmail.com > wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For message processing, a ticket does exist and even has an
> >> >>>>>>>>> implementation that predates the new metrics API [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков < nizhikov@apache.org >:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to give users an answer to the following question: "How do cache
> >> >>>>>>>>>> API operations perform?"
> >> >>>>>>>>>> It seems that, for this, we need to implement metrics for basic cache API
> >> >>>>>>>>>> operations such as get, put, and remove.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * Histograms that measure the processing time of `get`, `put`, `remove`,
> >> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes (primary and backups).
> >> >>>>>>>>>> No ticket exists for this yet.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita
> >> >>>>
> >> >>
> >>
>
>
>
>

Re: Re[2]: Cache operations performance metrics

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

Let me chime in on this discussion.

If we are adding any new metrics, please make sure that they are accessible.

I would expect metrics to be printed to the console from time to time, at
least when they deviate from the norm. It would also help if they were
available as a Web Console screen, a system SQL view, or a command.sh
command, in that order.
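
A rough sketch of the "print when it deviates from the norm" idea follows. It is not an existing Ignite facility: `DeviationReporter`, its parameters, and the choice of an exponential moving average as the "norm" are assumptions made only to keep the idea concrete.

```java
import java.util.Optional;

/** Hypothetical helper: reports a metric value only when it deviates from a moving baseline. */
final class DeviationReporter {
    /** Relative deviation that triggers a report, e.g. 0.5 means 50%. */
    private final double tolerance;

    /** Smoothing factor of the exponential moving average, in (0, 1]. */
    private final double alpha;

    /** Current baseline; NaN until the first value is seen. */
    private double baseline = Double.NaN;

    DeviationReporter(double tolerance, double alpha) {
        this.tolerance = tolerance;
        this.alpha = alpha;
    }

    /** Returns a log line if the value deviates from the baseline, then updates the baseline. */
    synchronized Optional<String> check(String metricName, double value) {
        if (Double.isNaN(baseline)) {
            baseline = value; // The first observation only seeds the baseline.

            return Optional.empty();
        }

        double diff = Math.abs(value - baseline);

        // With a zero baseline any change counts as a deviation.
        boolean deviates = baseline == 0 ? diff > 0 : diff / Math.abs(baseline) > tolerance;

        Optional<String> msg = deviates
            ? Optional.of(metricName + " deviates from norm: value=" + value + ", baseline=" + baseline)
            : Optional.empty();

        baseline = alpha * value + (1 - alpha) * baseline;

        return msg;
    }
}
```

A periodic task could call `check` for each metric and write the returned line, if any, to the node log, so that the information ends up in the logs users already send.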

It would be ideal to start the discussion of any new metrics
Web-Console-screen-first. That is much easier to sell to the community.

Let me tell you about my skin in the game: as you know, I answer a large
number of user questions. I often ask users for logs, so I potentially
benefit from anything that is printed to logs, and I get zero benefit from
something that needs extensive pre-configuration, since users often
abandon their efforts before they set up any comprehensive monitoring
framework. So it would really help me if we removed useless messages from
logs and added insightful messages there.

This also goes for existing monitoring! If you think that we have enough
metrics available, please make them more accessible to remove the need for
discussion.

Regards,
-- 
Ilya Kasnacheev


пт, 20 дек. 2019 г. в 16:59, Zhenya Stanilovsky <arzamas123@mail.ru.invalid
>:

>
> >> Is it become slower or faster?
> >
> >Very correct question! User saw "cache put time" metric becomes x2
> >bigger. Does it become slower or faster? Or we just put into the cache
> >values that 4x bigger in size? Or all time before we put values
> >locally and now we put values on remote nodes. Or our operations
> >execute in transaction and then time will depend on transaction type,
> >actions in transaction and other transaction and actually will nothing
> >talk about real cache operation. We have more questions then answers.
>
> Andrey, i hope i understand your point of view here, but between to have
> something and have nothing i choose — something, it sometimes really
> helpful. From real life case: i found 1 grid machine with very different io
> usage than others, «dig deeper» highlight cache with very different
> from other nodes cache put operations and final «dig deeper» help to found
> code bug, but to be clear — old mechanism works ok for me here, if new one
> would be more useful — why not ?
>
> >> On the other hand - if `PuTime` increased - then we know for sure, all
> operation executing `put` becomes slower.
> >
> >Of course not :) See above.
> >
> >On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков < nizhikov@apache.org >
> wrote:
> >>
> >> > It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >> How it relates to business operations? Is it become slower or faster?
> >> What does it mean for an application performance?
> >>
> >> On the other hand - if `PuTime` increased - then we know for sure, all
> operation executing `put` becomes slower.
> >>
> >> *Why* it’s become slower - is the essence of «go deeper» investigation.
> >>
> >> > 20 дек. 2019 г., в 15:07, Andrey Gura < agura@apache.org >
> написал(а):
> >> >
> >> >> If a cache has some percent of the relatively slow transaction this
> is a trigger to make a deeper investigation.
> >> >
> >> > It also will be visible on other metrics. So cache operations metrics
> >> > still useless because it transitive values.
> >> >
> >> >>> 1. Measure some important internals (WAL operations, checkpoint
> time, etc) because it can talk about real problems.
> >> >
> >> >> We already implement it.
> >> >
> >> > I don't talk that it isn't implemented. It is just example of things
> >> > that should be measured. All other metrics depends on internals.
> >> >
> >> >>> 2. Measure business operations in user context, not cache API
> operations.
> >> >
> >> >> Why do you think these approaches should exclude one another?
> >> >
> >> > Because one of them is useless.
> >> >
> >> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков < nizhikov@apache.org
> > wrote:
> >> >>
> >> >> Hello, Andrey.
> >> >>
> >> >>> Where the sense in this value? I explained why this metrics are
> relatively useless.
> >> >>
> >> >> I don’t agree with you.
> >> >> I believe they are not useless for a user.
> >> >> And I try to explain why I think so.
> >> >>
> >> >>> But user can't distinguish one transaction from another, so his
> knowledge doesn't make sense definitely.
> >> >>
> >> >> Users shouldn’t distinguish.
> >> >> If a cache has some percent of the relatively slow transaction this
> is a trigger to make a deeper investigation.
> >> >>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint
> time, etc) because it can talk about real problems.
> >> >>
> >> >> We already implement it.
> >> >> What metrics are missing for internal processes?
> >> >>
> >> >>> 2. Measure business operations in user context, not cache API
> operations.
> >> >>
> >> >> Why do you think these approaches should exclude one another?
> >> >> Users definitely should measure whole business transaction
> performance.
> >> >>
> >> >> I think we should provide a way to measure part of the business
> transaction that relates to the Ignite.
> >> >>
> >> >>
> >> >>> 20 дек. 2019 г., в 13:02, Andrey Gura < agura@apache.org >
> написал(а):
> >> >>>
> >> >>>> The goal of the proposed metrics is to measure whole cache
> operations behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>
> >> >>> Nikolay, reformulating doesn't make metrics more meaningful.
> Seriously :)
> >> >>>
> >> >>>> Yes, metrics will evaluate API call performance
> >> >>>
> >> >>> And what? Where the sense in this value? I explained why this
> metrics
> >> >>> are relatively useless.
> >> >>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>
> >> >>> Again. It's just a number without any sense.
> >> >>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>
> >> >>> May be. But user can't distinguish one transaction from another, so
> >> >>> his knowledge doesn't make sense definitely.
> >> >>>
> >> >>>> From these metrics it can answer on the question «If my
> transaction includes cacheXXX, how long it usually takes?»
> >> >>>
> >> >>> Actually not. The same caches can be involved in a dozen of
> >> >>> transactions and there are no ways to understand what transactions
> are
> >> >>> slow or fast. It is useless.
> >> >>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations
> performance - please, share your vision.
> >> >>>
> >> >>> I already wrote about better approach. Two main points:
> >> >>>
> >> >>> 1. Measure some important internals (WAL operations, checkpoint
> time,
> >> >>> etc) because it can talk about real problems.
> >> >>> 2. Measure business operations in user context, not cache API
> operations.
> >> >>>
> >> >>> So what we have? We have useless metrics that are doubled by useless
> >> >>> histograms.
> >> >>>
> >> >>> We should reconsider approach to metrics and performance measuring.
> It
> >> >>> is hard and long task. There are no need to commit tons of useless
> >> >>> metrics that just decrease performance.
> >> >>>
> >> >>> Sorry for some sarcasm but I really believe in my opinion. Metrics
> >> >>> problem exists very very long time and existing metrics discussed
> many
> >> >>> times. No one can explain this metrics to users because it requires
> >> >>> too many additional knowledge about internals. And metric value
> >> >>> itself depends on many aspects of internals. It leads to
> impossibility
> >> >>> of interpretation. And it's good time to remove it (in AI 3.0 due
> to a
> >> >>> backward compatibility).
> >> >>>
> >> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <
> nizhikov.dev@gmail.com > wrote:
> >> >>>>
> >> >>>> Hello, Andrey.
> >> >>>>
> >> >>>> The goal of the proposed metrics is to measure whole cache
> operations behavior.
> >> >>>> It provides some kind of statistics(histograms) for it.
> >> >>>> For more fine-grained analysis one will be use tracing or other
> «go deeper» tools.
> >> >>>>
> >> >>>>>> Measured for API calls on the caller node side
> >> >>>>> Values will the same only for cases when node is remote relative
> to data
> >> >>>>
> >> >>>> Yes, metrics will evaluate API call performance.
> >> >>>> I think this is the most valuable information from a user's point
> of view.
> >> >>>>
> >> >>>> Regular user wants to know how fast his cache operation performs.
> >> >>>> And these metrics provide the answer.
> >> >>>>
> >> >>>>> For regular data node (server node) timing will depend on answers
> for question:
> >> >>>>
> >> >>>> I think these answers are always available.
> >> >>>> I barely can imagine a scenario when one monitor «black box»
> cluster and don’t know it.
> >> >>>> Even so, all answers are provided through system view we brought
> to the Ignite :)
> >> >>>>
> >> >>>>> What is transaction commit or rollback time?
> >> >>>>
> >> >>>> These are metrics of client-side operation performance.
> >> >>>>
> >> >>>> I think a specific user has knowledge - what are his transactions.
> >> >>>> From these metrics it can answer on the question «If my
> transaction includes cacheXXX, how long it usually takes?»
> >> >>>> I think it’s very valuable knowledge.
> >> >>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>
> >> >>>> Good, let’s do it?
> >> >>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> commit/rollback should be reverted.
> >> >>>>
> >> >>>> I disagree here.
> >> >>>> If you have a better approach to measure cache operations
> performance - please, share your vision.
> >> >>>>
> >> >>>>> 19 дек. 2019 г., в 16:03, Andrey Gura < agura@apache.org >
> написал(а):
> >> >>>>>
> >> >>>>> From my point of view, Ignite should provide meaningful metrics
> for
> >> >>>>> internal components that could be useful for monitoring and
> analysis.
> >> >>>>> All suggested options are meaningless in a sense. Below I'll try
> >> >>>>> explain why.
> >> >>>>>
> >> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls
> on the caller node side.
> >> >>>>>> Implemented in [1], commit [2].
> >> >>>>>
> >> >>>>> All cache operations in Ignite are distributed. So each value
> measured
> >> >>>>> for some cache operation will vary depending on where actually
> >> >>>>> operation is performed. Values will the same only for cases when
> node
> >> >>>>> is remote relative to data (e.g. client node).
> >> >>>>>
> >> >>>>> For regular data node (server node) timing will depend on answers
> for question:
> >> >>>>>
> >> >>>>> - is node primary for particular key or not? (for all operations)
> >> >>>>> - how many backups configured for the cache? (for put and remove)
> >> >>>>> - what write synchronization mode is configured for particular
> cache?
> >> >>>>> (for put and remove)
> >> >>>>> - is readFromBackup enabled for the cache? (for get)
> >> >>>>>
> >> >>>>> Both Ignite users and Ignite developers can't make any decision
> based
> >> >>>>> on this metrics.
> >> >>>>>
> >> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls
> on the caller node side [3].
> >> >>>>>
> >> >>>>> What is transaction commit or rollback time? How it calculates in
> >> >>>>> Ignite now? What actions included into transaction? What actions
> not
> >> >>>>> related with cache executed during transactions?
> >> >>>>>
> >> >>>>> There is no any sense in time of transaction commit or rollback
> >> >>>>> because there are no any way to understand what transaction was
> >> >>>>> performed in particular period of time. Usually a lot of
> transactions
> >> >>>>> and we can't to distinguish from each other.
> >> >>>>>
> >> >>>>> Moreover, transaction usually treats as business operation. So
> only
> >> >>>>> way to measure performance properly is measure business operation
> >> >>>>> time. That is user should create own metrics set for some business
> >> >>>>> API.
> >> >>>>>
> >> >>>>> Further. What about cross cache transactions? At the moment tx
> >> >>>>> commit/rollback time will be added to corresponding metrics per
> each
> >> >>>>> cache evolved to the transaction. The *same time* for *each
> cache*.
> >> >>>>> Absolutely meaningless.
> >> >>>>>
> >> >>>>> Again, both Ignite users and Ignite developers can't make any
> decision
> >> >>>>> based on this metrics. But users can create own metrics set.
> >> >>>>>
> >> >>>>>> * histograms that measure the time of processing `get`, `put`,
> `remove`, `commit`, `rollback` messages on affinity nodes(primary and
> backups).
> >> >>>>>> Ticket doesn't exist for it.
> >> >>>>>
> >> >>>>> It will be implemented for most types of messages.
> >> >>>>>
> >> >>>>> Metrics, application monitoring, performance analysis and
> measurement
> >> >>>>> are a a little harder than it sounds. Therefore, we must approach
> this
> >> >>>>> issue more carefully.
> >> >>>>> Blindly adding new types of metrics will not only not improve the
> >> >>>>> situation, but will also worsen the overall performance of the
> system
> >> >>>>> because metric calculation always on the hot path.
> >> >>>>>
> >> >>>>> So, from my point of view, commits for get/put/remove and
> >> >>>>> commit/rollback should be reverted.
> >> >>>>>
> >> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <
> nsamelchev@gmail.com > wrote:
> >> >>>>>>
> >> >>>>>> I think these metrics are useful.
> >> >>>>>>
> >> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >> >>>>>> Nikolay, could you take a look, please?
> >> >>>>>>
> >> >>>>>> If you do not mind, I will try to add affinity-nodes cache
> metrics:
> >> >>>>>>>> * histograms that measure the time of processing `get`, `put`,
> `remove`, `commit`, `rollback` messages on affinity nodes(primary and
> backups). Ticket doesn't exist for it.
> >> >>>>>>
> >> >>>>>> I have filed a ticket for it. [3]
> >> >>>>>>
> >> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
> >> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
> >> >>>>>>
> >> >>>>>> пн, 16 дек. 2019 г. в 11:07, Alexei Scherbakov <
> alexey.scherbakoff@gmail.com >:
> >> >>>>>>>
> >> >>>>>>> I think they are very useful.
> >> >>>>>>>
> >> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков < nizhikov@apache.org >:
> >> >>>>>>>
> >> >>>>>>>> Hello, Alexei.
> >> >>>>>>>>
> >> >>>>>>>> Thanks for the link to the ticket, labeled it with the IEP-35 label.
> >> >>>>>>>> What do you think about proposed metrics set?
> >> >>>>>>>>
> >> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
> >> >>>>>>>>  alexey.scherbakoff@gmail.com > wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> Nikolay,
> >> >>>>>>>>>
> >> >>>>>>>>> What about batch operations?
> >> >>>>>>>>>
> >> >>>>>>>>> For messages processing the ticket does exist and even has an
> >> >>>>>>>>> implementation from before new metrics API times [1]
> >> >>>>>>>>>
> >> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
> >> >>>>>>>>>
> >> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков < nizhikov@apache.org >:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I want to provide the user answers to the following question: "How cache
> >> >>>>>>>>>> API operations perform?"
> >> >>>>>>>>>> It seems, we need to implement metrics for basic cache API operations
> >> >>>>>>>>>> like get, put, remove for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I think we should provide the following metrics:
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side.
> >> >>>>>>>>>> Implemented in [1], commit [2].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >> >>>>>>>>>> caller node side [3].
> >> >>>>>>>>>>
> >> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
> >> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
> >> >>>>>>>>>> Ticket doesn't exist for it.
> >> >>>>>>>>>>
> >> >>>>>>>>>> What do you think?
> >> >>>>>>>>>>
> >> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
> >> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>>
> >> >>>>>>>>> Best regards,
> >> >>>>>>>>> Alexei Scherbakov
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>>
> >> >>>>>>> Best regards,
> >> >>>>>>> Alexei Scherbakov
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>> Best wishes,
> >> >>>>>> Amelchev Nikita
> >> >>>>
> >> >>
> >>

Re[2]: Cache operations performance metrics

Posted by Zhenya Stanilovsky <ar...@mail.ru.INVALID>.
>> Is it become slower or faster?
>
>Very correct question! User saw "cache put time" metric becomes x2
>bigger. Does it become slower or faster? Or we just put into the cache
>values that 4x bigger in size? Or all time before we put values
>locally and now we put values on remote nodes. Or our operations
>execute in transaction and then time will depend on transaction type,
>actions in transaction and other transaction and actually will nothing
>talk about real cache operation. We have more questions than answers.
 
Andrey, I hope I understand your point of view here, but between having something and having nothing I choose something; it is sometimes really helpful. A real-life case: I found one grid machine with very different IO usage than the others. Digging deeper highlighted a cache whose put operations differed greatly from those on other nodes, and digging further helped to find a code bug. To be clear, the old mechanism worked fine for me here, but if the new one would be more useful, why not?
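
For context, the difference between a single averaged timing value and the time histograms discussed in this thread can be sketched in plain Java. This is only an illustration under stated assumptions: `PutTimeSketch`, the bucket bounds and the sample values are hypothetical and are not taken from the Ignite metrics API.

```java
import java.util.Arrays;

/** Hypothetical bucketed time histogram; not the Ignite metrics API. */
public class PutTimeSketch {
    /** Upper bounds of the buckets, in milliseconds (assumed values). */
    static final long[] BOUNDS = {1, 10, 100};

    /** Counts samples per bucket; the last slot holds samples above every bound. */
    static long[] histogram(long[] samplesMs) {
        long[] counts = new long[BOUNDS.length + 1];
        for (long s : samplesMs) {
            int i = 0;
            while (i < BOUNDS.length && s > BOUNDS[i])
                i++;
            counts[i]++;
        }
        return counts;
    }

    /** Plain arithmetic mean, i.e. what a single "average put time" value reports. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0);
    }

    public static void main(String[] args) {
        long[] steady = {50, 50, 50, 50}; // every put is moderately slow
        long[] spiky = {1, 1, 1, 197};    // most puts are fast, one is very slow

        // Both sample sets have the same mean but fall into different buckets.
        System.out.println(average(steady) + " " + Arrays.toString(histogram(steady)));
        System.out.println(average(spiky) + " " + Arrays.toString(histogram(spiky)));
    }
}
```

Both sample sets average 50 ms, yet the bucket counts differ. Whether that extra detail is worth the cost on the hot path is exactly what is being debated here.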
 
>> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
>
>Of course not :) See above.
>
>On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков < nizhikov@apache.org > wrote:
>>
>> > It also will be visible on other metrics
>>
>> How will it be visible?
>>
>> For example, the user saw «checkpoint time» metric becomes x2 bigger.
>> How it relates to business operations? Is it become slower or faster?
>> What does it mean for an application performance?
>>
>> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
>>
>> *Why* it’s become slower - is the essence of «go deeper» investigation.
>>
>> > On Dec 20, 2019, at 15:07, Andrey Gura < agura@apache.org > wrote:
>> >
>> >> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
>> >
>> > It also will be visible on other metrics. So cache operations metrics
>> > still useless because it transitive values.
>> >
>> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
>> >
>> >> We already implement it.
>> >
>> > I don't talk that it isn't implemented. It is just example of things
>> > that should be measured. All other metrics depends on internals.
>> >
>> >>> 2. Measure business operations in user context, not cache API operations.
>> >
>> >> Why do you think these approaches should exclude one another?
>> >
>> > Because one of them is useless.
>> >
>> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков < nizhikov@apache.org > wrote:
>> >>
>> >> Hello, Andrey.
>> >>
>> >>> Where the sense in this value? I explained why this metrics are relatively useless.
>> >>
>> >> I don’t agree with you.
>> >> I believe they are not useless for a user.
>> >> And I try to explain why I think so.
>> >>
>> >>> But user can't distinguish one transaction from another, so his knowledge doesn't make sense definitely.
>> >>
>> >> Users shouldn’t distinguish.
>> >> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
>> >>
>> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
>> >>
>> >> We already implement it.
>> >> What metrics are missing for internal processes?
>> >>
>> >>> 2. Measure business operations in user context, not cache API operations.
>> >>
>> >> Why do you think these approaches should exclude one another?
>> >> Users definitely should measure whole business transaction performance.
>> >>
>> >> I think we should provide a way to measure part of the business transaction that relates to the Ignite.
>> >>
>> >>
>> >>> On Dec 20, 2019, at 13:02, Andrey Gura < agura@apache.org > wrote:
>> >>>
>> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
>> >>>> It provides some kind of statistics(histograms) for it.
>> >>>
>> >>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
>> >>>
>> >>>> Yes, metrics will evaluate API call performance
>> >>>
>> >>> And what? Where the sense in this value? I explained why this metrics
>> >>> are relatively useless.
>> >>>
>> >>>> These are metrics of client-side operation performance.
>> >>>
>> >>> Again. It's just a number without any sense.
>> >>>
>> >>>> I think a specific user has knowledge - what are his transactions.
>> >>>
>> >>> May be. But user can't distinguish one transaction from another, so
>> >>> his knowledge doesn't make sense definitely.
>> >>>
>> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
>> >>>
>> >>> Actually not. The same caches can be involved in a dozen of
>> >>> transactions and there are no ways to understand what transactions are
>> >>> slow or fast. It is useless.
>> >>>
>> >>>> I disagree here.
>> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
>> >>>
>> >>> I already wrote about better approach. Two main points:
>> >>>
>> >>> 1. Measure some important internals (WAL operations, checkpoint time,
>> >>> etc) because it can talk about real problems.
>> >>> 2. Measure business operations in user context, not cache API operations.
>> >>>
>> >>> So what we have? We have useless metrics that are doubled by useless
>> >>> histograms.
>> >>>
>> >>> We should reconsider approach to metrics and performance measuring. It
>> >>> is hard and long task. There are no need to commit tons of useless
>> >>> metrics that just decrease performance.
>> >>>
>> >>> Sorry for some sarcasm but I really believe in my opinion. Metrics
>> >>> problem exists very very long time and existing metrics discussed many
>> >>> times. No one can explain this metrics to users because it requires
>> >>> too many additional knowledge about internals. And metric value
>> >>> itself depends on many aspects of internals. It leads to impossibility
>> >>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
>> >>> backward compatibility).
>> >>>
>> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков < nizhikov.dev@gmail.com > wrote:
>> >>>>
>> >>>> Hello, Andrey.
>> >>>>
>> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
>> >>>> It provides some kind of statistics(histograms) for it.
>> >>>> For more fine-grained analysis one will be use tracing or other «go deeper» tools.
>> >>>>
>> >>>>>> Measured for API calls on the caller node side
>> >>>>> Values will the same only for cases when node is remote relative to data
>> >>>>
>> >>>> Yes, metrics will evaluate API call performance.
>> >>>> I think this is the most valuable information from a user's point of view.
>> >>>>
>> >>>> Regular user wants to know how fast his cache operation performs.
>> >>>> And these metrics provide the answer.
>> >>>>
>> >>>>> For regular data node (server node) timing will depend on answers for question:
>> >>>>
>> >>>> I think these answers are always available.
>> >>>> I barely can imagine a scenario when one monitor «black box» cluster and don’t know it.
>> >>>> Even so, all answers are provided through system view we brought to the Ignite :)
>> >>>>
>> >>>>> What is transaction commit or rollback time?
>> >>>>
>> >>>> These are metrics of client-side operation performance.
>> >>>>
>> >>>> I think a specific user has knowledge - what are his transactions.
>> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
>> >>>> I think it’s very valuable knowledge.
>> >>>>
>> >>>>> It will be implemented for most types of messages.
>> >>>>
>> >>>> Good, let’s do it?
>> >>>>
>> >>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
>> >>>>
>> >>>> I disagree here.
>> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
>> >>>>
>> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura < agura@apache.org > wrote:
>> >>>>>
>> >>>>> From my point of view, Ignite should provide meaningful metrics for
>> >>>>> internal components that could be useful for monitoring and analysis.
>> >>>>> All suggested options are meaningless in a sense. Below I'll try
>> >>>>> explain why.
>> >>>>>
>> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
>> >>>>>> Implemented in [1], commit [2].
>> >>>>>
>> >>>>> All cache operations in Ignite are distributed. So each value measured
>> >>>>> for some cache operation will vary depending on where actually
>> >>>>> operation is performed. Values will the same only for cases when node
>> >>>>> is remote relative to data (e.g. client node).
>> >>>>>
>> >>>>> For regular data node (server node) timing will depend on answers for question:
>> >>>>>
>> >>>>> - is node primary for particular key or not? (for all operations)
>> >>>>> - how many backups configured for the cache? (for put and remove)
>> >>>>> - what write synchronization mode is configured for particular cache?
>> >>>>> (for put and remove)
>> >>>>> - is readFromBackup enabled for the cache? (for get)
>> >>>>>
>> >>>>> Both Ignite users and Ignite developers can't make any decision based
>> >>>>> on this metrics.
>> >>>>>
>> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
>> >>>>>
>> >>>>> What is transaction commit or rollback time? How it calculates in
>> >>>>> Ignite now? What actions included into transaction? What actions not
>> >>>>> related with cache executed during transactions?
>> >>>>>
>> >>>>> There is no any sense in time of transaction commit or rollback
>> >>>>> because there are no any way to understand what transaction was
>> >>>>> performed in particular period of time. Usually a lot of transactions
>> >>>>> and we can't to distinguish from each other.
>> >>>>>
>> >>>>> Moreover, transaction usually treats as business operation. So only
>> >>>>> way to measure performance properly is measure business operation
>> >>>>> time. That is user should create own metrics set for some business
>> >>>>> API.
>> >>>>>
>> >>>>> Further. What about cross cache transactions? At the moment tx
>> >>>>> commit/rollback time will be added to corresponding metrics per each
>> >>>>> cache evolved to the transaction. The *same time* for *each cache*.
>> >>>>> Absolutely meaningless.
>> >>>>>
>> >>>>> Again, both Ignite users and Ignite developers can't make any decision
>> >>>>> based on this metrics. But users can create own metrics set.
>> >>>>>
>> >>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
>> >>>>>> Ticket doesn't exist for it.
>> >>>>>
>> >>>>> It will be implemented for most types of messages.
>> >>>>>
>> >>>>> Metrics, application monitoring, performance analysis and measurement
>> >>>>> are a little harder than it sounds. Therefore, we must approach this
>> >>>>> issue more carefully.
>> >>>>> Blindly adding new types of metrics will not only not improve the
>> >>>>> situation, but will also worsen the overall performance of the system
>> >>>>> because metric calculation always on the hot path.
>> >>>>>
>> >>>>> So, from my point of view, commits for get/put/remove and
>> >>>>> commit/rollback should be reverted.
>> >>>>>
>> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev < nsamelchev@gmail.com > wrote:
>> >>>>>>
>> >>>>>> I think these metrics are useful.
>> >>>>>>
>> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>> >>>>>> Nikolay, could you take a look, please?
>> >>>>>>
>> >>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>> >>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
>> >>>>>>
>> >>>>>> I have filed a ticket for it. [3]
>> >>>>>>
>> >>>>>> [1]  https://github.com/apache/ignite/pull/7141
>> >>>>>> [2]  https://issues.apache.org/jira/browse/IGNITE-12450
>> >>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12453
>> >>>>>>
>> >>>>>> Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov < alexey.scherbakoff@gmail.com >:
>> >>>>>>>
>> >>>>>>> I think they are very useful.
>> >>>>>>>
>> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков < nizhikov@apache.org >:
>> >>>>>>>
>> >>>>>>>> Hello, Alexei.
>> >>>>>>>>
>> >>>>>>>> Thanks for the link to the ticket, labeled it with the IEP-35 label.
>> >>>>>>>> What do you think about proposed metrics set?
>> >>>>>>>>
>> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
>> >>>>>>>>  alexey.scherbakoff@gmail.com > wrote:
>> >>>>>>>>>
>> >>>>>>>>> Nikolay,
>> >>>>>>>>>
>> >>>>>>>>> What about batch operations?
>> >>>>>>>>>
>> >>>>>>>>> For messages processing the ticket does exist and even has an
>> >>>>>>>>> implementation from before new metrics API times [1]
>> >>>>>>>>>
>> >>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-10418
>> >>>>>>>>>
>> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков < nizhikov@apache.org >:
>> >>>>>>>>>
>> >>>>>>>>>> Hello, Igniters.
>> >>>>>>>>>>
>> >>>>>>>>>> I want to provide the user answers to the following question: "How cache
>> >>>>>>>>>> API operations perform?"
>> >>>>>>>>>> It seems, we need to implement metrics for basic cache API operations
>> >>>>>>>>>> like get, put, remove for it.
>> >>>>>>>>>>
>> >>>>>>>>>> I think we should provide the following metrics:
>> >>>>>>>>>>
>> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>> >>>>>>>>>> caller node side.
>> >>>>>>>>>> Implemented in [1], commit [2].
>> >>>>>>>>>>
>> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>> >>>>>>>>>> caller node side [3].
>> >>>>>>>>>>
>> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>> >>>>>>>>>> Ticket doesn't exist for it.
>> >>>>>>>>>>
>> >>>>>>>>>> What do you think?
>> >>>>>>>>>>
>> >>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-12219
>> >>>>>>>>>> [2]  https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>> >>>>>>>>>> [3]  https://issues.apache.org/jira/browse/IGNITE-12450
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>>
>> >>>>>>>>> Best regards,
>> >>>>>>>>> Alexei Scherbakov
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>>
>> >>>>>>> Best regards,
>> >>>>>>> Alexei Scherbakov
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Best wishes,
>> >>>>>> Amelchev Nikita
>> >>>>
>> >>
>> 
 
 
 
 

Re: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
>> And if we have two metrics that are triggered for the same then one of them is useless.
> I don't understand which two metrics you are talking about.

Please, don't lose context. In your example it was checkpoint time
and some cache operation time.

> A business transaction includes work with several data sources, sending network requests, executing some remote services.
> If it becomes slower then we should know - what basic API operations become slower.

No, we should know which basic operations become slower. It may be a problem
with the network (net IO), with the disk (disk IO), with the JVM (VM internal
metrics), etc. All these operations are the building blocks of API operations.

> So we have to measure  `PutTime` from Ignite, `InsertTime` from RDBMS and other parts of a transaction.

We can't do it properly because of the specifics of the transaction implementation.
I already wrote about it. It doesn't mean that we must not fix it. But
it also means that we should reconsider the approach to metrics in Ignite.
Bigger doesn't mean better.
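
The transaction-specific issue referred to here is the one described earlier in the thread: the commit time of a cross-cache transaction is added to the metrics of each cache involved. A minimal plain-Java sketch of that accounting, next to a single business-level measurement, might look like the following. It is an assumption-laden illustration and not Ignite code: `TxAttributionSketch`, `recordCommit`, `recordBusinessOp`, the cache names and the 40 ms value are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; names and structure are not taken from Ignite. */
public class TxAttributionSketch {
    /** Total commit time recorded per cache name, in milliseconds. */
    final Map<String, Long> commitTimeByCache = new HashMap<>();

    /** Total time recorded per business operation name, in milliseconds. */
    final Map<String, Long> timeByBusinessOp = new HashMap<>();

    /** Per-cache accounting: the whole commit time is added to every enlisted cache. */
    void recordCommit(List<String> enlistedCaches, long commitMs) {
        for (String cache : enlistedCaches)
            commitTimeByCache.merge(cache, commitMs, Long::sum);
    }

    /** Business-level accounting: the time is recorded once, under the operation name. */
    void recordBusinessOp(String opName, long durationMs) {
        timeByBusinessOp.merge(opName, durationMs, Long::sum);
    }

    public static void main(String[] args) {
        TxAttributionSketch sketch = new TxAttributionSketch();

        // One transaction that touches two caches and takes 40 ms to commit.
        sketch.recordCommit(List.of("orders", "accounts"), 40);
        sketch.recordBusinessOp("placeOrder", 40);

        System.out.println(sketch.commitTimeByCache);
        System.out.println(sketch.timeByBusinessOp);
    }
}
```

In this sketch the same 40 ms shows up under both `orders` and `accounts`, so neither per-cache value says how much of the time belongs to that cache, while the business-level entry records it once.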

> Ignite cache operations obviously become 2 times slower.
> *Why* they become slower is the question of an ongoing investigation.

But business operation metrics will show the same. And many other
internals-related metrics will show the same. A metric is transitive,
redundant and relatively useless if it doesn't bring any new
information. With 500 caches with similar configurations (the same
nodes, the same data region, the same affinity, etc.), 500 metrics
like put time will show the same trend. And a couple of system
internal metrics will show the same trend too. A couple vs. hundreds.
It doesn't make sense and is useless.

> I tried to look at other open-source products.
> Here is an example of metrics provided by Apache Kafka [1] [2]

If somebody does something, it doesn't mean that they do it properly.

On Fri, Dec 20, 2019 at 4:28 PM Николай Ижиков <ni...@apache.org> wrote:
>
> > And if we have two metrics that are triggered for the same then one of them is useless.
>
> I don't understand which two metrics you are talking about.
> I wrote about a single metric for a single cache operation.
>
> > Obviously if you want know how fast or slow your business operation you must measure latency of your business operation. What could be easier?
>
> A business transaction includes work with several data sources, sending network requests, executing some remote services.
> If it becomes slower then we should know - what basic API operations become slower.
> So we have to measure  `PutTime` from Ignite, `InsertTime` from RDBMS and other parts of a transaction.
>
> Ignite will provide this kind of value out of the box.
> I think these are useful values.
>
> > User saw "cache put time" metric becomes x2 bigger. Does it become slower or faster? Or we just put into the cache values that 4x bigger in size?
>
> Ignite cache operations obviously become 2 times slower.
> *Why* they become slower is the question of an ongoing investigation.
>
> I tried to look at other open-source products.
> Here is an example of metrics provided by Apache Kafka [1] [2]
>
> `request-latency-avg` - The average request latency in ms.
> `records-lag-max` - The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
> `fetch-latency-avg` - The average time taken for a fetch request.
>
> It seems, they implement a similar approach to what I proposed.
>
>
> [1] https://docs.confluent.io/current/kafka/monitoring.html#producer-metrics
> [2] https://docs.confluent.io/current/kafka/monitoring.html#new-consumer-metrics
>
> > On Dec 20, 2019, at 15:53, Andrey Gura <ag...@apache.org> wrote:
> >
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >
> > I just quote your words: " this is a trigger to make a deeper
> > investigation". And if we have two metrics that are triggered for the
> > same then one of them is useless.
> >
> >> How it relates to business operations?
> >
> > Why it should be related with business operation? It is concrete
> > metrics for concrete process which can slowdown all operations in the
> > system. Obviously if you want know how fast or slow your business
> > operation you must measure latency of your business operation. What
> > could be easier?
> >
> >> Is it become slower or faster?
> >
> > Very correct question! User saw "cache put time" metric becomes x2
> > bigger. Does it become slower or faster? Or we just put into the cache
> > values that 4x bigger in size? Or all time before we put values
> > locally and now we put values on remote nodes. Or our operations
> > execute in transaction and then time will depend on transaction type,
> > actions in transaction and other transaction and actually will nothing
> > talk about real cache operation. We have more questions than answers.
> >
> >> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
> >
> > Of course not :) See above.
> >
> > On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <ni...@apache.org> wrote:
> >>
> >>> It also will be visible on other metrics
> >>
> >> How will it be visible?
> >>
> >> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> >> How it relates to business operations? Is it become slower or faster?
> >> What does it mean for an application performance?
> >>
> >> On the other hand - if `PutTime` increased - then we know for sure, all operations executing `put` become slower.
> >>
> >> *Why* it’s become slower - is the essence of «go deeper» investigation.
> >>
> >>> On Dec 20, 2019, at 15:07, Andrey Gura <ag...@apache.org> wrote:
> >>>
> >>>> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
> >>>
> >>> It also will be visible on other metrics. So cache operations metrics
> >>> still useless because it transitive values.
> >>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
> >>>
> >>>> We already implement it.
> >>>
> >>> I don't talk that it isn't implemented. It is just example of things
> >>> that should be measured. All other metrics depends on internals.
> >>>
> >>>>> 2. Measure business operations in user context, not cache API operations.
> >>>
> >>>> Why do you think these approaches should exclude one another?
> >>>
> >>> Because one of them is useless.
> >>>
> >>> On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <ni...@apache.org> wrote:
> >>>>
> >>>> Hello, Andrey.
> >>>>
> >>>>> Where the sense in this value? I explained why this metrics are relatively useless.
> >>>>
> >>>> I don’t agree with you.
> >>>> I believe they are not useless for a user.
> >>>> And I try to explain why I think so.
> >>>>
> >>>>> But user can't distinguish one transaction from another, so his knowledge doesn't make sense definitely.
> >>>>
> >>>> Users shouldn’t distinguish.
> >>>> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
> >>>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
> >>>>
> >>>> We already implement it.
> >>>> What metrics are missing for internal processes?
> >>>>
> >>>>> 2. Measure business operations in user context, not cache API operations.
> >>>>
> >>>> Why do you think these approaches should exclude one another?
> >>>> Users definitely should measure whole business transaction performance.
> >>>>
> >>>> I think we should provide a way to measure part of the business transaction that relates to the Ignite.
> >>>>
> >>>>
> >>>>> On Dec 20, 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
> >>>>>
> >>>>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >>>>>> It provides some kind of statistics(histograms) for it.
> >>>>>
> >>>>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
> >>>>>
> >>>>>> Yes, metrics will evaluate API call performance
> >>>>>
> >>>>> And what? Where the sense in this value? I explained why this metrics
> >>>>> are relatively useless.
> >>>>>
> >>>>>> These are metrics of client-side operation performance.
> >>>>>
> >>>>> Again. It's just a number without any sense.
> >>>>>
> >>>>>> I think a specific user has knowledge - what are his transactions.
> >>>>>
> >>>>> May be. But user can't distinguish one transaction from another, so
> >>>>> his knowledge doesn't make sense definitely.
> >>>>>
> >>>>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >>>>>
> >>>>> Actually not. The same caches can be involved  in a dozen of
> >>>>> transactions and there are no ways to understand what transactions are
> >>>>> slow or fast. It is useless.
> >>>>>
> >>>>>> I disagree here.
> >>>>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >>>>>
> >>>>> I already wrote about better approach. Two main points:
> >>>>>
> >>>>> 1. Measure some important internals (WAL operations, checkpoint time,
> >>>>> etc) because it can talk about real problems.
> >>>>> 2. Measure business operations in user context, not cache API operations.
> >>>>>
> >>>>> So  what we have? We have useless metrics that are doubled by useless
> >>>>> histograms.
> >>>>>
> >>>>> We should reconsider approach to metrics and performance measuring. It
> >>>>> is hard and long task. There are no need to commit tons of useless
> >>>>> metrics that just decrease performance.
> >>>>>
> >>>>> Sorry for some sarcasm but I really believe in my opinion. Metrics
> >>>>> problem exists very very long time and existing metrics discussed many
> >>>>> times. No one can explain this metrics to users because it requires
> >>>>> too many additional knowledge about internals. And metric  value
> >>>>> itself depends on many aspects of internals. It leads to impossibility
> >>>>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
> >>>>> backward compatibility).
> >>>>>
> >>>>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <ni...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hello, Andrey.
> >>>>>>
> >>>>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >>>>>> It provides some kind of statistics(histograms) for it.
> >>>>>> For more fine-grained analysis one will be use tracing or other «go deeper» tools.
> >>>>>>
> >>>>>>>> Measured for API calls on the caller node side
> >>>>>>> Values will the same only for cases when node is remote relative to data
> >>>>>>
> >>>>>> Yes, metrics will evaluate API call performance.
> >>>>>> I think this is the most valuable information from a user's point of view.
> >>>>>>
> >>>>>> Regular user wants to know how fast his cache operation performs.
> >>>>>> And these metrics provide the answer.
> >>>>>>
> >>>>>>> For regular data node (server node) timing will depend on answers for question:
> >>>>>>
> >>>>>> I think these answers are always available.
> >>>>>> I barely can imagine a scenario when one monitor «black box» cluster and don’t know it.
> >>>>>> Even so, all answers are provided through system view we brought to the Ignite :)
> >>>>>>
> >>>>>>> What is transaction commit or rollback time?
> >>>>>>
> >>>>>> These are metrics of client-side operation performance.
> >>>>>>
> >>>>>> I think a specific user has knowledge - what are his transactions.
> >>>>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >>>>>> I think it’s very valuable knowledge.
> >>>>>>
> >>>>>>> It will be implemented for most types of messages.
> >>>>>>
> >>>>>> Good, let’s do it?
> >>>>>>
> >>>>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
> >>>>>>
> >>>>>> I disagree here.
> >>>>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >>>>>>
> >>>>>>> On Dec 19, 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:
> >>>>>>>
> >>>>>>> From my point of view, Ignite should provide meaningful metrics for
> >>>>>>> internal components that could be useful for monitoring and analysis.
> >>>>>>> All suggested options are meaningless in a sense. Below I'll try
> >>>>>>> explain why.
> >>>>>>>
> >>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
> >>>>>>>> Implemented in [1], commit [2].
> >>>>>>>
> >>>>>>> All cache operations in Ignite are distributed. So each value measured
> >>>>>>> for some cache operation will vary depending on where actually
> >>>>>>> operation is performed. Values will the same only for cases when node
> >>>>>>> is remote relative to data (e.g. client node).
> >>>>>>>
> >>>>>>> For regular data node (server node) timing will depend on answers for question:
> >>>>>>>
> >>>>>>> - is node primary for particular key or not? (for all operations)
> >>>>>>> - how many backups configured for the cache? (for put and remove)
> >>>>>>> - what write synchronization mode is configured for particular cache?
> >>>>>>> (for put and remove)
> >>>>>>> - is readFromBackup enabled for the cache? (for get)
> >>>>>>>
> >>>>>>> Both Ignite users and Ignite developers can't make any decision based
> >>>>>>> on this metrics.
> >>>>>>>
> >>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
> >>>>>>>
> >>>>>>> What is transaction commit or rollback time? How it calculates in
> >>>>>>> Ignite now? What actions included into transaction? What actions not
> >>>>>>> related with cache executed during transactions?
> >>>>>>>
> >>>>>>> There is no any sense in time of transaction commit or rollback
> >>>>>>> because there are no any way to understand what transaction was
> >>>>>>> performed in particular period of time. Usually a lot of transactions
> >>>>>>> and we can't to distinguish from each other.
> >>>>>>>
> >>>>>>> Moreover, transaction usually treats as business operation. So only
> >>>>>>> way to measure performance properly is measure business operation
> >>>>>>> time. That is user should create own metrics set for some business
> >>>>>>> API.
> >>>>>>>
> >>>>>>> Further. What about cross cache transactions? At the moment tx
> >>>>>>> commit/rollback time will be added to corresponding metrics per each
> >>>>>>> cache evolved to the transaction. The *same time* for *each cache*.
> >>>>>>> Absolutely meaningless.
> >>>>>>>
> >>>>>>> Again, both Ignite users and Ignite developers can't make any decision
> >>>>>>> based on this metrics. But users can create own metrics set.
> >>>>>>>
> >>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
> >>>>>>>> Ticket doesn't exist for it.
> >>>>>>>
> >>>>>>> It will be implemented for most types of messages.
> >>>>>>>
> >>>>>>> Metrics, application monitoring, performance analysis and measurement
> >>>>>>> are a little harder than it sounds. Therefore, we must approach this
> >>>>>>> issue more carefully.
> >>>>>>> Blindly adding new types of metrics will not only not improve the
> >>>>>>> situation, but will also worsen the overall performance of the system
> >>>>>>> because metric calculation always on the hot path.
> >>>>>>>
> >>>>>>> So, from my point of view, commits for get/put/remove and
> >>>>>>> commit/rollback should be reverted.
> >>>>>>>
> >>>>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <ns...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> I think these metrics are useful.
> >>>>>>>>
> >>>>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >>>>>>>> Nikolay, could you take a look, please?
> >>>>>>>>
> >>>>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
> >>>>>>>>
> >>>>>>>> I have filed a ticket for it. [3]
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/ignite/pull/7141
> >>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
> >>>>>>>>
> >>>>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <al...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> I think they are very useful.
> >>>>>>>>>
> >>>>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <ni...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello, Alexei.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for the link to the ticket, I labeled it with the IEP-35 label.
> >>>>>>>>>> What do you think about proposed metrics set?
> >>>>>>>>>>
> >>>>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
> >>>>>>>>>> alexey.scherbakoff@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Nikolay,
> >>>>>>>>>>>
> >>>>>>>>>>> What about batch operations?
> >>>>>>>>>>>
> >>>>>>>>>>> For messages processing the ticket does exist and even has an
> >>>>>>>>>>> implementation from before new metrics API times [1]
> >>>>>>>>>>>
> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <ni...@apache.org> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello, Igniters.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I want to provide the user answers to the following question: "How cache
> >>>>>>>>>>>> API operations perform?"
> >>>>>>>>>>>> It seems, we need to implements metrics for basic cache API operations
> >>>>>>>>>>>> like get, put, remove for it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think we should provide the following metrics:
> >>>>>>>>>>>>
> >>>>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >>>>>>>>>>>> caller node side.
> >>>>>>>>>>>> Implemented in [1], commit [2].
> >>>>>>>>>>>>
> >>>>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >>>>>>>>>>>> caller node side [3].
> >>>>>>>>>>>>
> >>>>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
> >>>>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
> >>>>>>>>>>>> Ticket doesn't exist for it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
> >>>>>>>>>>>> [2]
> >>>>>>>>>>>>
> >>>>>>>>>> https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>>
> >>>>>>>>>>> Best regards,
> >>>>>>>>>>> Alexei Scherbakov
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alexei Scherbakov
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best wishes,
> >>>>>>>> Amelchev Nikita
> >>>>>>
> >>>>
> >>
>

Re: Cache operations performance metrics

Posted by Николай Ижиков <ni...@apache.org>.
> And if we have two metrics that are triggered for the same then one of them is useless.

I don't understand which two metrics you are talking about.
I wrote about a single metric for a single cache operation.

> Obviously if you want know how fast or slow your business operation you must measure latency of your business operation. What could be easier?

A business transaction includes work with several data sources, sending network requests, and executing some remote services.
If it becomes slower, then we should know which basic API operations became slower.
So we have to measure `PutTime` from Ignite, `InsertTime` from the RDBMS, and the other parts of the transaction.

Ignite will provide this kind of value out of the box.
I think these are useful values.
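
As a rough sketch of what I mean (plain Java, not the Ignite metrics API): each part of a business transaction could be timed into its own histogram. The names `LatencyHistogram` and `BusinessOperationExample`, the bucket bounds, and the `ConcurrentHashMap` standing in for a cache are all hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.Supplier;

/** Hypothetical helper (not an Ignite class): latency histogram with fixed bucket bounds. */
final class LatencyHistogram {
    /** Upper bounds of the buckets in nanoseconds, ascending. */
    private final long[] boundsNanos;

    /** Hit counters; the last element counts values above the highest bound. */
    private final AtomicLongArray counts;

    LatencyHistogram(long... boundsNanos) {
        this.boundsNanos = boundsNanos.clone();
        this.counts = new AtomicLongArray(boundsNanos.length + 1);
    }

    /** Adds one measurement to the first bucket whose upper bound is not exceeded. */
    void record(long durationNanos) {
        int i = 0;

        while (i < boundsNanos.length && durationNanos > boundsNanos[i])
            i++;

        counts.incrementAndGet(i);
    }

    /** Runs the operation and records how long it took, even if it throws. */
    <T> T time(Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            record(System.nanoTime() - start);
        }
    }

    /** Returns a copy of the current bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }
}

final class BusinessOperationExample {
    public static void main(String[] args) {
        // Assumed bucket bounds: 1 ms, 10 ms, 100 ms.
        LatencyHistogram putTime = new LatencyHistogram(1_000_000L, 10_000_000L, 100_000_000L);

        // Stand-in for a cache; a real application would call its cache API here.
        Map<Integer, String> cache = new ConcurrentHashMap<>();

        putTime.time(() -> cache.put(1, "value"));

        System.out.println("put time buckets: " + Arrays.toString(putTime.snapshot()));
    }
}
```

A second histogram of the same kind around the RDBMS insert would give the `InsertTime` part, so a slowdown of the whole business transaction can be attributed to one of its parts.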

> User saw "cache put time" metric becomes x2 bigger. Does it become slower or faster? Or we just put into the cache values that 4x bigger in size?

Ignite cache operations have obviously become 2 times slower.
*Why* they became slower is a question for further investigation.

I looked at other open-source products.
Here is an example of metrics provided by Apache Kafka [1] [2]:

`request-latency-avg` - The average request latency in ms.
`records-lag-max` - The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
`fetch-latency-avg` - The average time taken for a fetch request.

It seems they implement an approach similar to what I proposed.


[1] https://docs.confluent.io/current/kafka/monitoring.html#producer-metrics
[2] https://docs.confluent.io/current/kafka/monitoring.html#new-consumer-metrics

> On Dec 20, 2019, at 15:53, Andrey Gura <ag...@apache.org> wrote:
> 
>> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> 
> I just quote your words: " this is a trigger to make a deeper
> investigation". And if we have two metrics that are triggered for the
> same then one of them is useless.
> 
>> How it relates to business operations?
> 
> Why it should be related with business operation? It is concrete
> metrics for concrete process which can slowdown all operations in the
> system. Obviously if you want know how fast or slow your business
> operation you must measure latency of your business operation. What
> could be easier?
> 
>> Is it become slower or faster?
> 
> Very correct question! User saw "cache put time" metric becomes x2
> bigger. Does it become slower or faster? Or we just put into the cache
> values that 4x bigger in size? Or all time before we put values
> locally and now we put values on remote nodes. Or our operations
> execute in transaction and then time will depend on transaction type,
> actions in transaction and other transaction and actually will nothing
> talk about real cache operation. We have more questions then answers.
> 
>> On the other hand - if `PutTime` increased - then we know for sure that all operations executing `put` became slower.
> 
> Of course not :) See above.
> 
> On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <ni...@apache.org> wrote:
>> 
>>> It also will be visible on other metrics
>> 
>> How will it be visible?
>> 
>> For example, the user saw «checkpoint time» metric becomes x2 bigger.
>> How it relates to business operations? Is it become slower or faster?
>> What does it mean for an application performance?
>> 
>> On the other hand - if `PutTime` increased - then we know for sure that all operations executing `put` became slower.
>> 
>> *Why* it’s become slower - is the essence of «go deeper» investigation.
>> 
>>> On Dec 20, 2019, at 15:07, Andrey Gura <ag...@apache.org> wrote:
>>> 
>>>> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
>>> 
>>> It also will be visible on other metrics. So cache operations metrics
>>> still useless because it transitive values.
>>> 
>>>>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
>>> 
>>>> We already implement it.
>>> 
>>> I don't talk that it isn't implemented. It is just example of things
>>> that should be measured. All other metrics depends on internals.
>>> 
>>>>> 2. Measure business operations in user context, not cache API operations.
>>> 
>>>> Why do you think these approaches should exclude one another?
>>> 
>>> Because one of them is useless.
>>> 
>>> On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <ni...@apache.org> wrote:
>>>> 
>>>> Hello, Andrey.
>>>> 
>>>>> Where the sense in this value? I explained why this metrics are relatively useless.
>>>> 
>>>> I don’t agree with you.
>>>> I believe they are not useless for a user.
>>>> And I try to explain why I think so.
>>>> 
>>>>> But user can't distinguish one transaction from another, so his knowledge doesn't make sense definitely.
>>>> 
>>>> Users shouldn’t distinguish.
>>>> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
>>>> 
>>>>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
>>>> 
>>>> We already implement it.
>>>> What metrics are missing for internal processes?
>>>> 
>>>>> 2. Measure business operations in user context, not cache API operations.
>>>> 
>>>> Why do you think these approaches should exclude one another?
>>>> Users definitely should measure whole business transaction performance.
>>>> 
>>>> I think we should provide a way to measure part of the business transaction that relates to the Ignite.
>>>> 
>>>> 
>>>>> On Dec 20, 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
>>>>> 
>>>>>> The goal of the proposed metrics is to measure whole cache operations behavior.
>>>>>> It provides some kind of statistics(histograms) for it.
>>>>> 
>>>>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
>>>>> 
>>>>>> Yes, metrics will evaluate API call performance
>>>>> 
>>>>> And what? Where the sense in this value? I explained why this metrics
>>>>> are relatively useless.
>>>>> 
>>>>>> These are metrics of client-side operation performance.
>>>>> 
>>>>> Again. It's just a number without any sense.
>>>>> 
>>>>>> I think a specific user has knowledge - what are his transactions.
>>>>> 
>>>>> May be. But user can't distinguish one transaction from another, so
>>>>> his knowledge doesn't make sense definitely.
>>>>> 
>>>>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
>>>>> 
>>>>> Actually not. The same caches can be involved  in a dozen of
>>>>> transactions and there are no ways to understand what transactions are
>>>>> slow or fast. It is useless.
>>>>> 
>>>>>> I disagree here.
>>>>>> If you have a better approach to measure cache operations performance - please, share your vision.
>>>>> 
>>>>> I already wrote about better approach. Two main points:
>>>>> 
>>>>> 1. Measure some important internals (WAL operations, checkpoint time,
>>>>> etc) because it can talk about real problems.
>>>>> 2. Measure business operations in user context, not cache API operations.
>>>>> 
>>>>> So  what we have? We have useless metrics that are doubled by useless
>>>>> histograms.
>>>>> 
>>>>> We should reconsider approach to metrics and performance measuring. It
>>>>> is hard and long task. There are no need to commit tons of useless
>>>>> metrics that just decrease performance.
>>>>> 
>>>>> Sorry for some sarcasm but I really believe in my opinion. Metrics
>>>>> problem exists very very long time and existing metrics discussed many
>>>>> times. No one can explain this metrics to users because it requires
>>>>> too many additional knowledge about internals. And metric  value
>>>>> itself depends on many aspects of internals. It leads to impossibility
>>>>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
>>>>> backward compatibility).
>>>>> 
>>>>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <ni...@gmail.com> wrote:
>>>>>> 
>>>>>> Hello, Andrey.
>>>>>> 
>>>>>> The goal of the proposed metrics is to measure whole cache operations behavior.
>>>>>> It provides some kind of statistics(histograms) for it.
>>>>>> For more fine-grained analysis one will be use tracing or other «go deeper» tools.
>>>>>> 
>>>>>>>> Measured for API calls on the caller node side
>>>>>>> Values will the same only for cases when node is remote relative to data
>>>>>> 
>>>>>> Yes, metrics will evaluate API call performance.
>>>>>> I think this is the most valuable information from a user's point of view.
>>>>>> 
>>>>>> Regular user wants to know how fast his cache operation performs.
>>>>>> And these metrics provide the answer.
>>>>>> 
>>>>>>> For regular data node (server node) timing will depend on answers for question:
>>>>>> 
>>>>>> I think these answers are always available.
>>>>>> I barely can imagine a scenario when one monitor «black box» cluster and don’t know it.
>>>>>> Even so, all answers are provided through system view we brought to the Ignite :)
>>>>>> 
>>>>>>> What is transaction commit or rollback time?
>>>>>> 
>>>>>> These are metrics of client-side operation performance.
>>>>>> 
>>>>>> I think a specific user has knowledge - what are his transactions.
>>>>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
>>>>>> I think it’s very valuable knowledge.
>>>>>> 
>>>>>>> It will be implemented for most types of messages.
>>>>>> 
>>>>>> Good, let’s do it?
>>>>>> 
>>>>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
>>>>>> 
>>>>>> I disagree here.
>>>>>> If you have a better approach to measure cache operations performance - please, share your vision.
>>>>>> 
>>>>>>> On Dec 19, 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:
>>>>>>> 
>>>>>>> From my point of view, Ignite should provide meaningful metrics for
>>>>>>> internal components that could be useful for monitoring and analysis.
>>>>>>> All suggested options are meaningless in a sense. Below I'll try
>>>>>>> explain why.
>>>>>>> 
>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
>>>>>>>> Implemented in [1], commit [2].
>>>>>>> 
>>>>>>> All cache operations in Ignite are distributed. So each value measured
>>>>>>> for some cache operation will vary depending on where actually
>>>>>>> operation is performed. Values will the same only for cases when node
>>>>>>> is remote relative to data (e.g. client node).
>>>>>>> 
>>>>>>> For regular data node (server node) timing will depend on answers for question:
>>>>>>> 
>>>>>>> - is node primary for particular key or not? (for all operations)
>>>>>>> - how many backups configured for the cache? (for put and remove)
>>>>>>> - what write synchronization mode is configured for particular cache?
>>>>>>> (for put and remove)
>>>>>>> - is readFromBackup enabled for the cache? (for get)
>>>>>>> 
>>>>>>> Both Ignite users and Ignite developers can't make any decision based
>>>>>>> on this metrics.
>>>>>>> 
>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
>>>>>>> 
>>>>>>> What is transaction commit or rollback time? How it calculates in
>>>>>>> Ignite now? What actions included into transaction? What actions not
>>>>>>> related with cache executed during transactions?
>>>>>>> 
>>>>>>> There is no any sense in time of transaction commit or rollback
>>>>>>> because there are no any way to understand what transaction was
>>>>>>> performed in particular period of time. Usually a lot of transactions
>>>>>>> and we can't to distinguish from each other.
>>>>>>> 
>>>>>>> Moreover, transaction usually treats as business operation. So only
>>>>>>> way to measure performance properly is measure business operation
>>>>>>> time. That is user should create own metrics set for some business
>>>>>>> API.
>>>>>>> 
>>>>>>> Further. What about cross cache transactions? At the moment tx
>>>>>>> commit/rollback time will be added to corresponding metrics per each
>>>>>>> cache evolved to the transaction. The *same time* for *each cache*.
>>>>>>> Absolutely meaningless.
>>>>>>> 
>>>>>>> Again, both Ignite users and Ignite developers can't make any decision
>>>>>>> based on this metrics. But users can create own metrics set.
>>>>>>> 
>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>>>> Ticket doesn't exist for it.
>>>>>>> 
>>>>>>> It will be implemented for most types of messages.
>>>>>>> 
>>>>>>> Metrics, application monitoring, performance analysis and measurement
>>>>>>> are a little harder than it sounds. Therefore, we must approach this
>>>>>>> issue more carefully.
>>>>>>> Blindly adding new types of metrics will not only not improve the
>>>>>>> situation, but will also worsen the overall performance of the system
>>>>>>> because metric calculation always on the hot path.
>>>>>>> 
>>>>>>> So, from my point of view, commits for get/put/remove and
>>>>>>> commit/rollback should be reverted.
>>>>>>> 
>>>>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <ns...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> I think these metrics are useful.
>>>>>>>> 
>>>>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>>>>>>>> Nikolay, could you take a look, please?
>>>>>>>> 
>>>>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
>>>>>>>> 
>>>>>>>> I have filed a ticket for it. [3]
>>>>>>>> 
>>>>>>>> [1] https://github.com/apache/ignite/pull/7141
>>>>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
>>>>>>>> 
>>>>>>>> On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <al...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I think they are very useful.
>>>>>>>>> 
>>>>>>>>> On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <ni...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello, Alexei.
>>>>>>>>>> 
>>>>>>>>>> Thanks for the link to the ticket, I labeled it with the IEP-35 label.
>>>>>>>>>> What do you think about proposed metrics set?
>>>>>>>>>> 
>>>>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <
>>>>>>>>>> alexey.scherbakoff@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Nikolay,
>>>>>>>>>>> 
>>>>>>>>>>> What about batch operations?
>>>>>>>>>>> 
>>>>>>>>>>> For messages processing the ticket does exist and even has an
>>>>>>>>>>> implementation from before new metrics API times [1]
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <ni...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hello, Igniters.
>>>>>>>>>>>> 
>>>>>>>>>>>> I want to provide the user answers to the following question: "How cache
>>>>>>>>>>>> API operations perform?"
>>>>>>>>>>>> It seems, we need to implements metrics for basic cache API operations
>>>>>>>>>>>> like get, put, remove for it.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think we should provide the following metrics:
>>>>>>>>>>>> 
>>>>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>>>>>>>>>> caller node side.
>>>>>>>>>>>> Implemented in [1], commit [2].
>>>>>>>>>>>> 
>>>>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>>>>>>>>>> caller node side [3].
>>>>>>>>>>>> 
>>>>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>>>>>>>> Ticket doesn't exist for it.
>>>>>>>>>>>> 
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>> 
>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
>>>>>>>>>>>> [2]
>>>>>>>>>>>> 
>>>>>>>>>> https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Alexei Scherbakov
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Alexei Scherbakov
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best wishes,
>>>>>>>> Amelchev Nikita
>>>>>> 
>>>> 
>> 


Re: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
> For example, the user saw «checkpoint time» metric becomes x2 bigger.

I am just quoting your words: "this is a trigger to make a deeper
investigation". And if we have two metrics that are triggered by the
same thing, then one of them is useless.

> How it relates to business operations?

Why should it be related to business operations? It is a concrete
metric for a concrete process which can slow down all operations in the
system. Obviously, if you want to know how fast or slow your business
operation is, you must measure the latency of your business operation.
What could be easier?

> Is it become slower or faster?

A very good question! The user sees that the "cache put time" metric
has become 2x bigger. Did the operation become slower or faster? Or do
we just put values into the cache that are 4x bigger in size? Or we
always put values locally before, and now we put values on remote
nodes. Or our operations execute in a transaction, and then the time
depends on the transaction type, the actions in the transaction and
other transactions, and actually says nothing about the real cache
operation. We have more questions than answers.

> On the other hand - if `PutTime` increased - then we know for sure that all operations executing `put` became slower.

Of course not :) See above.
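
To illustrate with made-up numbers (the latencies and the class name `PutTimeMix` below are assumptions, not measurements): suppose a local put takes 1 ms and a remote put takes 3 ms. If the workload moves from all-local puts to a 50/50 mix, the mean "put time" doubles although neither kind of put became slower.

```java
/** Hypothetical illustration: how the mean put time moves when only the local/remote mix changes. */
final class PutTimeMix {
    /**
     * Mean latency of a workload in which the given share of puts (0.0 to 1.0)
     * goes to remote nodes and the rest is served locally.
     */
    static double meanPutMillis(double localMillis, double remoteMillis, double remoteShare) {
        return (1.0 - remoteShare) * localMillis + remoteShare * remoteMillis;
    }

    public static void main(String[] args) {
        double local = 1.0;  // assumed latency of a local put, ms
        double remote = 3.0; // assumed latency of a remote put, ms

        System.out.println(meanPutMillis(local, remote, 0.0)); // prints 1.0
        System.out.println(meanPutMillis(local, remote, 0.5)); // prints 2.0
    }
}
```

The metric alone cannot tell this case apart from a real slowdown of every put.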

On Fri, Dec 20, 2019 at 3:20 PM Николай Ижиков <ni...@apache.org> wrote:
>
> > It also will be visible on other metrics
>
> How will it be visible?
>
> For example, the user saw «checkpoint time» metric becomes x2 bigger.
> How it relates to business operations? Is it become slower or faster?
> What does it mean for an application performance?
>
> On the other hand - if `PutTime` increased - then we know for sure that all operations executing `put` became slower.
>
> *Why* it’s become slower - is the essence of «go deeper» investigation.
>
> > On Dec 20, 2019, at 15:07, Andrey Gura <ag...@apache.org> wrote:
> >
> >> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
> >
> > It also will be visible on other metrics. So cache operations metrics
> > still useless because it transitive values.
> >
> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
> >
> >> We already implement it.
> >
> > I don't talk that it isn't implemented. It is just example of things
> > that should be measured. All other metrics depends on internals.
> >
> >>> 2. Measure business operations in user context, not cache API operations.
> >
> >> Why do you think these approaches should exclude one another?
> >
> > Because one of them is useless.
> >
> > On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <ni...@apache.org> wrote:
> >>
> >> Hello, Andrey.
> >>
> >>> Where the sense in this value? I explained why this metrics are relatively useless.
> >>
> >> I don’t agree with you.
> >> I believe they are not useless for a user.
> >> And I try to explain why I think so.
> >>
> >>> But user can't distinguish one transaction from another, so his knowledge doesn't make sense definitely.
> >>
> >> Users shouldn’t distinguish.
> >> If a cache has some percent of the relatively slow transaction this is a trigger to make a deeper investigation.
> >>
> >>> 1. Measure some important internals (WAL operations, checkpoint time, etc) because it can talk about real problems.
> >>
> >> We already implement it.
> >> What metrics are missing for internal processes?
> >>
> >>> 2. Measure business operations in user context, not cache API operations.
> >>
> >> Why do you think these approaches should exclude one another?
> >> Users definitely should measure whole business transaction performance.
> >>
> >> I think we should provide a way to measure part of the business transaction that relates to the Ignite.
> >>
> >>
> >>> On Dec 20, 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
> >>>
> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >>>> It provides some kind of statistics(histograms) for it.
> >>>
> >>> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
> >>>
> >>>> Yes, metrics will evaluate API call performance
> >>>
> >>> And what? Where the sense in this value? I explained why this metrics
> >>> are relatively useless.
> >>>
> >>>> These are metrics of client-side operation performance.
> >>>
> >>> Again. It's just a number without any sense.
> >>>
> >>>> I think a specific user has knowledge - what are his transactions.
> >>>
> >>> May be. But user can't distinguish one transaction from another, so
> >>> his knowledge doesn't make sense definitely.
> >>>
> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >>>
> >>> Actually not. The same caches can be involved  in a dozen of
> >>> transactions and there are no ways to understand what transactions are
> >>> slow or fast. It is useless.
> >>>
> >>>> I disagree here.
> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >>>
> >>> I already wrote about better approach. Two main points:
> >>>
> >>> 1. Measure some important internals (WAL operations, checkpoint time,
> >>> etc) because it can talk about real problems.
> >>> 2. Measure business operations in user context, not cache API operations.
> >>>
> >>> So  what we have? We have useless metrics that are doubled by useless
> >>> histograms.
> >>>
> >>> We should reconsider approach to metrics and performance measuring. It
> >>> is hard and long task. There are no need to commit tons of useless
> >>> metrics that just decrease performance.
> >>>
> >>> Sorry for some sarcasm but I really believe in my opinion. Metrics
> >>> problem exists very very long time and existing metrics discussed many
> >>> times. No one can explain this metrics to users because it requires
> >>> too many additional knowledge about internals. And metric  value
> >>> itself depends on many aspects of internals. It leads to impossibility
> >>> of interpretation. And it's good time to remove it (in AI 3.0 due to a
> >>> backward compatibility).
> >>>
> >>> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <ni...@gmail.com> wrote:
> >>>>
> >>>> Hello, Andrey.
> >>>>
> >>>> The goal of the proposed metrics is to measure whole cache operations behavior.
> >>>> It provides some kind of statistics(histograms) for it.
> >>>> For more fine-grained analysis one will be use tracing or other «go deeper» tools.
> >>>>
> >>>>>> Measured for API calls on the caller node side
> >>>>> Values will the same only for cases when node is remote relative to data
> >>>>
> >>>> Yes, metrics will evaluate API call performance.
> >>>> I think this is the most valuable information from a user's point of view.
> >>>>
> >>>> Regular user wants to know how fast his cache operation performs.
> >>>> And these metrics provide the answer.
> >>>>
> >>>>> For regular data node (server node) timing will depend on answers for question:
> >>>>
> >>>> I think these answers are always available.
> >>>> I barely can imagine a scenario when one monitor «black box» cluster and don’t know it.
> >>>> Even so, all answers are provided through system view we brought to the Ignite :)
> >>>>
> >>>>> What is transaction commit or rollback time?
> >>>>
> >>>> These are metrics of client-side operation performance.
> >>>>
> >>>> I think a specific user has knowledge - what are his transactions.
> >>>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
> >>>> I think it’s very valuable knowledge.
> >>>>
> >>>>> It will be implemented for most types of messages.
> >>>>
> >>>> Good, let’s do it?
> >>>>
> >>>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
> >>>>
> >>>> I disagree here.
> >>>> If you have a better approach to measure cache operations performance - please, share your vision.
> >>>>
> >>>>> On Dec 19, 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:
> >>>>>
> >>>>> From my point of view, Ignite should provide meaningful metrics for
> >>>>> internal components that could be useful for monitoring and analysis.
> >>>>> All suggested options are meaningless in a sense. Below I'll try
> >>>>> explain why.
> >>>>>
> >>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
> >>>>>> Implemented in [1], commit [2].
> >>>>>
> >>>>> All cache operations in Ignite are distributed. So each value measured
> >>>>> for some cache operation will vary depending on where actually
> >>>>> operation is performed. Values will the same only for cases when node
> >>>>> is remote relative to data (e.g. client node).
> >>>>>
> >>>>> For regular data node (server node) timing will depend on answers for question:
> >>>>>
> >>>>> - is node primary for particular key or not? (for all operations)
> >>>>> - how many backups configured for the cache? (for put and remove)
> >>>>> - what write synchronization mode is configured for particular cache?
> >>>>> (for put and remove)
> >>>>> - is readFromBackup enabled for the cache? (for get)
> >>>>>
> >>>>> Both Ignite users and Ignite developers can't make any decision based
> >>>>> on this metrics.
> >>>>>
> >>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
> >>>>>
> >>>>> What is transaction commit or rollback time? How it calculates in
> >>>>> Ignite now? What actions included into transaction? What actions not
> >>>>> related with cache executed during transactions?
> >>>>>
> >>>>> There is no any sense in time of transaction commit or rollback
> >>>>> because there are no any way to understand what transaction was
> >>>>> performed in particular period of time. Usually a lot of transactions
> >>>>> and we can't to distinguish from each other.
> >>>>>
> >>>>> Moreover, transaction usually treats as business operation. So only
> >>>>> way to measure performance properly is measure business operation
> >>>>> time. That is user should create own metrics set for some business
> >>>>> API.
> >>>>>
> >>>>> Further. What about cross cache transactions? At the moment tx
> >>>>> commit/rollback time will be added to corresponding metrics per each
> >>>>> cache evolved to the transaction. The *same time* for *each cache*.
> >>>>> Absolutely meaningless.
> >>>>>
> >>>>> Again, both Ignite users and Ignite developers can't make any decision
> >>>>> based on this metrics. But users can create own metrics set.
> >>>>>
> >>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
> >>>>>> Ticket doesn't exist for it.
> >>>>>
> >>>>> It will be implemented for most types of messages.
> >>>>>
> >>>>> Metrics, application monitoring, performance analysis and measurement
> >>>>> are a little harder than they sound. Therefore, we must approach this
> >>>>> issue more carefully.
> >>>>> Blindly adding new types of metrics will not improve the
> >>>>> situation and will also worsen the overall performance of the system,
> >>>>> because metric calculation is always on the hot path.
> >>>>>
> >>>>> So, from my point of view, commits for get/put/remove and
> >>>>> commit/rollback should be reverted.
> >>>>>
> >>>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <ns...@gmail.com> wrote:
> >>>>>>
> >>>>>> I think these metrics are useful.
> >>>>>>
> >>>>>> I have prepared PR [1] for commit and rollback histograms. [2]
> >>>>>> Nikolay, could you take a look, please?
> >>>>>>
> >>>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
> >>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
> >>>>>>
> >>>>>> I have filed a ticket for it. [3]
> >>>>>>
> >>>>>> [1] https://github.com/apache/ignite/pull/7141
> >>>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
> >>>>>>
> >>>>>> Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <al...@gmail.com>:
> >>>>>>>
> >>>>>>> I think they are very useful.
> >>>>>>>
> >>>>>>> Mon, Dec 16, 2019 at 10:51, Николай Ижиков <ni...@apache.org>:
> >>>>>>>
> >>>>>>>> Hello, Alexei.
> >>>>>>>>
> >>>>>>>> Thanks for the link to the ticket, I labeled it with the IEP-35 label.
> >>>>>>>> What do you think about the proposed metrics set?
> >>>>>>>>
> >>>>>>>>> On Dec 16, 2019, at 10:29, Alexei Scherbakov <alexey.scherbakoff@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Nikolay,
> >>>>>>>>>
> >>>>>>>>> What about batch operations?
> >>>>>>>>>
> >>>>>>>>> For message processing, a ticket does exist and even has an
> >>>>>>>>> implementation that predates the new metrics API [1]
> >>>>>>>>>
> >>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
> >>>>>>>>>
> >>>>>>>>> Mon, Dec 16, 2019 at 10:12, Николай Ижиков <ni...@apache.org>:
> >>>>>>>>>
> >>>>>>>>>> Hello, Igniters.
> >>>>>>>>>>
> >>>>>>>>>> I want to give the user an answer to the following question: "How do cache
> >>>>>>>>>> API operations perform?"
> >>>>>>>>>> It seems we need to implement metrics for basic cache API operations
> >>>>>>>>>> like get, put, remove for this.
> >>>>>>>>>>
> >>>>>>>>>> I think we should provide the following metrics:
> >>>>>>>>>>
> >>>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> >>>>>>>>>> caller node side.
> >>>>>>>>>> Implemented in [1], commit [2].
> >>>>>>>>>>
> >>>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
> >>>>>>>>>> caller node side [3].
> >>>>>>>>>>
> >>>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
> >>>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
> >>>>>>>>>> Ticket doesn't exist for it.
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
> >>>>>>>>>> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> >>>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>> Alexei Scherbakov
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>> Alexei Scherbakov
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best wishes,
> >>>>>> Amelchev Nikita
> >>>>
> >>
>

Re: Cache operations performance metrics

Posted by Николай Ижиков <ni...@apache.org>.
> It also will be visible on other metrics

How will it be visible?

For example, the user sees that the «checkpoint time» metric has become twice as large.
How does it relate to business operations? Did they become slower or faster?
What does it mean for application performance?

On the other hand, if `PutTime` increased, then we know for sure that all operations executing `put` became slower.

*Why* they became slower is the essence of the «go deeper» investigation.
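
To make the `PutTime` argument concrete, here is a minimal sketch of what a time histogram records: each measured duration is counted in the first bucket whose upper bound it does not exceed, so a shift of counts towards the higher buckets shows that `put` became slower. The class `LatencyHistogram`, its method names and the bucket bounds are hypothetical and chosen only for illustration; this is not the Ignite implementation.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-bucket latency histogram (illustration only, not Ignite code). */
final class LatencyHistogram {
    /** Upper bounds of the buckets in milliseconds, sorted in ascending order. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray counts;

    LatencyHistogram(long... bounds) {
        this.bounds = bounds.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Records one operation that took {@code millis} milliseconds. */
    void record(long millis) {
        int i = 0;

        // Find the first bucket whose upper bound is not exceeded.
        while (i < bounds.length && millis > bounds[i])
            i++;

        counts.incrementAndGet(i);
    }

    /** Returns a copy of the per-bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }
}
```

With bounds of 1, 10 and 100 ms, a caller would wrap each `put` in a timer and pass the elapsed time to `record`. Comparing two snapshots taken at different times then shows whether the distribution moved, without saying why it moved.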

> On Dec 20, 2019, at 15:07, Andrey Gura <ag...@apache.org> wrote:
> 
>> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
> 
> It will also be visible in other metrics. So cache operation metrics
> are still useless because they are transitive values.
> 
>>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can indicate real problems.
> 
>> We have already implemented it.
> 
> I'm not saying that it isn't implemented. It is just an example of things
> that should be measured. All other metrics depend on internals.
> 
>>> 2. Measure business operations in user context, not cache API operations.
> 
>> Why do you think these approaches should exclude one another?
> 
> Because one of them is useless.
> 


Re: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.

It will also be visible in other metrics. So cache operation metrics
are still useless because they are transitive values.

>> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can indicate real problems.

> We have already implemented it.

I'm not saying that it isn't implemented. It is just an example of things
that should be measured. All other metrics depend on internals.

>> 2. Measure business operations in user context, not cache API operations.

> Why do you think these approaches should exclude one another?

Because one of them is useless.

On Fri, Dec 20, 2019 at 1:43 PM Николай Ижиков <ni...@apache.org> wrote:
>
> Hello, Andrey.
>
> > Where is the sense in this value? I explained why these metrics are relatively useless.
>
> I don’t agree with you.
> I believe they are not useless for a user.
> And I will try to explain why I think so.
>
> > But the user can't distinguish one transaction from another, so his knowledge definitely doesn't make sense.
>
> Users don't need to distinguish them.
> If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.
>
> > 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can indicate real problems.
>
> We have already implemented it.
> What metrics are missing for internal processes?
>
> > 2. Measure business operations in user context, not cache API operations.
>
> Why do you think these approaches should exclude one another?
> Users definitely should measure whole business transaction performance.
>
> I think we should provide a way to measure the part of the business transaction that relates to Ignite.
>
>

Re: Cache operations performance metrics

Posted by Николай Ижиков <ni...@apache.org>.
Hello, Andrey.

> Where is the sense in this value? I explained why these metrics are relatively useless.

I don’t agree with you.
I believe they are not useless for a user.
And I will try to explain why I think so.

> But the user can't distinguish one transaction from another, so his knowledge definitely doesn't make sense.

Users don't need to distinguish them.
If a cache has some percentage of relatively slow transactions, this is a trigger for a deeper investigation.

> 1. Measure some important internals (WAL operations, checkpoint time, etc.) because they can indicate real problems.

We have already implemented it.
What metrics are missing for internal processes?

> 2. Measure business operations in user context, not cache API operations.

Why do you think these approaches should exclude one another?
Users definitely should measure whole business transaction performance.

I think we should provide a way to measure the part of the business transaction that relates to Ignite.
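
As an illustration of the two levels discussed here, the following sketch times a whole business operation and, nested inside it, only the part that goes to the cache. It is an application-side helper under stated assumptions: `OperationTimer`, its methods and the metric names used with it are hypothetical, and nothing in it is an Ignite API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical application-side timer (illustration only, not an Ignite API). */
final class OperationTimer {
    /** Total elapsed nanoseconds per metric name. */
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    /** Number of timed calls per metric name. */
    private final Map<String, LongAdder> calls = new ConcurrentHashMap<>();

    /** Runs {@code op} and adds its duration to the metric {@code name}. */
    <T> T time(String name, Supplier<T> op) {
        long start = System.nanoTime();

        try {
            return op.get();
        }
        finally {
            // Recorded even if the operation throws.
            totalNanos.computeIfAbsent(name, k -> new LongAdder()).add(System.nanoTime() - start);
            calls.computeIfAbsent(name, k -> new LongAdder()).increment();
        }
    }

    /** Returns the number of recorded calls, or 0 for an unknown name. */
    long calls(String name) {
        LongAdder a = calls.get(name);

        return a == null ? 0 : a.sum();
    }

    /** Returns the total recorded time in nanoseconds, or 0 for an unknown name. */
    long totalNanos(String name) {
        LongAdder a = totalNanos.get(name);

        return a == null ? 0 : a.sum();
    }
}
```

An application could wrap the whole business call in `time("placeOrder", ...)` and the cache calls inside it in `time("placeOrder.cache", ...)`. The difference between the two totals is then the time spent outside the cache, which is the split this message argues for.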


> On Dec 20, 2019, at 13:02, Andrey Gura <ag...@apache.org> wrote:
> 
>> The goal of the proposed metrics is to measure overall cache operation behavior.
>> They provide some kind of statistics (histograms) for it.
> 
> Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)
> 
>> Yes, metrics will evaluate API call performance
> 
> So what? Where is the sense in this value? I explained why these metrics
> are relatively useless.
> 
>> These are metrics of client-side operation performance.
> 
> Again. It's just a number without any sense.
> 
>> I think a specific user knows what his transactions are.
> 
> Maybe. But the user can't distinguish one transaction from another, so
> his knowledge definitely doesn't make sense.
> 
>> From these metrics, he can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
> 
> Actually not. The same caches can be involved in a dozen
> transactions and there is no way to understand which transactions are
> slow or fast. It is useless.
> 
>> I disagree here.
>> If you have a better approach to measure cache operations performance - please, share your vision.
> 
> I already wrote about a better approach. Two main points:
> 
> 1. Measure some important internals (WAL operations, checkpoint time,
> etc.) because they can indicate real problems.
> 2. Measure business operations in user context, not cache API operations.
> 
> So what do we have? We have useless metrics that are doubled by useless
> histograms.
> 
> We should reconsider the approach to metrics and performance measuring. It
> is a hard and long task. There is no need to commit tons of useless
> metrics that just decrease performance.
> 
> Sorry for some sarcasm, but I really believe in my opinion. The metrics
> problem has existed for a very long time and the existing metrics have been
> discussed many times. No one can explain these metrics to users because it
> requires too much additional knowledge about internals. And a metric value
> itself depends on many aspects of internals. This makes interpretation
> impossible. And it's a good time to remove them (in AI 3.0, due to
> backward compatibility).
> 
> On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <ni...@gmail.com> wrote:
>> 
>> Hello, Andrey.
>> 
>> The goal of the proposed metrics is to measure whole cache operations behavior.
>> It provides some kind of statistics(histograms) for it.
>> For more fine-grained analysis one will be use tracing or other «go deeper» tools.
>> 
>>>> Measured for API calls on the caller node side
>>> Values will the same only for cases when node is remote relative to data
>> 
>> Yes, metrics will evaluate API call performance.
>> I think this is the most valuable information from a user's point of view.
>> 
>> Regular user wants to know how fast his cache operation performs.
>> And these metrics provide the answer.
>> 
>>> For regular data node (server node) timing will depend on answers for question:
>> 
>> I think these answers are always available.
>> I barely can imagine a scenario when one monitor «black box» cluster and don’t know it.
>> Even so, all answers are provided through system view we brought to the Ignite :)
>> 
>>> What is transaction commit or rollback time?
>> 
>> These are metrics of client-side operation performance.
>> 
>> I think a specific user has knowledge - what are his transactions.
>> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»
>> I think it’s very valuable knowledge.
>> 
>>> It will be implemented for most types of messages.
>> 
>> Good, let’s do it?
>> 
>>> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.
>> 
>> I disagree here.
>> If you have a better approach to measure cache operations performance - please, share your vision.
>> 
>>> 19 дек. 2019 г., в 16:03, Andrey Gura <ag...@apache.org> написал(а):
>>> 
>>> From my point of view, Ignite should provide meaningful metrics for
>>> internal components that could be useful for monitoring and analysis.
>>> All suggested options are meaningless in a sense. Below I'll try
>>> explain why.
>>> 
>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
>>>>  Implemented in [1], commit [2].
>>> 
>>> All cache operations in Ignite are distributed. So each value measured
>>> for some cache operation will vary depending on where actually
>>> operation is performed. Values will the same only for cases when node
>>> is remote relative to data (e.g. client node).
>>> 
>>> For regular data node (server node) timing will depend on answers for question:
>>> 
>>> - is node primary for particular key or not? (for all operations)
>>> - how many backups configured for the cache? (for put and remove)
>>> - what write synchronization mode is configured for particular cache?
>>> (for put and remove)
>>> - is readFromBackup enabled for the cache? (for get)
>>> 
>>> Both Ignite users and Ignite developers can't make any decision based
>>> on this metrics.
>>> 
>>>> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].
>>> 
>>> What is transaction commit or rollback time? How it calculates in
>>> Ignite now? What actions included into transaction? What actions not
>>> related with cache executed during transactions?
>>> 
>>> There is no any sense in time of transaction commit or rollback
>>> because there are no any way to understand what transaction was
>>> performed in particular period of time. Usually a lot of transactions
>>> and we can't to distinguish from each other.
>>> 
>>> Moreover, transaction usually treats as business operation. So only
>>> way to measure performance properly is measure business operation
>>> time. That is user should create own metrics set for some business
>>> API.
>>> 
>>> Further. What about cross cache transactions? At the moment tx
>>> commit/rollback time will be added to corresponding metrics per each
>>> cache evolved to the transaction. The *same time* for *each cache*.
>>> Absolutely meaningless.
>>> 
>>> Again, both Ignite users and Ignite developers can't make any decision
>>> based on this metrics. But users can create own metrics set.
>>> 
>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>  Ticket doesn't exist for it.
>>> 
>>> It will be implemented for most types of messages.
>>> 
>>> Metrics, application monitoring, performance analysis and measurement
>>> are a a little harder than it sounds. Therefore, we must approach this
>>> issue more carefully.
>>> Blindly adding new types of metrics will not only not improve the
>>> situation, but will also worsen the overall performance of the system
>>> because metric calculation always on the hot path.
>>> 
>>> So, from my point of view, commits for get/put/remove and
>>> commit/rollback should be reverted.
>>> 
>>> On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <ns...@gmail.com> wrote:
>>>> 
>>>> I think these metrics are useful.
>>>> 
>>>> I have prepared PR [1] for commit and rollback histograms. [2]
>>>> Nikolay, could you take a look, please?
>>>> 
>>>> If you do not mind, I will try to add affinity-nodes cache metrics:
>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.
>>>> 
>>>> I have filed a ticket for it. [3]
>>>> 
>>>> [1] https://github.com/apache/ignite/pull/7141
>>>> [2] https://issues.apache.org/jira/browse/IGNITE-12450
>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12453
>>>> 
>>>> пн, 16 дек. 2019 г. в 11:07, Alexei Scherbakov <al...@gmail.com>:
>>>>> 
>>>>> I think they are very useful.
>>>>> 
>>>>> пн, 16 дек. 2019 г. в 10:51, Николай Ижиков <ni...@apache.org>:
>>>>> 
>>>>>> Hello, Alexei.
>>>>>> 
>>>>>> Thanks for the link on the ticket, lableled it with the IEP-35 label.
>>>>>> What do you think about proposed metrics set?
>>>>>> 
>>>>>>> 16 дек. 2019 г., в 10:29, Alexei Scherbakov <
>>>>>> alexey.scherbakoff@gmail.com> написал(а):
>>>>>>> 
>>>>>>> Nikolay,
>>>>>>> 
>>>>>>> What about batch operations?
>>>>>>> 
>>>>>>> For messages processing the ticket does exist and even has an
>>>>>>> implementation from before new metrics API times [1]
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-10418
>>>>>>> 
>>>>>>> пн, 16 дек. 2019 г. в 10:12, Николай Ижиков <ni...@apache.org>:
>>>>>>> 
>>>>>>>> Hello, Igniters.
>>>>>>>> 
>>>>>>>> I want to provide the user answers to the following question: "How cache
>>>>>>>> API operations perform?"
>>>>>>>> It seems, we need to implements metrics for basic cache API operations
>>>>>>>> like get, put, remove for it.
>>>>>>>> 
>>>>>>>> I think we should provide the following metrics:
>>>>>>>> 
>>>>>>>> * `get`, `put`, `remove` time histograms. Measured for API calls on the
>>>>>>>> caller node side.
>>>>>>>>  Implemented in [1], commit [2].
>>>>>>>> 
>>>>>>>> * `commit`, `rollback` time histograms. Measured for API calls on the
>>>>>>>> caller node side [3].
>>>>>>>> 
>>>>>>>> * histograms that measure the time of processing `get`, `put`, `remove`,
>>>>>>>> `commit`, `rollback` messages on affinity nodes(primary and backups).
>>>>>>>>  Ticket doesn't exist for it.
>>>>>>>> 
>>>>>>>> What do you think?
>>>>>>>> 
>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-12219
>>>>>>>> [2]
>>>>>>>> 
>>>>>> https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
>>>>>>>> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Best regards,
>>>>>>> Alexei Scherbakov
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> 
>>>>> Best regards,
>>>>> Alexei Scherbakov
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Best wishes,
>>>> Amelchev Nikita
>> 


Re: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
> The goal of the proposed metrics is to measure whole cache operations behavior.
> It provides some kind of statistics(histograms) for it.

Nikolay, reformulating doesn't make metrics more meaningful. Seriously :)

> Yes, metrics will evaluate API call performance

So what? What is the sense of this value? I explained why these metrics
are relatively useless.

> These are metrics of client-side operation performance.

Again. It's just a number without any meaning.

> I think a specific user has knowledge - what are his transactions.

Maybe. But the user can't distinguish one transaction from another, so
this knowledge definitely doesn't help.

> From these metrics it can answer on the question «If my transaction includes cacheXXX, how long it usually takes?»

Actually, no. The same caches can be involved in a dozen transactions,
and there is no way to understand which transactions are slow or fast.
It is useless.

> I disagree here.
> If you have a better approach to measure cache operations performance - please, share your vision.

I already wrote about a better approach. Two main points:

1. Measure some important internals (WAL operations, checkpoint time,
etc.) because they can point to real problems.
2. Measure business operations in the user's context, not cache API operations.

So what do we have? We have useless metrics that are duplicated by
useless histograms.

We should reconsider our approach to metrics and performance measurement.
It is a hard and long task. There is no need to commit tons of useless
metrics that just decrease performance.

Sorry for some sarcasm, but I really believe in my opinion. The metrics
problem has existed for a very long time, and the existing metrics have
been discussed many times. No one can explain these metrics to users
because that requires too much additional knowledge about internals. And
a metric value itself depends on many aspects of the internals. This
makes interpretation impossible. And it's a good time to remove them (in
AI 3.0, due to backward compatibility).

On Thu, Dec 19, 2019 at 9:09 PM Николай Ижиков <ni...@gmail.com> wrote:

Re: Cache operations performance metrics

Posted by Николай Ижиков <ni...@gmail.com>.
Hello, Andrey.

The goal of the proposed metrics is to measure the behavior of cache operations as a whole.
They provide statistics (histograms) for it.
For more fine-grained analysis, one will use tracing or other «go deeper» tools.

> > Measured for API calls on the caller node side
> Values will the same only for cases when node is remote relative to data

Yes, metrics will evaluate API call performance.
I think this is the most valuable information from a user's point of view.

A regular user wants to know how fast their cache operations perform.
And these metrics provide the answer.
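
To make the idea of a caller-side time histogram concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `LatencyHistogram` and `TimedCache` are hypothetical names (not Ignite classes), a `ConcurrentHashMap` stands in for the cache, and the bucket bounds of 1 ms, 10 ms and 100 ms are arbitrary.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical fixed-bucket histogram; not Ignite's own metric class. */
final class LatencyHistogram {
    /** Upper bounds of the buckets in nanoseconds, ascending. */
    private final long[] bounds;

    /** One counter per bound plus one overflow slot for larger values. */
    private final AtomicLongArray counts;

    LatencyHistogram(long... boundsNanos) {
        this.bounds = boundsNanos.clone();
        this.counts = new AtomicLongArray(bounds.length + 1);
    }

    /** Adds one measurement to the first bucket whose bound is >= the value. */
    void record(long nanos) {
        int i = 0;
        while (i < bounds.length && nanos > bounds[i])
            i++;
        counts.incrementAndGet(i);
    }

    /** Returns a copy of the current bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];
        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);
        return res;
    }
}

/** Times every call on the caller side; the map is a stand-in for a cache. */
final class TimedCache<K, V> {
    private final Map<K, V> delegate = new ConcurrentHashMap<>();

    final LatencyHistogram getTime = new LatencyHistogram(1_000_000L, 10_000_000L, 100_000_000L);
    final LatencyHistogram putTime = new LatencyHistogram(1_000_000L, 10_000_000L, 100_000_000L);

    V get(K key) {
        long start = System.nanoTime();
        try {
            return delegate.get(key);
        }
        finally {
            // The whole call is measured, including anything done remotely.
            getTime.record(System.nanoTime() - start);
        }
    }

    void put(K key, V val) {
        long start = System.nanoTime();
        try {
            delegate.put(key, val);
        }
        finally {
            putTime.record(System.nanoTime() - start);
        }
    }
}
```

Because the timer wraps the whole call, the recorded value covers everything the caller waits for, which is the property both sides of this discussion refer to.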

> For regular data node (server node) timing will depend on answers for question:

I think these answers are always available.
I can barely imagine a scenario in which someone monitors a «black box» cluster and doesn't know them.
Even so, all the answers are available through the system views we brought to Ignite :)

> What is transaction commit or rollback time?

These are metrics of client-side operation performance.

I think a specific user knows what their transactions are.
From these metrics they can answer the question «If my transaction includes cacheXXX, how long does it usually take?»
I think it’s very valuable knowledge.

> It will be implemented for most types of messages.

Good, let’s do it?

> So, from my point of view, commits for get/put/remove and commit/rollback should be reverted.

I disagree here.
If you have a better approach to measuring cache operation performance, please share your vision.

> On Dec 19, 2019, at 16:03, Andrey Gura <ag...@apache.org> wrote:


Re: Cache operations performance metrics

Posted by Andrey Gura <ag...@apache.org>.
From my point of view, Ignite should provide meaningful metrics for
internal components that could be useful for monitoring and analysis.
All suggested options are meaningless in a sense. Below I'll try to
explain why.

>* `get`, `put`, `remove` time histograms. Measured for API calls on the caller node side.
>    Implemented in [1], commit [2].

All cache operations in Ignite are distributed. So each value measured
for some cache operation will vary depending on where the operation is
actually performed. Values will be the same only when the node is
remote relative to the data (e.g. a client node).

For a regular data node (server node), timing will depend on the answers to these questions:

- is the node primary for the particular key or not? (for all operations)
- how many backups are configured for the cache? (for put and remove)
- what write synchronization mode is configured for the particular cache?
(for put and remove)
- is readFromBackup enabled for the cache? (for get)

Neither Ignite users nor Ignite developers can make any decision based
on these metrics.

> * `commit`, `rollback` time histograms. Measured for API calls on the caller node side [3].

What is transaction commit or rollback time? How is it calculated in
Ignite now? What actions are included in a transaction? What actions
unrelated to the cache are executed during transactions?

There is no sense in the transaction commit or rollback time because
there is no way to understand which transaction was performed in a
particular period of time. Usually there are a lot of transactions
and we can't distinguish them from each other.

Moreover, a transaction is usually treated as a business operation. So
the only way to measure performance properly is to measure business
operation time. That is, the user should create their own metrics set
for some business API.
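
A user-side metrics set of this kind could look roughly like the following plain Java sketch. Everything in it is an assumption made for illustration: `BusinessMetrics` and the operation name `transferMoney` are hypothetical, and no Ignite API is involved; the timed code block is where the application would start its transaction, touch its caches, and commit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical application-side metrics keyed by business operation name. */
final class BusinessMetrics {
    /** Call count and accumulated duration for one business operation. */
    static final class Stat {
        final LongAdder calls = new LongAdder();
        final LongAdder totalNanos = new LongAdder();
    }

    private final Map<String, Stat> stats = new ConcurrentHashMap<>();

    /** Runs the operation and attributes its full duration to the given name. */
    <T> T time(String opName, Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        }
        finally {
            Stat s = stats.computeIfAbsent(opName, k -> new Stat());
            s.calls.increment();
            s.totalNanos.add(System.nanoTime() - start);
        }
    }

    /** Number of recorded calls, or 0 if the operation was never timed. */
    long calls(String opName) {
        Stat s = stats.get(opName);
        return s == null ? 0 : s.calls.sum();
    }

    /** Accumulated duration in nanoseconds, or 0 if never timed. */
    long totalNanos(String opName) {
        Stat s = stats.get(opName);
        return s == null ? 0 : s.totalNanos.sum();
    }
}
```

A call such as `metrics.time("transferMoney", () -> doTransfer(from, to, amount))` (with `doTransfer` being the application's own method) would then attribute the time to the business operation rather than to any individual cache.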

Further. What about cross-cache transactions? At the moment, tx
commit/rollback time will be added to the corresponding metrics for each
cache involved in the transaction. The *same time* for *each cache*.
Absolutely meaningless.
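
The effect described here can be modeled with a small plain Java sketch. It is a simplified model of the behavior as stated in this message, not Ignite's actual code; `PerCacheCommitTime` and the cache names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model: one commit duration is credited to every cache in the tx. */
final class PerCacheCommitTime {
    private final Map<String, Long> totalNanosByCache = new LinkedHashMap<>();

    /** Records one commit; each involved cache receives the same full duration. */
    void onCommit(List<String> cachesInTx, long commitNanos) {
        for (String cache : cachesInTx)
            totalNanosByCache.merge(cache, commitNanos, Long::sum);
    }

    /** Accumulated commit time attributed to the cache, or 0 if none. */
    long total(String cache) {
        return totalNanosByCache.getOrDefault(cache, 0L);
    }
}
```

If a single 5 ms commit touches both `accounts` and `audit`, each cache is credited with 5 ms, so the per-cache totals add up to 10 ms for one 5 ms commit and neither number says how much of the time is attributable to which cache.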

Again, neither Ignite users nor Ignite developers can make any decision
based on these metrics. But users can create their own metrics set.

>* histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups).
>    Ticket doesn't exist for it.

It will be implemented for most types of messages.

Metrics, application monitoring, performance analysis and measurement
are a little harder than they sound. Therefore, we must approach this
issue more carefully.
Blindly adding new types of metrics will not only fail to improve the
situation, but will also worsen the overall performance of the system
because metric calculation is always on the hot path.

So, from my point of view, commits for get/put/remove and
commit/rollback should be reverted.

On Mon, Dec 16, 2019 at 5:39 PM Nikita Amelchev <ns...@gmail.com> wrote:

Re: Cache operations performance metrics

Posted by Nikita Amelchev <ns...@gmail.com>.
I think these metrics are useful.

I have prepared PR [1] for commit and rollback histograms. [2]
Nikolay, could you take a look, please?

If you do not mind, I will try to add affinity-nodes cache metrics:
>> * histograms that measure the time of processing `get`, `put`, `remove`, `commit`, `rollback` messages on affinity nodes(primary and backups). Ticket doesn't exist for it.

I have filed a ticket for it. [3]

[1] https://github.com/apache/ignite/pull/7141
[2] https://issues.apache.org/jira/browse/IGNITE-12450
[3] https://issues.apache.org/jira/browse/IGNITE-12453

On Mon, Dec 16, 2019 at 11:07, Alexei Scherbakov <al...@gmail.com> wrote:



-- 
Best wishes,
Amelchev Nikita

Re: Cache operations performance metrics

Posted by Alexei Scherbakov <al...@gmail.com>.
I think they are very useful.

On Mon, Dec 16, 2019 at 10:51, Николай Ижиков <ni...@apache.org> wrote:

> Hello, Alexei.
>
> Thanks for the link to the ticket; I labeled it with the IEP-35 label.
> What do you think about the proposed set of metrics?

-- 

Best regards,
Alexei Scherbakov

Re: Cache operations performance metrics

Posted by Николай Ижиков <ni...@apache.org>.
Hello, Alexei.

Thanks for the link to the ticket; I labeled it with the IEP-35 label.
What do you think about the proposed set of metrics?

> On Dec 16, 2019, at 10:29, Alexei Scherbakov <al...@gmail.com> wrote:
> 
> Nikolay,
> 
> What about batch operations?
> 
> For message processing, a ticket already exists and even has an
> implementation that predates the new metrics API [1].
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-10418
> 


Re: Cache operations performance metrics

Posted by Alexei Scherbakov <al...@gmail.com>.
Nikolay,

What about batch operations?

For message processing, a ticket already exists and even has an
implementation that predates the new metrics API [1].

[1] https://issues.apache.org/jira/browse/IGNITE-10418

On Mon, Dec 16, 2019 at 10:12, Николай Ижиков <ni...@apache.org> wrote:

> Hello, Igniters.
>
> I want to give users an answer to the following question: "How do cache
> API operations perform?"
> It seems we need to implement metrics for the basic cache API operations,
> such as get, put, and remove.
>
> I think we should provide the following metrics:
>
> * `get`, `put`, `remove` time histograms. Measured for API calls on the
> caller node side.
>     Implemented in [1], commit [2].
>
> * `commit`, `rollback` time histograms. Measured for API calls on the
> caller node side [3].
>
> * histograms that measure the time of processing `get`, `put`, `remove`,
> `commit`, `rollback` messages on affinity nodes (primary and backups).
>     No ticket exists for it yet.
>
> What do you think?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-12219
> [2] https://github.com/apache/ignite/commit/e66bbef97b2cef73a533ce8a506ec479852cb364
> [3] https://issues.apache.org/jira/browse/IGNITE-12450
>


-- 

Best regards,
Alexei Scherbakov