Posted to dev@cassandra.apache.org by Michael Burman <mi...@redhat.com> on 2018/02/22 14:10:33 UTC

Expensive metrics?

Hi,

I wanted to get some input from the mailing list before creating a JIRA 
and potential fixes. I'll get to the performance details in the latter 
part, but first there's an important question about where the write 
latency metric is recorded. Currently we measure writeLatency (and the 
metric write sampler..) in ColumnFamilyStore.apply(), and this is also 
the metric we then replicate to the Keyspace metrics etc.

This is an odd place for writeLatency. Not only is it in the hot path 
of Memtable modifications, it also does not measure the real write 
latency, since it completely ignores the CommitLog latency in that same 
process. Is the intention really to measure Memtable-modification 
latency only, or the actual write latencies?

Then the real issue: this single metric is the cause of huge overhead 
in Memtable processing. There are several metrics / events in the CFS 
apply method, including the metric sampler, storageHook reportWrite, 
colUpdateTimeDeltaHistogram and metric.writeLatency. These are not free 
at all when it comes to processing. I made a small JMH benchmark here: 
https://gist.github.com/burmanm/b5b284bc9f1d410b1d635f6d3dac3ade 
that I'll be referring to.
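
To give an idea of the shape of the measurement, here is a simplified 
sketch in the spirit of the gist (not its exact code) that benchmarks 
just the codahale Timer update:

    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Timer;

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.SECONDS)
    public class TimerOverheadBench
    {
        private Timer writeLatency;

        @Setup
        public void setup()
        {
            writeLatency = new Timer();
        }

        @Benchmark
        public void recordLatency()
        {
            // Two clock reads plus a histogram update per operation,
            // mirroring what wraps the Memtable modification today.
            long start = System.nanoTime();
            writeLatency.update(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        }
    }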

The most offending of all these metrics is the writeLatency metric. 
What it does is update the latency in codahale's Timer, doing a 
histogram update and then walking through all the parent metrics as 
well, which update the keyspace writeLatency and the global 
writeLatency. When measuring the performance of Memtable.put with a 
parameter of 1 partition (to reduce the ConcurrentSkipListMap search 
speed impact - that's a separate issue and takes a little longer to 
solve, although I've started to prototype something..) I see 1.3M ops/s 
on my machine with the metric enabled, and when it is disabled the 
performance climbs to 4M ops/s. So this single metric costs ~2/3 of the 
total throughput. That's insane. My perf stats indicate that the CPU is 
starved, as it can't get enough data in.
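
The per-write fan-out looks roughly like this (an illustrative sketch of 
the shape of the code, not the actual Cassandra classes):

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.Counter;
    import com.codahale.metrics.Timer;

    // Illustrative only: one addNano() on a table turns into a histogram
    // update for the table plus one more for each parent level
    // (keyspace, global), all inside the write path.
    public class LatencySketch
    {
        private final Timer latency = new Timer();
        private final Counter totalLatency = new Counter();
        private final List<LatencySketch> parents;

        public LatencySketch(LatencySketch... parents)
        {
            this.parents = Arrays.asList(parents);
        }

        public void addNano(long nanos)
        {
            latency.update(nanos, TimeUnit.NANOSECONDS);
            totalLatency.inc(nanos);
            for (LatencySketch parent : parents)
                parent.addNano(nanos);
        }
    }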

Removing the replication from TableMetrics to the Keyspace & global 
latencies at write time (and doing it when the metrics are requested 
instead) improves the performance to 2.1M ops/s on my machine. It's an 
improvement, but the overhead is still huge. Even when we pressure the 
ConcurrentSkipListMap with 100 000 partitions in one active Memtable, 
the performance drops by about ~40% due to this metric, so it's never free.
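
One possible shape for that read-time aggregation (a rough sketch of the 
idea, not my actual patch): the keyspace/global level only keeps 
references to the table-level metrics and sums them lazily when polled:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    import com.codahale.metrics.Counter;

    // Sketch: the write path touches only the table-level counter, and
    // the keyspace/global value is computed when a reader asks for it.
    public class ReadTimeAggregate
    {
        private final List<Counter> children = new CopyOnWriteArrayList<>();

        public void register(Counter tableCounter)
        {
            children.add(tableCounter);
        }

        public long getCount()
        {
            long sum = 0;
            for (Counter child : children)
                sum += child.getCount();
            return sum; // paid by the (rare) metrics reader, not every write
        }
    }

Histograms are trickier to merge than counters, but the principle is the 
same: pay the aggregation cost at read time.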

I did not find any discussion about replacing the metric processing 
with something faster, so has this been considered before? At least for 
these performance-sensitive ones. The other issue is obviously the use 
of System.nanoTime(), which by itself is very slow (two 
System.nanoTime() calls eat another ~1M ops/s from the performance).

My personal quick fix would be to move writeLatency to Keyspace.apply, 
change write-time aggregates to read-time processing (metrics are read 
less often than we write data) and maybe even switch nanoTime -> 
currentTimeMillis (even given its relative lack of precision). That is - 
if these metrics make any sense at all at the CFS level? Maybe they 
should be measured from the network processing time (including all the 
deserializations and such)? Especially if at some point the smarter 
threading / event-looping changes go forward (in which case writes might 
sleep in some "queue" for a while).

   - Micke


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Expensive metrics?

Posted by Chris Lohfink <cl...@apple.com>.
> our histograms when built with a Meter use an ExponentiallyDecayingReservoir but our histograms built directly use the DecayingEstimatedHistogramReservoir algorithm

Meters don't use a decaying reservoir, they use EWMA <https://github.com/dropwizard/metrics/blob/4.1-development/metrics-core/src/main/java/com/codahale/metrics/EWMA.java>. Did you mean Timers? They use the DecayingEH <https://github.com/apache/cassandra/blob/9a47974d6378b62dd2cdb2d3e509374b002caa2c/src/java/org/apache/cassandra/metrics/CassandraMetricsRegistry.java#L94> as well; nothing should be using the decaying reservoir atm afaik beyond the dynamic snitch.

values and recentValues give access to the EH beneath the decaying EH (lower memory, slightly faster than HDR, but with fewer features and possibly worse error bounds). The decaying EH was added in CASSANDRA-11752 <https://issues.apache.org/jira/browse/CASSANDRA-11752> because after moving from a reservoir (which randomly throws out data, meaning it could lose everything in the 99th percentile, for example - see CASSANDRA-5667) to histograms, the tools' output no longer had any recency, which made ad hoc analysis difficult. The flip side is that an "approximate last 1 min" view is lossy and inaccurate for tooling/graphing/SLAs, so both are supported.
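
For reference, the reservoir is a constructor-time choice in the 
codahale API, which is why two histograms exposed over JMX can decay 
differently (simplified example, not Cassandra's actual registry code):

    import com.codahale.metrics.ExponentiallyDecayingReservoir;
    import com.codahale.metrics.Histogram;
    import com.codahale.metrics.Timer;

    public class ReservoirChoice
    {
        // Whichever Reservoir is handed to the constructor decides the
        // decay/percentile behavior the JMX reader eventually sees.
        // Cassandra's registry passes its DecayingEstimatedHistogramReservoir
        // here instead of the codahale one.
        static final Timer timer = new Timer(new ExponentiallyDecayingReservoir());
        static final Histogram histogram = new Histogram(new ExponentiallyDecayingReservoir());
    }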

A little bit of the history of the metrics is in a talk I gave: https://www.youtube.com/watch?v=vcniEFmFY0E (it includes the histogram storage issues).

I think the metrics drastically changing every other year has made things increasingly complex, trying to maintain backwards support while not wanting to break everyone's tooling at the same time. If we could improve the performance of the metrics without changing everything (again), I think that would be ideal for us operations folk who consume the metrics. Could an attempt at tuning the existing implementation be made first, before another rewrite?

Chris



Re: Expensive metrics?

Posted by Michael Burman <mi...@redhat.com>.
Hi,

I wrote CASSANDRA-14281 for the initial idea and for where I ended up 
with my current prototype. It maintains the current layout of the JMX 
metrics, so it shouldn't be visible to users. "Shouldn't", because I 
couldn't really find any definition of our metrics. For example, our 
histograms built with a Meter use an ExponentiallyDecayingReservoir, but 
our histograms built directly use the DecayingEstimatedHistogramReservoir 
algorithm. So these two histograms behave differently, but the JMX 
requester has no idea which way a given one was built.

Also, recentValues() for example does not use either behavior, but 
actually follows two more, different behaviors (!). CASSANDRA-13642 
added recentValues(), which uses values() underneath. The ticket itself 
did not really describe what behavior should be exposed (luckily it went 
into 4.0, so it technically is not in any released version yet).
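
As far as I can tell it boils down to a per-bucket delta of the 
cumulative values() snapshot between calls - roughly this shape (my 
reading of it, heavily simplified):

    // Simplified reading of the recentValues() idea: return the per-bucket
    // delta of the cumulative values() snapshot since the previous call.
    public class RecentValuesSketch
    {
        private long[] previous = new long[0];

        public synchronized long[] recentValues(long[] currentValues)
        {
            long[] recent = new long[currentValues.length];
            for (int i = 0; i < currentValues.length; i++)
            {
                long prev = i < previous.length ? previous[i] : 0;
                recent[i] = currentValues[i] - prev;
            }
            previous = currentValues.clone();
            return recent;
        }
    }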

I doubt anyone really knows how our metrics are supposed to work. What 
if I change how they decay? Or should we use a single decay strategy for 
all our histograms (and related)?

I guess my ticket became suddenly slightly more complex ;)

   - Micke



Re: Expensive metrics?

Posted by Nate McCall <zz...@gmail.com>.
Hi Micke,
There is some good research in here - have you had a chance to create
some issues in Jira from this?



Re: Expensive metrics?

Posted by Michael Burman <mi...@redhat.com>.
Hi,

I was referring to this article by Shipilev (there are a few small 
issues missing from the URL you pasted):

https://shipilev.net/blog/2014/nanotrusting-nanotime/

And his lovely recommendation in it: "System.nanoTime is as bad as 
String.intern now: you can use it, but use it wisely." And Cassandra 
uses it quite a lot, in the write path at least. There isn't necessarily 
a better option for it in Java, but for that reason we shouldn't push it 
everywhere in the code "for fun".

   - Micke




Re: Expensive metrics?

Posted by Michael Burman <mi...@redhat.com>.
Hi,

I've looked at the metrics' expense at a higher level. It's around ~4% 
of the total CPU time on my machine. But the problem with that 
higher-level measurement is that it does not show waits. When I push 
writes to Cassandra (through CQL) I'm mostly seeing stalls according to 
the kernel-level measurements - that is, the CPU can't do work. IIRC 
(I'm on the wrong computer now - can't check the exact numbers) my CPU 
reports that about ~70% of the CPU cycles are wasted when running a 
write test against a live Cassandra instance. But at that level there's 
too much data to actually get into the details of what is blocking the 
CPU from getting things to execute - are they cache misses, branch 
misses (these are actually quite high) - and where they happen.

That's why I'm trying to look at the smaller parts first, because they 
are easier to measure and debug (even though that approach certainly 
can't solve architectural issues as easily). I do have some higher-level 
versions of that gist microbenchmark as well (I omitted them for 
clarity) and I intend to create more of them.

   - Micke




Re: Expensive metrics?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Hey Micke, very cool you're looking to improve C*'s performance, we would
absolutely benefit from it.

Have you done any other benchmarks besides the micro one to determine the
total effect of these metrics on the system overall?  Microbenchmarks are a
great way to tune small sections of code but they aren't a great starting
point.  It would be good if we could put some context around the idea by
benchmarking a tuned, single node (so there's less network overhead)
running on fast disks with compaction disabled so we can see what kind of
impact these metrics are adding.  Ideally we'd look at GC promotion and CPU
time using something like YourKit to identify the overall effect of the
metrics, so we can set our expectations and goals in a reasonable manner.
Happy to coordinate with you on this!

>

Re: Expensive metrics?

Posted by Jeremiah D Jordan <je...@gmail.com>.
re: nanoTime vs currentTimeMillis, there is a good blog post here about the timing of both and how your choice of Linux clock source can drastically affect the speed of the calls, also showing that in general on Linux there is no perf improvement for one over the other.
http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
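
If anyone wants to check this on their own hardware, a minimal JMH 
comparison of the two calls is easy to write (a sketch; as the post 
explains, the result depends heavily on the clock source):

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Thread)
    public class ClockBench
    {
        @Benchmark
        public long nanoTime()
        {
            return System.nanoTime();
        }

        @Benchmark
        public long currentTimeMillis()
        {
            return System.currentTimeMillis();
        }
    }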



Re: Expensive metrics?

Posted by Blake Eggleston <be...@apple.com>.
Hi Micke,

This is really cool, thanks for taking the time to investigate this. I believe the metrics around memtable insert time come in handy in identifying high partition contention in the memtable. I've been involved in at least one situation over the past year where we got actionable info from this metric. Reducing resolution to milliseconds is probably a no-go, since most things in this path should complete in less than a millisecond.
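
To make the resolution point concrete, here is a toy, self-contained sketch (the class and helper names are made up and this is not Cassandra code): an operation of roughly 50 microseconds reports 0 at millisecond resolution, while the nanoTime delta fed into a codahale Timer still carries the signal.

import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// Toy illustration only: a ~50 microsecond operation is invisible at
// millisecond resolution but perfectly visible when the nanoTime delta
// is recorded.
public class ResolutionDemo
{
    public static void main(String[] args)
    {
        Timer writeLatency = new MetricRegistry().timer("writeLatency");

        long startNanos = System.nanoTime();
        long startMillis = System.currentTimeMillis();
        spinForAboutFiftyMicros();
        long elapsedNanos = System.nanoTime() - startNanos;
        long elapsedMillis = System.currentTimeMillis() - startMillis;

        writeLatency.update(elapsedNanos, TimeUnit.NANOSECONDS);

        System.out.println("nanoTime delta:          " + elapsedNanos + " ns");  // roughly 50000
        System.out.println("currentTimeMillis delta: " + elapsedMillis + " ms"); // almost always 0
    }

    private static void spinForAboutFiftyMicros()
    {
        long deadline = System.nanoTime() + 50_000;
        while (System.nanoTime() < deadline)
            ; // busy wait standing in for a sub-millisecond memtable insert
    }
}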

Revisiting the use of the codahale metrics in this kind of hot path definitely seems like a good idea, though. I don't think it's something we've talked about a lot, and it looks like we could benefit from using something more specialized here. I think it's worth doing, especially since there won't be any major changes to how we do threading in 4.0. It's probably also worth opening a JIRA and investigating the calls to nanoTime. We need at least microsecond resolution here, but there could be something we haven't thought of, so it's worth a look.
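
As a purely illustrative sketch of what "something more specialized" could look like (a made-up class, not an existing Cassandra or codahale API): a fixed set of LongAdder buckets keeps the write path down to a single add, and the keyspace/global rollups can be computed from table-level snapshots when the metrics are actually read, along the lines Micke suggested.

import java.util.concurrent.atomic.LongAdder;

// Illustrative only: bucket each latency into a small fixed histogram of
// LongAdders (no locks, no per-update reservoir work) and let keyspace/global
// views sum the table-level buckets lazily at read time instead of updating
// parent metrics on every write.
public class CheapLatencyRecorder
{
    // Bucket i holds latencies in roughly [2^(i-1), 2^i) microseconds;
    // the last bucket absorbs overflow.
    private static final int BUCKETS = 20;
    private final LongAdder[] counts = new LongAdder[BUCKETS];

    public CheapLatencyRecorder()
    {
        for (int i = 0; i < BUCKETS; i++)
            counts[i] = new LongAdder();
    }

    // Hot path: one index computation and one LongAdder increment, nothing else.
    public void record(long latencyNanos)
    {
        long micros = latencyNanos / 1000;
        int bucket = micros <= 0 ? 0 : Math.min(64 - Long.numberOfLeadingZeros(micros), BUCKETS - 1);
        counts[bucket].increment();
    }

    // Read path: snapshot the buckets; a keyspace or global view would sum
    // the snapshots of its tables here rather than being updated per write.
    public long[] snapshot()
    {
        long[] snap = new long[BUCKETS];
        for (int i = 0; i < BUCKETS; i++)
            snap[i] = counts[i].sum();
        return snap;
    }
}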

Thanks,

Blake

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org