Posted to user@cassandra.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2018/10/01 09:21:24 UTC

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Hello Carl,

Here is a message I sent to my team a few months ago. I hope it will be
helpful to you and to others on the list :). It might not be exhaustive, and
we were moving from C* 2.1 to C* 3+ in this case, thus skipping C* 2.2, but
if I remember correctly C* 2.2 is similar to C* 3.0 in terms of metrics.
Here it is, for what it's worth:

Quite a few things changed in the metrics reporter output between C* 2.1
and C* 3.0:
- ColumnFamily --> Table
- XXpercentile --> pXX
- 1MinuteRate --> m1_rate
- The metric name now comes before the keyspace (KS) and table names, along
with some other changes of this kind.
- As a result, the node indexes used in aggregations and aliases change
(in Graphite, for example).
- '.value' is no longer appended to the metric name for gauges; nothing
replaces it.
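
To make the renames listed above concrete, here is a minimal Python sketch
that rewrites a C* 2.1 per-table metric path into the C* 3.0 layout. It only
applies the rules from the list. The function name translate_table_metric,
the sample keyspace and table names, and the extension of the 1MinuteRate
rule to other NMinuteRate suffixes are assumptions made for illustration,
not something taken from Cassandra itself.

```python
import re

PREFIX = "org.apache.cassandra.metrics."


def translate_table_metric(old_path: str) -> str:
    """Rewrite a C* 2.1 per-table metric path into the C* 3.0 layout (sketch)."""
    if not old_path.startswith(PREFIX + "ColumnFamily."):
        raise ValueError("not a 2.1 per-table metric path: " + old_path)
    parts = old_path[len(PREFIX):].split(".")
    # Expected shape: ColumnFamily, keyspace, table, metric, optional suffix.
    if len(parts) not in (4, 5):
        raise ValueError("unexpected number of nodes: " + old_path)
    _, keyspace, table, metric, *suffix = parts

    # ColumnFamily --> Table, and the metric name moves before KS and table.
    new_parts = ["Table", metric, keyspace, table]
    if suffix:
        tail = suffix[0]
        # XXpercentile --> pXX
        tail = re.sub(r"^(\d+)percentile$", r"p\g<1>", tail)
        # 1MinuteRate --> m1_rate (assumed to generalize to N minutes)
        tail = re.sub(r"^(\d+)MinuteRate$", r"m\g<1>_rate", tail)
        # '.value' is dropped for gauges, nothing replaces it.
        if tail != "value":
            new_parts.append(tail)
    return PREFIX + ".".join(new_parts)


if __name__ == "__main__":
    print(translate_table_metric(
        PREFIX + "ColumnFamily.ks1.users.ReadLatency.95percentile"))
```

Running a few old dashboard targets through a helper like this is one way to
build a first draft of a migration list, which then needs to be checked
against what the new cluster actually reports.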

For example (graphite):

From
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
2, 3), 1, 7, 8, 9)

to
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
2, 3), 1, 8, 9, 10)
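
The index shift between the two expressions above can be checked with a small
Python model. This is a simplified stand-in for what
averageSeriesWithWildcards and aliasByNode do to a series name (drop the
nodes at the given positions, then pick nodes by index), not Graphite's
actual implementation. The helper names and the env/dc/host/keyspace/table
values are made up for the example.

```python
def drop_nodes(path, positions):
    """Remove the dot-separated nodes at the given positions."""
    nodes = path.split(".")
    return ".".join(n for i, n in enumerate(nodes) if i not in positions)


def pick_nodes(path, indexes):
    """Build an alias from the nodes at the given indexes."""
    nodes = path.split(".")
    return ".".join(nodes[i] for i in indexes)


# Hypothetical series names following the two layouts in this message.
old = ("cassandra.prod.dc1.host1.org.apache.cassandra.metrics."
       "ColumnFamily.ks1.users.ReadLatency.95percentile")
new = ("cassandra.prod.dc1.host1.org.apache.cassandra.metrics."
       "Table.ReadLatency.ks1.users.p95")

# Positions 2 and 3 ($dc and $host) are averaged away in both cases.
old_alias = pick_nodes(drop_nodes(old, {2, 3}), [1, 7, 8, 9])
new_alias = pick_nodes(drop_nodes(new, {2, 3}), [1, 8, 9, 10])

if __name__ == "__main__":
    print(old_alias)
    print(new_alias)
```

In this model the old indexes select env, keyspace, table and metric name,
while the new ones select env, keyspace, table and percentile, because the
metric name moved in front of the keyspace.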

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Sep 28, 2018 at 20:38, Carl Mueller
<ca...@smartthings.com.invalid> wrote:

> VERY NICE! Thank you very much
>
> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
> lyuben.todorov@instaclustr.com> wrote:
>
>> Nothing as fancy as a matrix, but here is a list of what jmxterm can see.
>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>
>> /lyubent
>>
>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>> <ca...@smartthings.com.invalid> wrote:
>>
>>> It's my understanding that metrics got heavily re-namespaced in JMX for
>>> 2.2 from 2.1
>>>
>>> Did anyone ever make a migration matrix/guide for conversion of old
>>> metrics to new metrics?
>>>
>>>
>>>

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.
Your dashboards are great. The only challenge is getting all the data to
feed them.


On Tue, Oct 16, 2018 at 1:45 PM Carl Mueller <ca...@smartthings.com>
wrote:

> metadata.csv: that helps a lot, thank you!
>
> On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:
>
>> I feel you on most of the troubles you faced; I've been facing most of
>> them too. Again, Datadog support can probably help you with most of those.
>> You should really consider sharing this feedback with them.
>>
>>> there is re-namespacing of the metric names in lots of cases, and these
>>> don't appear to be centrally documented, but maybe I haven't found the
>>> magic page.
>>
>> I don't know if this is the 'magic' page, but it's something:
>> https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv
>>
>>> There are sooooo many good stats.
>>
>>
>> Yes, and it's still improving. I love this about Cassandra. It's our job
>> to pick the relevant ones for each situation. I would not like Cassandra to
>> reduce the number of metrics exposed; we need to learn to handle them
>> properly. This is also the reason we designed 4 dashboards out of the box,
>> instead of the single overview dashboard that was present before. The goal
>> was to have everything we need for distinct scenarios:
>> - Overview - global health-check / anomaly detection
>> - Read Path - troubleshooting / optimizing read ops
>> - Write Path - troubleshooting / optimizing write ops
>> - SSTable Management - troubleshooting / optimizing
>> compaction/flushes/... anything related to sstables.
>>
>> We are also perfectly aware that they are far from perfect, but aiming
>> for perfection would only have kept us from ever releasing anything. Anyone
>> interested can now build missing dashboards or improve existing ones for
>> themselves and/or suggest improvements to Datadog :). I hope I'll do some
>> more of this work at some point in the future.
>>
>> Good luck,
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On Thu, Oct 4, 2018 at 21:21, Carl Mueller
>> <ca...@smartthings.com.invalid> wrote:
>>
>>> For 2.1.x we had a custom reporter that delivered metrics to Datadog's
>>> endpoint via HTTPS, bypassing the agent-imposed limit of 350 metrics. But
>>> integrating that required targeting the other shared libs in the Cassandra
>>> path, so the build is a bit of a pain when we update major versions.
>>>
>>> We are migrating our 2.1.x-specific dashboards. We will use
>>> agent-delivered metrics for the non-table metrics, and adapt the custom
>>> library to deliver the table-based ones at a slower rate than the "core"
>>> ones.
>>>
>>> Datadog is also super annoying because there doesn't appear to be
>>> anything that reports which metrics the agent is sending (the metric count
>>> can indicate whether a newly configured metric increased the count and is
>>> being reported, but it's still... a guess). There is also re-namespacing of
>>> the metric names in lots of cases, and these don't appear to be centrally
>>> documented, but maybe I haven't found the magic page.
>>>
>>> There are sooooo many good stats. We might also implement some facility
>>> to dynamically turn on the delivery of detailed metrics on the nodes.
>>>
>>> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>>
>>>> Hello Carl,
>>>>
>>>>> I guess we can use bean_regex to target specific metrics for the
>>>>> important tables anyway.
>>>>
>>>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>>>> We have a LOT of metrics available.
>>>>
>>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables
>>>>
>>>> I noticed this while I was working on providing default dashboards for
>>>> the Cassandra-Datadog integration. I was told by the Datadog team that it
>>>> would not be an issue for users and that I should not worry about it. As
>>>> you pointed out, per-table metrics quickly increase the total number of
>>>> metrics we need to collect.
>>>>
>>>> I believe you can set the following option: *"max_returned_metrics:
>>>> 1000"*. It raises the limit on the number of collected metrics and can
>>>> be used if metrics are missing. Be aware of the CPU utilization this
>>>> might imply. I believe this was greatly improved in dd-agent version 6+
>>>> (thanks to the Datadog teams for that), making this fully usable for
>>>> Cassandra. Off the top of my head, this option should go in the Datadog
>>>> integration's *cassandra.yaml* file.
>>>>
>>>> Also, do not hesitate to reach out to Datadog directly with this kind of
>>>> question. I have always been very happy with their support so far, and I
>>>> am sure they would guide you through this as well, probably better than
>>>> we can :). I imagine it also gives them feedback on what people are
>>>> struggling with.
>>>>
>>>> I am interested to know if you still have issues getting more metrics
>>>> (option above not working / CPU under too much load), as this would make
>>>> the dashboards we built mostly unusable for clusters with more tables. We
>>>> might then need to review the design.
>>>>
>>>> As a side note, I believe metrics are handled the same way across
>>>> versions: they get the same name/label for C* 2.1, 2.2 and 3+ on Datadog.
>>>> There is an abstraction layer that removes this complexity (if I remember
>>>> correctly; we built those dashboards a while ago).
>>>>
>>>> C*heers
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>>> France / Spain
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
>>>>
>>>> On Mon, Oct 1, 2018 at 19:38, Carl Mueller
>>>> <ca...@smartthings.com.invalid> wrote:
>>>>
>>>>> That's great too, thank you.
>>>>>
>>>>> The Datadog 350-metric limit is a PITA for tables once you get over 10
>>>>> tables, but I guess we can use bean_regex to target specific metrics
>>>>> for the important tables anyway.

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.
metadata.csv: that helps a lot, thank you!


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
I feel you on most of the troubles you faced; I've been facing most of
them too. Again, Datadog support can probably help you with most of those.
You should really consider sharing this feedback with them.

> there is re-namespacing of the metric names in lots of cases, and these
> don't appear to be centrally documented, but maybe I haven't found the
> magic page.

I don't know if this is the 'magic' page, but it's something:
https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv

> There are sooooo many good stats.


Yes, and it's still improving. I love this about Cassandra. It's our job
to pick the relevant ones for each situation. I would not like Cassandra to
reduce the number of metrics exposed; we need to learn to handle them
properly. This is also the reason we designed 4 dashboards out of the box,
instead of the single overview dashboard that was present before. The goal
was to have everything we need for distinct scenarios:
- Overview - global health-check / anomaly detection
- Read Path - troubleshooting / optimizing read ops
- Write Path - troubleshooting / optimizing write ops
- SSTable Management - troubleshooting / optimizing
compaction/flushes/... anything related to sstables.

We are also perfectly aware that they are far from perfect, but aiming
for perfection would only have kept us from ever releasing anything. Anyone
interested can now build missing dashboards or improve existing ones for
themselves and/or suggest improvements to Datadog :). I hope I'll do some
more of this work at some point in the future.

Good luck,
C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.
For 2.1.x we had a custom reporter that delivered metrics to Datadog's
endpoint via HTTPS, bypassing the agent-imposed limit of 350 metrics. But
integrating that required targeting the other shared libs in the Cassandra
path, so the build is a bit of a pain when we update major versions.

We are migrating our 2.1.x-specific dashboards. We will use
agent-delivered metrics for the non-table metrics, and adapt the custom
library to deliver the table-based ones at a slower rate than the "core"
ones.

Datadog is also super annoying because there doesn't appear to be anything
that reports which metrics the agent is sending (the metric count can
indicate whether a newly configured metric increased the count and is being
reported, but it's still... a guess). There is also re-namespacing of the
metric names in lots of cases, and these don't appear to be centrally
documented, but maybe I haven't found the magic page.

There are sooooo many good stats. We might also implement some facility to
dynamically turn on the delivery of detailed metrics on the nodes.

On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Hello Carl,
>
> I guess we can use bean_regex to do specific targetted metrics for the
>> important tables anyway.
>>
>
> Yes, this would work, but 350 is very limited for Cassandra dashboards. We
> have a LOT of metrics available.
>
> Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>>
>
> I noticed this while I was working on providing default dashboards for
> Cassandra-Datadog integration. I was told by Datadog team it would not be
> an issue for users, that I should not care about it. As you pointed out,
> per table metrics quickly increase the total number of metrics we need to
> collect.
>
> I believe you can set the following option: *"max_returned_metrics: 1000"* -
> it can be used if metrics are missing to increase the limit of the number
> of collected metrics. Be aware of CPU utilization that this might imply
> (greatly improved in dd-agent version 6+ I believe -thanks Datadog teams
> for that- making this fully usable for Cassandra). This option should go in
> the *cassandra.yaml* file for Cassandra integrations, off the top of my
> head.
>
> Also, do not hesitate to reach to Datadog directly for this kind of
> questions, I have always been very happy with their support so far, I am
> sure they would guide you through this as well, probably better than we can
> do :). It also provides them with feedback on what people are struggling
> with I imagine.
>
> I am interested to know if you still have issues getting more metrics
> (option above not working / CPU under too much load) as this would make the
> dashboards we built mostly unusable for clusters with more tables. We might
> then need to review the design.
>
> As a side note, I believe metrics are handled the same way cross version,
> they got the same name/label for C*2.1, 2.2 and 3+ on Datadog. There is an
> abstraction layer that removes this complexity (if I remember well, we
> built those dashboards a while ago).
>
> C*heers
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le lun. 1 oct. 2018 à 19:38, Carl Mueller
> <ca...@smartthings.com.invalid> a écrit :
>
>> That's great too, thank you.
>>
>> Datadog 350 metric limit is a PITA for tables once you get over 10
>> tables, but I guess we can use bean_regex to do specific targetted metrics
>> for the important tables anyway.
>>
>> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> Hello Carl,
>>>
>>> Here is a message I sent to my team a few months ago. I hope this will
>>> be helpful to you and more people around :). It might not be exhaustive and
>>> we were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but
>>> C*2.2 is similar to C*3.0 if I remember correctly in terms of metrics. Here
>>> it is for what it's worth:
>>>
>>> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
>>> - ColumnFamily --> Table
>>> - XXpercentile --> pXX
>>> - 1MinuteRate -->  m1_rate
>>> - metric name before KS and Table names and some other changes of this
>>> kind.
>>> - ^ aggregations / aliases indexes changed because of this (using
>>> graphite for example) ^
>>> - ‘.value’ is not appended in the metric name anymore for gauges,
>>> nothing instead.
>>>
>>> For example (graphite):
>>>
>>> From
>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
>>> 2, 3), 1, 7, 8, 9)
>>>
>>> to
>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
>>> 2, 3), 1, 8, 9, 10)
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> On Fri, Sep 28, 2018 at 20:38, Carl Mueller
>>> <ca...@smartthings.com.invalid> wrote:
>>>
>>>> VERY NICE! Thank you very much
>>>>
>>>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>>>> lyuben.todorov@instaclustr.com> wrote:
>>>>
>>>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>>>
>>>>> /lyubent
>>>>>
>>>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>>> <ca...@smartthings.com.invalid> wrote:
>>>>>
>>>>>> It's my understanding that metrics got heavily re-namespaced in JMX
>>>>>> for 2.2 from 2.1
>>>>>>
>>>>>> Did anyone ever make a migration matrix/guide for conversion of old
>>>>>> metrics to new metrics?
>>>>>>
>>>>>>
>>>>>>

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hello Carl,

> I guess we can use bean_regex to do specific targeted metrics for the
> important tables anyway.
>

Yes, this would work, but a limit of 350 metrics is very low for Cassandra
dashboards: we have a LOT of metrics available.
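
For what it's worth, here is a minimal sketch (Python, not taken from the
Datadog documentation) of how such a bean_regex could be built and checked
before putting it in the integration config. The bean name layout
(type=Table, keyspace=..., scope=..., name=...) and its property order are
assumptions to verify with a JMX browser on your own nodes; use
type=ColumnFamily on C*2.1. The helper name table_bean_regex and the
keyspace/table names are made up for illustration, and Python's regex engine
only approximates how the agent will evaluate the pattern.

```python
import re


def table_bean_regex(keyspace, tables, bean_type="Table"):
    """Build one regex matching the per-table metric beans of the given tables.

    Assumed bean name layout (check it on a real node):
    org.apache.cassandra.metrics:type=<Table|ColumnFamily>,keyspace=<ks>,scope=<table>,name=<metric>
    """
    # One alternation over the important tables keeps the filter list short.
    alternatives = "|".join(re.escape(table) for table in tables)
    return (
        r"org\.apache\.cassandra\.metrics:type=" + re.escape(bean_type)
        + ",keyspace=" + re.escape(keyspace)
        + ",scope=(" + alternatives + "),name=.*"
    )


# Hypothetical keyspace and table names.
pattern = table_bean_regex("ks1", ["users", "events"])

wanted = "org.apache.cassandra.metrics:type=Table,keyspace=ks1,scope=users,name=ReadLatency"
unwanted = "org.apache.cassandra.metrics:type=Table,keyspace=ks1,scope=audit_log,name=ReadLatency"

print(bool(re.fullmatch(pattern, wanted)))    # True: important table is collected
print(bool(re.fullmatch(pattern, unwanted)))  # False: other tables are filtered out
```

The resulting string is what would go in a bean_regex filter, so that only
the important tables count against the metric limit.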

> Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>

I noticed this while I was working on providing default dashboards for the
Cassandra-Datadog integration. The Datadog team told me it would not be an
issue for users and that I should not worry about it. As you pointed out,
per-table metrics quickly increase the total number of metrics we need to
collect.
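
To make the arithmetic explicit, here is a rough sketch. The figures (50
node-level metrics, 30 metrics per table) are purely hypothetical and only
illustrate how fast a fixed cap is reached; the real counts depend on which
metrics the integration config includes.

```python
def max_tables_within_limit(metric_limit, node_level_metrics, metrics_per_table):
    """Return how many tables can be fully monitored under a metric cap."""
    if metrics_per_table <= 0:
        raise ValueError("metrics_per_table must be positive")
    # Whatever is left after node-level metrics is shared between tables.
    remaining = metric_limit - node_level_metrics
    return max(remaining // metrics_per_table, 0)


# Hypothetical figures: 50 node-level metrics, 30 metrics per table.
print(max_tables_within_limit(350, 50, 30))   # prints 10
print(max_tables_within_limit(1000, 50, 30))  # prints 31
```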

I believe you can set the following option: *"max_returned_metrics: 1000"*.
It raises the limit on the number of collected metrics, so it can be used if
metrics are missing. Be aware of the extra CPU utilization this might imply
(greatly improved in dd-agent version 6+, I believe, which makes this fully
usable for Cassandra; thanks to the Datadog teams for that). Off the top of
my head, this option should go in the Datadog agent's *cassandra.yaml* file
for the Cassandra integration (not in Cassandra's own cassandra.yaml).

Also, do not hesitate to reach out to Datadog directly with this kind of
question. I have always been very happy with their support so far, and I am
sure they would guide you through this as well, probably better than we can
do :). I imagine it also gives them feedback on what people are struggling
with.

I am interested to know if you still have issues getting more metrics
(option above not working / CPU under too much load) as this would make the
dashboards we built mostly unusable for clusters with more tables. We might
then need to review the design.

As a side note, I believe metrics are handled the same way across versions:
they have the same name/label for C*2.1, 2.2 and 3+ on Datadog. There is an
abstraction layer that removes this complexity (if I remember well, we
built those dashboards a while ago).
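
For reporters without such a layer (Graphite for example), the renames
listed in the quoted message below could be scripted. Here is a rough sketch
in Python, assuming per-table paths of the form
<ks>.<table>.<metric>.<suffix> and keyspace/table names that contain no
dots. The function name convert_table_metric is hypothetical, only the
renames listed in that message are handled, and any other suffix is left
unchanged.

```python
import re

METRICS_PREFIX = "org.apache.cassandra.metrics."


def convert_table_metric(old_path):
    """Rewrite a C*2.1 per-table Graphite path into the C*3.0 layout."""
    head, marker, tail = old_path.partition(METRICS_PREFIX + "ColumnFamily.")
    if not marker:
        raise ValueError("not a 2.1 ColumnFamily path: " + old_path)

    parts = tail.split(".")
    if len(parts) != 4:
        raise ValueError("expected <ks>.<table>.<metric>.<suffix>: " + tail)
    keyspace, table, metric, suffix = parts

    percentile = re.fullmatch(r"(\d+)percentile", suffix)
    if percentile:
        suffix = "p" + percentile.group(1)   # XXpercentile --> pXX
    elif suffix == "1MinuteRate":
        suffix = "m1_rate"                   # 1MinuteRate --> m1_rate
    elif suffix == "value":
        suffix = ""                          # gauges: '.value' is dropped

    # ColumnFamily --> Table, and the metric name now comes before ks/table.
    new_parts = ["Table", metric, keyspace, table]
    if suffix:
        new_parts.append(suffix)
    return head + METRICS_PREFIX + ".".join(new_parts)


old = ("cassandra.$env.$dc.$host.org.apache.cassandra.metrics."
       "ColumnFamily.$ks.$table.ReadLatency.95percentile")
print(convert_table_metric(old))
```

The node indexes passed to aliasByNode / averageSeriesWithWildcards are not
touched by this sketch and still have to be adjusted by hand.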

C*heers
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Mon, Oct 1, 2018 at 19:38, Carl Mueller
<ca...@smartthings.com.invalid> wrote:

> That's great too, thank you.
>
> Datadog 350 metric limit is a PITA for tables once you get over 10 tables,
> but I guess we can use bean_regex to do specific targeted metrics for the
> important tables anyway.
>
> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:
>
>> Hello Carl,
>>
>> Here is a message I sent to my team a few months ago. I hope this will be
>> helpful to you and more people around :). It might not be exhaustive and we
>> were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2
>> is similar to C*3.0 if I remember correctly in terms of metrics. Here it is
>> for what it's worth:
>>
>> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
>> - ColumnFamily --> Table
>> - XXpercentile --> pXX
>> - 1MinuteRate -->  m1_rate
>> - metric name before KS and Table names and some other changes of this
>> kind.
>> - ^ aggregations / aliases indexes changed because of this (using
>> graphite for example) ^
>> - ‘.value’ is not appended in the metric name anymore for gauges, nothing
>> instead.
>>
>> For example (graphite):
>>
>> From
>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
>> 2, 3), 1, 7, 8, 9)
>>
>> to
>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
>> 2, 3), 1, 8, 9, 10)
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> On Fri, Sep 28, 2018 at 20:38, Carl Mueller
>> <ca...@smartthings.com.invalid> wrote:
>>
>>> VERY NICE! Thank you very much
>>>
>>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>>> lyuben.todorov@instaclustr.com> wrote:
>>>
>>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>>
>>>> /lyubent
>>>>
>>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>> <ca...@smartthings.com.invalid> wrote:
>>>>
>>>>> It's my understanding that metrics got heavily re-namespaced in JMX
>>>>> for 2.2 from 2.1
>>>>>
>>>>> Did anyone ever make a migration matrix/guide for conversion of old
>>>>> metrics to new metrics?
>>>>>
>>>>>
>>>>>

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

Posted by Carl Mueller <ca...@smartthings.com.INVALID>.
That's great too, thank you.

Datadog 350 metric limit is a PITA for tables once you get over 10 tables,
but I guess we can use bean_regex to do specific targeted metrics for the
important tables anyway.

On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Hello Carl,
>
> Here is a message I sent to my team a few months ago. I hope this will be
> helpful to you and more people around :). It might not be exhaustive and we
> were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2
> is similar to C*3.0 if I remember correctly in terms of metrics. Here it is
> for what it's worth:
>
> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
> - ColumnFamily --> Table
> - XXpercentile --> pXX
> - 1MinuteRate -->  m1_rate
> - metric name before KS and Table names and some other changes of this
> kind.
> - ^ aggregations / aliases indexes changed because of this (using graphite
> for example) ^
> - ‘.value’ is not appended in the metric name anymore for gauges, nothing
> instead.
>
> For example (graphite):
>
> From
> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
> 2, 3), 1, 7, 8, 9)
>
> to
> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
> 2, 3), 1, 8, 9, 10)
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Fri, Sep 28, 2018 at 20:38, Carl Mueller
> <ca...@smartthings.com.invalid> wrote:
>
>> VERY NICE! Thank you very much
>>
>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>> lyuben.todorov@instaclustr.com> wrote:
>>
>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>
>>> /lyubent
>>>
>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>> <ca...@smartthings.com.invalid> wrote:
>>>
>>>> It's my understanding that metrics got heavily re-namespaced in JMX for
>>>> 2.2 from 2.1
>>>>
>>>> Did anyone ever make a migration matrix/guide for conversion of old
>>>> metrics to new metrics?
>>>>
>>>>
>>>>