Posted to dev@kudu.apache.org by Scott Reynolds <sd...@gmail.com> on 2018/07/02 22:19:31 UTC

Having trouble collecting metrics from tservers

List,

Struggling with collecting metrics from the tserver. We are attempting to
pull down per tablet,
allowed_metrics = [
    "rows_inserted",
    "rows_upserted",
    "rows_deleted",
    "scanner_rows_scanned",
    "upserts_as_updates",
    "rows_updated",
    "insertations_failed_dup_key"
]
by querying the metrics json endpoint.

We see the metrics being sent and they appear to be the same value.

I have the following questions:
1. Are metrics unique per `id` in the JSON payload?
2. How do others collect metrics for their clusters?

Any help would be appreciated, thanks!

Our while loop looks like this:
while not self.shutdown_event.is_set():
    try:
        collection_time = time()
        http_response = requests.get(
            "%s://localhost:%s/metrics" % (self.protocol, self.port),
            verify=False)
        # The payload is a list of entities (server, tablet, ...).
        for metric_type in http_response.json():
            metric_prefix = metric_type['type']
            for metric in metric_type['metrics']:
                if metric["name"] not in allowed_metrics:
                    continue
                full_name = metric_prefix + "." + metric["name"]
                for key, value in metric.items():
                    if key == "name":
                        continue
                    log.info("%s_%s -> %s", full_name, key, value)
                    try:
                        point = float(value)
                        tags = metric_type['attributes'].copy()
                        tags['id'] = metric_type['id']
                        self.metrics_client.gauge(
                            "%s_%s" % (full_name, key),
                            point,
                            timestamp=collection_time,
                            tags=tags)
                    except ValueError:
                        log.info("%s is not a number. Not sending", value)
        self.metrics_client.flush(timestamp=collection_time)
    except Exception as ex:
        # A format placeholder is needed for the exception to be logged.
        log.error("Failed to parse kudu metrics: %s", ex)
    log.info("Pausing for 10 seconds after processing metrics")
    self.shutdown_event.wait(10)

Re: Having trouble collecting metrics from tservers

Posted by Todd Lipcon <to...@cloudera.com.INVALID>.

On Mon, Jul 2, 2018 at 3:19 PM, Scott Reynolds <sd...@gmail.com>
wrote:

> List,
>
> Struggling with collecting metrics from the tserver. We are attempting to
> pull down per tablet,
> allowed_metrics = [
>     "rows_inserted",
>     "rows_upserted",
>     "rows_deleted",
>     "scanner_rows_scanned",
>     "upserts_as_updates",
>     "rows_updated",
>     "insertations_failed_dup_key"
>

I think there's a typo in the last entry above: it should be
"insertions_failed_dup_key".


> ]
> by querying the metrics json endpoint.
>

You might also add a
'?metrics=rows_inserted,rows_upserted,rows_deleted,...' parameter to the
HTTP request so it only fetches what you need. That would be a small CPU
savings.
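
A minimal sketch of building such a request URL, assuming the comma-separated
'metrics' query parameter described above. The helper name build_metrics_url,
the host default, and port 8050 in the usage line are illustrative assumptions,
not taken from the original code:

```python
from urllib.parse import urlencode

# Metric names from the original post, with the "insertions" spelling corrected.
ALLOWED_METRICS = [
    "rows_inserted",
    "rows_upserted",
    "rows_deleted",
    "scanner_rows_scanned",
    "upserts_as_updates",
    "rows_updated",
    "insertions_failed_dup_key",
]


def build_metrics_url(protocol, port, metric_names, host="localhost"):
    """Return a /metrics URL restricted to the given metric names.

    Hypothetical helper: the names are joined with commas and passed in
    the 'metrics' query parameter; commas are left unescaped.
    """
    query = urlencode({"metrics": ",".join(metric_names)}, safe=",")
    return "%s://%s:%s/metrics?%s" % (protocol, host, port, query)


if __name__ == "__main__":
    # Port 8050 is an assumption; use whatever port your tserver web UI listens on.
    print(build_metrics_url("http", 8050, ALLOWED_METRICS))
```

The resulting string can be passed to requests.get() in place of the bare
/metrics URL.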


>
> We see the metrics being sent and they appear to be the same value.
>

What do you mean by "same value" here? Do you mean that, in between two
successive fetches of the metrics, you see that the value does not change?
Or that, across two tablets, you see the same value even though you
expected different ones?


>
> I have the following questions:
> 1. Are metrics unique per `id` in the JSON payload?
>

Yes, each tablet has its own unique set of metrics. The metrics are reset
to zero when the tablet server restarts. For example, from one of our test
clusters:


    {
        "type": "tablet",
        "id": "a07568613ab84836b1a262e296ad982c",
        "attributes": {
            "partition": "HASH (l_orderkey) PARTITION 8",
            "table_name": "impala::tpch_kudu.lineitem",
            "table_id": "829587aa24284cec8d5173de99d42054"
        },
        "metrics": [
            {
                "name": "on_disk_data_size",
                "value": 30940899
            }
        ]
    },
    {
        "type": "tablet",
        "id": "85ff9afe469a4a43bdce03da5d7587a0",
        "attributes": {
            "partition": "HASH (c1, c2, c3) PARTITION 1",
            "table_name": "impala::functional_kudu.decimal_tiny",
            "table_id": "039fd0513ed74f4094faa7fc80d232a1"
        },
        "metrics": [
            {
                "name": "on_disk_data_size",
                "value": 685
            }
        ]
    },

...
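
Because values are tracked per tablet `id` and reset to zero on restart, a
collector may want to key samples by tablet and tolerate resets when computing
rates. The following is only a sketch under those assumptions; the function
names index_by_tablet and counter_deltas are hypothetical, and it assumes the
payload shape shown above with a numeric "value" field per metric:

```python
def index_by_tablet(payload, allowed):
    """Map (tablet id, metric name) -> value, for tablet entities only."""
    values = {}
    for entity in payload:
        if entity.get("type") != "tablet":
            continue
        for metric in entity.get("metrics", []):
            if metric.get("name") in allowed and "value" in metric:
                values[(entity["id"], metric["name"])] = metric["value"]
    return values


def counter_deltas(previous, current):
    """Return per-key increases between two samples of counter metrics.

    If a value went down, assume the tablet server restarted and the
    counter began again from zero, so the new value is the whole delta.
    Keys with no previous sample are skipped.
    """
    deltas = {}
    for key, value in current.items():
        old = previous.get(key)
        if old is None:
            continue
        deltas[key] = value - old if value >= old else value
    return deltas


if __name__ == "__main__":
    def sample(rows):
        # Hypothetical payload with one tablet entity and one counter.
        return [{"type": "tablet", "id": "a0", "attributes": {},
                 "metrics": [{"name": "rows_inserted", "value": rows}]}]

    allowed = {"rows_inserted"}
    before = index_by_tablet(sample(10), allowed)
    after = index_by_tablet(sample(25), allowed)
    print(counter_deltas(before, after))
```

This only makes sense for counters such as rows_inserted; a gauge such as
on_disk_data_size can legitimately decrease and should be reported as-is.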



> 2. How do others collect metrics for their clusters?
>

It's worth noting that, in recent versions, the tserver can be configured
to also log metrics to a file. See
https://kudu.apache.org/docs/administration.html#_diagnostics_logging for
some information about that. The format in this file is somewhat more
compact: metrics are not re-reported when they have not changed, for
example.


>
> Any help would be appreciated, thanks!
>
> Our while loop looks like this:
> while not self.shutdown_event.is_set():
>     try:
>         collection_time = time()
>         http_response = requests.get(
>             "%s://localhost:%s/metrics" % (self.protocol, self.port),
>             verify=False)
>         for metric_type in http_response.json():
>             metric_prefix = metric_type['type']
>             for metric in metric_type['metrics']:
>                 if metric["name"] not in allowed_metrics:
>                     continue
>                 full_name = metric_prefix + "." + metric["name"]
>                 for key, value in metric.items():
>                     if key == "name":
>                         continue
>                     log.info("%s_%s -> %s", full_name, key, value)
>                     try:
>                         point = float(value)
>                         tags = metric_type['attributes'].copy()
>                         tags['id'] = metric_type['id']
>                         self.metrics_client.gauge(
>                             "%s_%s" % (full_name, key),
>                             point,
>                             timestamp=collection_time,
>                             tags=tags)
>                     except ValueError:
>                         log.info("%s is not a number. Not sending", value)
>         self.metrics_client.flush(timestamp=collection_time)
>     except Exception as ex:
>         log.error("Failed to parse kudu metrics: %s", ex)
>     log.info("Pausing for 10 seconds after processing metrics")
>     self.shutdown_event.wait(10)
>

The above snippet looks pretty reasonable to me.

Todd
-- 
Todd Lipcon
Software Engineer, Cloudera