Posted to issues@lucene.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/01/07 16:19:00 UTC

[jira] [Commented] (SOLR-15059) Default Grafana dashboard needs to expose graphs for monitoring query performance

    [ https://issues.apache.org/jira/browse/SOLR-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260615#comment-17260615 ] 

ASF subversion and git services commented on SOLR-15059:
--------------------------------------------------------

Commit 8b55fb868de1fb8b82b8663d19285a63ac9ee7af in lucene-solr's branch refs/heads/master from Timothy Potter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8b55fb8 ]

SOLR-15059: Improve query performance monitoring (#2165)



> Default Grafana dashboard needs to expose graphs for monitoring query performance
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-15059
>                 URL: https://issues.apache.org/jira/browse/SOLR-15059
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Grafana Dashboard, metrics
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Major
>             Fix For: 8.8, master (9.0)
>
>         Attachments: Screen Shot 2020-12-23 at 10.22.43 AM.png
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> The default Grafana dashboard doesn't expose graphs for monitoring query performance. For instance, if I want to see QPS for a collection, that's not shown in the default dashboard. Same for quantiles like p95 query latency.
> After some digging, I found that these metrics are available in the output of {{/admin/metrics}} but are not exported by the exporter.
> This PR proposes to enhance the default dashboard with a new Query Metrics section with the following metrics:
> * Distributed QPS per Collection (aggregated across all cores)
> * Distributed QPS per Solr Node (aggregated across all base_url)
> * QPS 1-min rate per core
> * QPS 5-min rate per core
> * Top-level Query latency p99, p95, p75
> * Local (non-distrib) query count per core (this is important for determining if there is unbalanced load)
> * Local (non-distrib) query rate per core (1-min)
> * Local (non-distrib) p95 per core
> Also, the {{solr-exporter-config.xml}} uses {{jq}} queries to pull metrics from the output of {{/admin/metrics}}. This file is huge and contains a lot of {{jq}} boilerplate. Moreover, I'm introducing another 15-20 metrics in this PR, which only makes the file more verbose.
> Thus, I'm also introducing support for jq templates to reduce boilerplate, reduce syntax errors, and improve readability. For instance, the query metrics I'm adding to the config look like this:
> {code}
>           <str>
>             $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>           </str>
>           <str>
>             $jq:core-query(5minRate, endswith(".distrib.requestTimes"))
>           </str>
> {code}
> This avoids duplicating the complicated {{jq}} query for each metric. The templates are optional and should only be used if a given jq structure is repeated 3 or more times. Otherwise, inlining the jq query is still supported. Here's how the templates work:
> {code}
>   A regex with named groups is used to match template references to template + vars using the basic pattern:
>       $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )
>   For instance,
>       $jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)
>   TEMPLATE = core
>   UNIQUE = requests_total (unique suffix for this metric, results in a metric named "solr_metrics_core_requests_total")
>   KEYSELECTOR = endswith(".requestTimes") (filter to select the specific key for this metric)
>   METRIC = count
>   TYPE = COUNTER
>   Some templates may have a default type, so you can omit that from your template reference, such as:
>       $jq:core(requests_total, endswith(".requestTimes"), count)
>   This uses defaultType=COUNTER, since many uses of the core template are counts.
>   If a template reference omits the metric, then the unique suffix is used, for instance:
>       $jq:core-query(1minRate, endswith(".distrib.requestTimes"))
>   Creates a GAUGE metric (default type) named "solr_metrics_core_query_1minRate" using the 1minRate value from the selected JSON object.
> {code}
> So that people don't have to dig through the large diff on the config XML, here are the query metrics I'm adding to the exporter config using the templates:
> {code}
>           <str>
>             $jq:core-query(errors_1minRate, select(.key | endswith(".errors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(client_errors_1minRate, select(.key | endswith(".clientErrors")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(1minRate, select(.key | endswith(".distrib.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(5minRate, select(.key | endswith(".distrib.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(median_ms, select(.key | endswith(".distrib.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(p75_ms, select(.key | endswith(".distrib.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(p95_ms, select(.key | endswith(".distrib.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(p99_ms, select(.key | endswith(".distrib.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(mean_rate, select(.key | endswith(".distrib.requestTimes")), meanRate)
>           </str>
>           
>           <!-- Local (non-distrib) query metrics -->
>           <str>
>             $jq:core-query(local_1minRate, select(.key | endswith(".local.requestTimes")), 1minRate)
>           </str>
>           <str>
>             $jq:core-query(local_5minRate, select(.key | endswith(".local.requestTimes")), 5minRate)
>           </str>
>           <str>
>             $jq:core-query(local_median_ms, select(.key | endswith(".local.requestTimes")), median_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p75_ms, select(.key | endswith(".local.requestTimes")), p75_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p95_ms, select(.key | endswith(".local.requestTimes")), p95_ms)
>           </str>
>           <str>
>             $jq:core-query(local_p99_ms, select(.key | endswith(".local.requestTimes")), p99_ms)
>           </str>
>           <str>
>             $jq:core-query(local_mean_rate, select(.key | endswith(".local.requestTimes")), meanRate)
>           </str>
>           <str>
>             $jq:core-query(local_count, select(.key | endswith(".local.requestTimes")), count, COUNTER)
>           </str>
> {code}
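
The template-reference rules quoted above (named-group regex, metric falling back to the unique suffix, type falling back to the template's default) can be illustrated with a rough sketch. This is not the exporter's actual code: the regex, the parse_template_ref helper, the DEFAULT_TYPES table, and the rule used to build the metric name are all assumptions inferred from the examples in the description.

```python
import re
from typing import NamedTuple

# Hypothetical regex; the real pattern used by the exporter is not shown in the issue.
# It expects: $jq:<TEMPLATE>( <UNIQUE>, <KEYSELECTOR>, <METRIC>, <TYPE> )
# where METRIC and TYPE are optional and KEYSELECTOR ends with ")".
TEMPLATE_REF = re.compile(
    r"^\s*\$jq:(?P<template>[\w-]+)\(\s*"
    r"(?P<unique>[^,\s]+)\s*,\s*"
    r"(?P<keyselector>.+?\))\s*"
    r"(?:,\s*(?P<metric>[^,\s)]+)\s*)?"
    r"(?:,\s*(?P<type>[A-Z]+)\s*)?"
    r"\)\s*$"
)

# Default types as described in the issue text: "core" defaults to COUNTER,
# "core-query" defaults to GAUGE. The dict itself is an illustrative assumption.
DEFAULT_TYPES = {"core": "COUNTER", "core-query": "GAUGE"}


class TemplateRef(NamedTuple):
    template: str
    unique: str
    key_selector: str
    metric: str
    metric_type: str
    name: str


def parse_template_ref(text: str) -> TemplateRef:
    """Split a $jq:... reference into its parts and apply the defaults."""
    m = TEMPLATE_REF.match(text)
    if m is None:
        raise ValueError(f"not a template reference: {text!r}")
    template = m.group("template")
    unique = m.group("unique")
    # If the metric is omitted, the unique suffix doubles as the metric.
    metric = m.group("metric") or unique
    # If the type is omitted, fall back to the template's default type.
    metric_type = m.group("type") or DEFAULT_TYPES.get(template)
    if metric_type is None:
        raise ValueError(f"no type given and no default for template {template!r}")
    # Assumed naming rule, inferred from the two names quoted in the issue.
    name = f"solr_metrics_{template.replace('-', '_')}_{unique}"
    return TemplateRef(template, unique, m.group("keyselector"), metric, metric_type, name)


if __name__ == "__main__":
    print(parse_template_ref('$jq:core(requests_total, endswith(".requestTimes"), count, COUNTER)'))
    print(parse_template_ref('$jq:core-query(1minRate, endswith(".distrib.requestTimes"))'))
```

The lazy key-selector group is what lets the same pattern accept both the short form, endswith("..."), and the nested form, select(.key | endswith("...")), that appear in the config snippets; a key selector containing a top-level comma would need a real parser instead.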



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
