Posted to dev@nutch.apache.org by "Lewis John McGibbney (Jira)" <ji...@apache.org> on 2021/11/26 19:18:00 UTC

[jira] [Created] (NUTCH-2909) Standardize Nutch Metrics Counters

Lewis John McGibbney created NUTCH-2909:
-------------------------------------------

             Summary: Standardize Nutch Metrics Counters
                 Key: NUTCH-2909
                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
             Project: Nutch
          Issue Type: Improvement
          Components: metrics
    Affects Versions: 1.18
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.19


I revisited Nutch metrics counters and put some [metrics documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics] together for others to consult should they wish.

I thought a comprehensive collection of all Nutch counters would be useful, so I put together a [metrics table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable]. One (unintended) outcome was that it highlighted the variability in counter group names and metric names. For example:

*Metric Group*:
* _CleaningJobStatus_ - upper camel case
* _CrawlDB filter_ - inconsistent use of capitalization and space separated
* N/A - the DomainStatistics counters don't belong to a metric group
* _injector_ - lowercase named after the encapsulating Class
* _WebGraph.outlinks_ - inconsistent use of capitalization and period separated

The *Metric Names* are in much the same state... pretty much all over the place.

I am keen to bring some convention to the Nutch metrics definitions, but this is not all plain sailing. I do understand that existing users may rely upon the above metrics as they are, and changing the names would have impacts downstream.

*PROPOSAL*
I would like to discuss introducing a naming convention which follows some simple principles motivated by a [Datadog employee's response on SO|https://stackoverflow.com/a/18131221].

As a take on that post, I want to propose the following:

{quote}
1. With regard to the *Metric Group*, the highest level of hierarchy is the product line or the process, i.e., _*nutch*_. The highest level of hierarchy is always lowercase.
2. The next level of hierarchy is the sub-component/tool, e.g., *_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*, *_nutch.SitemapProcessor_*, etc. This constituent matches the name of the enclosing Class exactly. This way it is really simple to trace the metric back to the Class in which it was defined.
3. The third level of the hierarchy is the metric group, which is a general grouping of functionality for the metric being defined, e.g., *_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with words separated by underscores. If no obvious metric group exists, simply use the name of the enclosing Class in lowercase, e.g., the second _injector_ in *_nutch.Injector.injector.urls_filtered_*.
4. With regard to the *Metric Name*, the last level of hierarchy is the thing being measured, e.g., *_urls_filtered_*, *_above_exception_threshold_in_queue_*, etc. Everything is lowercase with words separated by underscores, the same as in #3 above.

Example complete metrics

* *_nutch.Injector.injector.urls_filtered_*
* *_nutch.ResolverThread.update_host_db.checked_hosts_*
* *_nutch.WebGraph.outlinks.added_links_*
{quote}
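
To make the four rules above concrete, here is a minimal sketch of a helper that assembles a metric name from its parts. The class name NutchMetricNames and its methods are hypothetical and are not part of Nutch; the sketch only illustrates the proposed convention and takes the enclosing Class as a plain string.

```java
import java.util.Locale;

/** Hypothetical helper illustrating the proposed convention; not part of Nutch. */
final class NutchMetricNames {

  /** Rule 1: the top level is always the lowercase product name. */
  private static final String ROOT = "nutch";

  private NutchMetricNames() {}

  /** Rules 3 and 4: lowercase, words separated by underscores. */
  static String toSnakeCase(String text) {
    return text.trim()
        .replaceAll("([a-z0-9])([A-Z])", "$1_$2") // split camelCase boundaries
        .replaceAll("[^A-Za-z0-9]+", "_")         // spaces, dots, hyphens -> underscore
        .replaceAll("^_+|_+$", "")                // drop leading/trailing underscores
        .toLowerCase(Locale.ROOT);
  }

  /**
   * Builds nutch.<ClassName>.<metric_group>.<metric_name>.
   * The class name is kept exactly as written (rule 2). If no group is
   * given, the lowercased class name is used as the group (rule 3).
   */
  static String of(String className, String group, String name) {
    String groupPart = (group == null || group.trim().isEmpty())
        ? className.toLowerCase(Locale.ROOT)
        : toSnakeCase(group);
    return ROOT + "." + className + "." + groupPart + "." + toSnakeCase(name);
  }
}
```

Under these assumptions, NutchMetricNames.of("Injector", null, "urls_filtered") yields nutch.Injector.injector.urls_filtered, and NutchMetricNames.of("WebGraph", "outlinks", "added links") yields nutch.WebGraph.outlinks.added_links.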

It would be greatly appreciated if folks could chime in on the details of the proposal. I'm sure there are several areas which could be improved. 

I will mention that my specific driver for cleaning this up is that I would like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler subsystem will be integrated with all the rest of the subsystems I am responsible for. We use Splunk for that kind of thing. I intend to do that by integrating the [Java statsd client|https://github.com/DataDog/java-dogstatsd-client], but I feel that comes after we clean up metrics and establish a metrics naming convention.
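
As a rough illustration of why a consistent, dot-separated name matters for this, the sketch below formats a counter increment in the plain StatsD text format (name:value|c) and sends it over UDP using only the JDK. It deliberately does not use the java-dogstatsd-client API; the class name StatsdCounterSketch, the host and the port are assumptions for illustration only.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; a real integration would use a StatsD client library. */
final class StatsdCounterSketch {

  private StatsdCounterSketch() {}

  /** Formats a counter increment as a StatsD line: <metric>:<delta>|c */
  static String counterLine(String metric, long delta) {
    return metric + ":" + delta + "|c";
  }

  /** Sends one line as a UDP datagram; host and port are supplied by the caller. */
  static void send(String host, int port, String line) throws IOException {
    byte[] payload = line.getBytes(StandardCharsets.UTF_8);
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.send(new DatagramPacket(
          payload, payload.length, InetAddress.getByName(host), port));
    }
  }
}
```

For example, StatsdCounterSketch.counterLine("nutch.Injector.injector.urls_filtered", 3) produces the line nutch.Injector.injector.urls_filtered:3|c, which could then be passed to send with whatever host and port the local StatsD agent listens on.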

Thanks for any input. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)