Posted to dev@nutch.apache.org by "Lewis John McGibbney (Jira)" <ji...@apache.org> on 2021/11/26 19:18:00 UTC

[jira] [Updated] (NUTCH-2909) Establish a metrics naming convention

     [ https://issues.apache.org/jira/browse/NUTCH-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2909:
----------------------------------------
    Summary: Establish a metrics naming convention  (was: Standardize Nutch Metrics Counters)

> Establish a metrics naming convention
> -------------------------------------
>
>                 Key: NUTCH-2909
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2909
>             Project: Nutch
>          Issue Type: Improvement
>          Components: metrics
>    Affects Versions: 1.18
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.19
>
>
> I revisited Nutch metrics counters and put some [metrics documentation|https://cwiki.apache.org/confluence/display/NUTCH/Metrics] together for others to consult should they wish.
> I thought a comprehensive collection of all Nutch Counters would be useful so I put together a [metrics table|https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-MetricsTable]. One (unintended) outcome was that it highlighted the variability in counter group names and metric names. For example:
> *Metric Group*:
> * _CleaningJobStatus_ - upper camel case
> * _CrawlDB filter_ - inconsistent use of capitalization and space separated
> * N/A - the DomainStatistics counters don't belong to a metric group
> * _injector_ - lowercase named after the encapsulating Class
> * _WebGraph.outlinks_ - inconsistent use of capitalization and period separated
> The *Metric Name* values are in much the same state: pretty much all over the place.
> I am keen to bring some convention to the Nutch metrics definitions but this is not all plain sailing. I do understand that existing users may rely upon the above metrics as they are, and that changing the values would have impacts downstream.
> *PROPOSAL*
> I would like to discuss introducing a naming convention which follows some simple principles motivated by a [Datadog employee's response on SO|https://stackoverflow.com/a/18131221].
> As a take on that post, I want to propose the following:
> {quote}
> 1. With regard to the *Metric Group*, the highest level of the hierarchy is the product line or the process, i.e., _*nutch*_. The highest level of the hierarchy is always lowercase.
> 2. The next level of the hierarchy is the sub-component/tool, e.g., *_nutch.Injector_*, *_nutch.Generator_*, *_nutch.ParseSegment_*, *_nutch.SitemapProcessor_*, etc. This constituent is exactly the name of the enclosing Class. This way it is really simple to trace the metric back to the Class in which it was defined.
> 3. The third level of the hierarchy is the metric group, which is a general grouping of functionality for the metric being defined, e.g., *_nutch.QueueFeeder.fetcher_status_*. This constituent is lowercase with words separated by underscores. If no obvious metric group exists, simply provide the name of the enclosing Class in lowercase, as in *_nutch.Injector.injector.urls_filtered_*.
> 4. With regard to the *Metric Name*, the last level of the hierarchy is the thing being measured, e.g., *_urls_filtered_*, *_above_exception_threshold_in_queue_*, etc. Everything is lowercase with words separated by underscores, the same as #3 above.
> Example complete metrics
> * *_nutch.Injector.injector.urls_filtered_*
> * *_nutch.ResolverThread.update_host_db.checked_hosts_*
> * *_nutch.WebGraph.outlinks.added_links_*
> {quote}
> It would be greatly appreciated if folks could chime in on the details of the proposal. I'm sure there are several areas which could be improved. 
> I will mention that my specific driver for cleaning this up is that I would like to push Nutch metrics into Enterprise Splunk so that the Nutch crawler subsystem will be integrated with all the rest of the subsystems I am responsible for. We use Splunk for that kind of thing. I intend to do that by implementing the [Java statsd client|https://github.com/DataDog/java-dogstatsd-client] but I feel that comes after we clean up metrics and establish a metrics naming convention.
> Thanks for any input. 
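
The four-level convention proposed in the quoted description could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class {{MetricNames}} and its methods {{of}}, {{snake}} and {{isValid}} are hypothetical and do not exist in Nutch, and the validation regex is one possible reading of the rules, not an agreed specification.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch) that builds names per the proposed convention. */
final class MetricNames {

  /** Level 1: the product, always lowercase. */
  private static final String PRODUCT = "nutch";

  /** Assumed shape: nutch.<ClassName>.<metric_group>.<metric_name> */
  private static final Pattern VALID = Pattern.compile(
      "nutch\\.[A-Z][A-Za-z0-9]*\\.[a-z0-9]+(_[a-z0-9]+)*\\.[a-z0-9]+(_[a-z0-9]+)*");

  private MetricNames() {}

  /**
   * Lowercases the input and replaces every run of non-alphanumeric characters
   * with a single underscore, e.g. "CrawlDB filter" becomes "crawldb_filter".
   * Camel-case words are not split: "urlsFiltered" becomes "urlsfiltered".
   */
  static String snake(String raw) {
    return raw.trim()
        .toLowerCase(Locale.ROOT)
        .replaceAll("[^a-z0-9]+", "_")
        .replaceAll("^_+|_+$", "");
  }

  /**
   * Builds a full metric name.
   *
   * @param className simple name of the enclosing class (level 2), used verbatim
   * @param group     metric group (level 3); if null or blank, falls back to the
   *                  class name in lowercase, as rule 3 of the proposal suggests
   * @param name      the thing being measured (level 4)
   */
  static String of(String className, String group, String name) {
    String level3 = (group == null || group.trim().isEmpty())
        ? className.toLowerCase(Locale.ROOT)
        : snake(group);
    return PRODUCT + "." + className + "." + level3 + "." + snake(name);
  }

  /** Checks a finished name against the assumed pattern above. */
  static boolean isValid(String metric) {
    return VALID.matcher(metric).matches();
  }
}
```

With this sketch, {{MetricNames.of("Injector", null, "urls filtered")}} yields {{nutch.Injector.injector.urls_filtered}}, and {{MetricNames.isValid("nutch.WebGraph.outlinks.added links")}} is false because of the space. In real code the first argument would presumably come from something like {{Injector.class.getSimpleName()}}; a plain string is used here only to keep the example self-contained.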



--
This message was sent by Atlassian Jira
(v8.20.1#820001)