Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2019/05/29 02:55:13 UTC

[GitHub] [accumulo] EdColeman commented on issue #1133: Validate that hadoop2 metrics full covers legacy metrics so that legacy metrics can be removed

EdColeman commented on issue #1133: Validate that hadoop2 metrics full covers legacy metrics so that legacy metrics can be removed 
URL: https://github.com/apache/accumulo/issues/1133#issuecomment-496763845
 
 
   Sorry for the wall of text, but I wanted to provide you with additional background and some of my thoughts.  
   
   This is related to https://github.com/apache/accumulo/pull/1172, which both Christopher and I have been working on.  Removing the legacy metrics looks like it will also allow upgrading apache commons config, so that makes this a blocker for 2.0, and any help is very welcome.
   
   Christopher identified a couple of additional files for which I particularly wanted to check that there is some kind of hadoop2 metrics equivalent - anything that extends AbstractMetricsImpl, and especially the ThriftMetrics. You can look at pull request 1172 to see which files we have identified so far that appear to be unused once the legacy code path is eliminated.
   
   If they are not currently covered, then we should be able to add equivalent hadoop2 metrics as replacements (probably as a separate issue / pull request), which would likely be very desirable for releasing 2.0.
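   
   Purely as a hypothetical sketch of what such a replacement could look like (none of these class or metric names exist in Accumulo), an annotation-based hadoop2 (metrics2) source for a thrift-style timing would be roughly this, assuming the metrics system has already been initialized:
   
   ```java
   import org.apache.hadoop.metrics2.annotation.Metric;
   import org.apache.hadoop.metrics2.annotation.Metrics;
   import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
   import org.apache.hadoop.metrics2.lib.MutableRate;
   
   // Hypothetical sketch only: shows the general shape of an annotation-based
   // metrics2 source. Assumes DefaultMetricsSystem has already been initialized
   // elsewhere (e.g. DefaultMetricsSystem.initialize("Accumulo")).
   @Metrics(about = "Example replacement for a legacy thrift metric", context = "accumulo")
   public class ExampleThriftMetrics2 {
   
     // MutableRate tracks the number of operations and the average time,
     // which is roughly what the legacy per-operation timings report.
     @Metric("Time spent in an example thrift call")
     MutableRate exampleCall;
   
     public static ExampleThriftMetrics2 create() {
       // register() fills in the @Metric fields and starts publishing the source
       return DefaultMetricsSystem.instance().register("ExampleThriftMetrics2",
           "Example thrift metrics", new ExampleThriftMetrics2());
     }
   
     public void addExampleCallTime(long elapsedMillis) {
       exampleCall.add(elapsedMillis);
     }
   }
   ```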
   
   It may not be an exact match - I actually have some questions / reservations about the underlying calculations used in the legacy metrics. The way rolling averages and some other statistics are calculated is not familiar to me and I cannot find any references, but they seem like they could be optimizations / lightweight methods for approximating those calculations.  If we "replace" them with standard hadoop2 metrics or something from Apache Commons Math, then the calculation cost along the affected code paths should be reviewed so as not to impact performance.  As an example (and not implying that we would do this), there are ways to calculate standard deviation in one pass versus the naive method that keeps all the samples and uses two passes, or that otherwise suffers from stability problems; a small illustration follows. 
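   
   For illustration only (this is not code from either metrics implementation), Welford's one-pass algorithm is the kind of single-pass, sample-free calculation I mean - it avoids storing samples and is numerically more stable than the naive sum-of-squares approach:
   
   ```java
   // Illustration: Welford's one-pass running mean / standard deviation.
   public class RunningStats {
     private long n = 0;
     private double mean = 0.0;
     private double m2 = 0.0; // sum of squared deviations from the running mean
   
     public void add(double x) {
       n++;
       double delta = x - mean;
       mean += delta / n;
       m2 += delta * (x - mean);
     }
   
     public double mean() {
       return mean;
     }
   
     public double stdDev() {
       // sample standard deviation; 0.0 until we have at least two samples
       return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
     }
   }
   ```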
   
   Slight differences should not be a show stopper if we document the differences as well as explain the "upgraded" calculations. Another option would be to pull those calculation methods forward, but report them through hadoop2 metrics. My personal preference is for something standard that is backed by experts in numerical calculations, as long as everything else is equal.
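   
   As a hypothetical sketch of that second option (the class name and the legacyRollingAverage() method are made up), the existing calculation could be kept as-is and simply published through a metrics2 MetricsSource, which would still need to be registered with the metrics system:
   
   ```java
   import org.apache.hadoop.metrics2.MetricsCollector;
   import org.apache.hadoop.metrics2.MetricsSource;
   import org.apache.hadoop.metrics2.lib.Interns;
   
   // Hypothetical sketch: keep the legacy-style calculation but report its
   // value through a hadoop2 (metrics2) gauge instead of the legacy path.
   public class LegacyCalcMetricsSource implements MetricsSource {
   
     @Override
     public void getMetrics(MetricsCollector collector, boolean all) {
       collector.addRecord("legacyCalc")
           .setContext("accumulo")
           .addGauge(Interns.info("rollingAverage", "Legacy-style rolling average"),
               legacyRollingAverage());
     }
   
     // Placeholder for the existing legacy calculation we would carry forward.
     private double legacyRollingAverage() {
       return 0.0;
     }
   }
   ```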
   
   From a quick look, the hadoop2 metrics seem well thought out and efficiently implemented, and they are likely okay, but there may be hadoop-specific assumptions that make them less than ideal for some circumstances. 
   
   If we can get these issues examined and documented, then that will really clear the path for 2.0.  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services