You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/07/26 13:08:48 UTC

[GitHub] [hbase] bbeaudreault commented on a diff in pull request #4638: HBASE-27224 HFile tool statistic sampling produces misleading results

bbeaudreault commented on code in PR #4638:
URL: https://github.com/apache/hbase/pull/4638#discussion_r929942907


##########
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java:
##########
@@ -640,49 +722,39 @@ public String toString() {
       if (prevCell == null) return "no data available for statistics";
 
       // Dump the metrics to the output stream
-      simpleReporter.stop();

Review Comment:
   Yea, so this previously had SimpleReporter extend ScheduledReporter. In my opinion this was overkill. The benefits of ScheduledReporter are:
   
   - it has a thread which can print the statistics periodically over time. You need to call reporter.start() for that.
   - It can do various filters on the metrics, which is mostly useful when you have larger shared MetricsRegistry and maybe only care about certain metrics.
   - It can simplify the collection of multiple different metric types (counters, gauges, etc). Including some extra pre-processing that can be applied.
   
   We weren't using any of these features, so I think it was unnecessary to use ScheduledReporter. I would have been fine to keep it in place just to simplify this PR, but there were issues with ScheduledReporter. 
   
   Our primary goal here was to add a global min/max for each of the metrics collected. Since ScheduledReporter is based around processing codahale metrics, it's hard to supplement those metrics with additional metadata. I think the "codahale-native" way for me to do this would have been to add a new Gauge for all of the existing Histograms. Then, in the `report` method below we'd get those gauges alongside the histograms. The annoying thing with that is we'd have to manually match up the names somehow, and the names have to be unique. So maybe I'd name the histogram `Key length` and then the gauge `Key length - max` and have to do some string munging to link the 2 together. 
   
   That felt pretty hacky and all for the purposes of keeping a thing we don't need/use anyway. I also tried some other hacks to keep it, liking having a static map which kept them together, but this felt similarly hacky. So that's why I ripped out the ScheduledReporter.
   
   With ScheduledReporter, the `report` method is triggered when you call `stop` (as well as periodically by the reporting thread if you had called `start()`, which we don't). So you're correct, that's why we used to have to call `stop()`. Now that we don't rely on ScheduledReporter, we don't need to call that at all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org