Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/08/08 13:26:22 UTC

[GitHub] [hbase] bbeaudreault commented on pull request #4638: HBASE-27224 HFile tool statistic sampling produces misleading results

bbeaudreault commented on PR #4638:
URL: https://github.com/apache/hbase/pull/4638#issuecomment-1208128794

   @cbaenziger thanks for the input. I made changes per your request. Here's the updated format:
   
   ```
      Key length:
                  min = 29
                  max = 29
                 mean = 29.00
               stddev = 0.00
               median = 29.00
                 75% <= 29.00
                 95% <= 29.00
                 98% <= 29.00
                 99% <= 29.00
               99.9% <= 29.00
                count = 1000
              (range <= count):
                  10 <= 0
                  50 <= 1000
      Val length:
                  min = 3
                  max = 3
                 mean = 3.00
               stddev = 0.00
               median = 3.00
                 75% <= 3.00
                 95% <= 3.00
                 98% <= 3.00
                 99% <= 3.00
               99.9% <= 3.00
                count = 1000
              (range <= count):
                   1 <= 0
                   3 <= 1000
      Row size (bytes):
                  min = 40
                  max = 40
                 mean = 40.00
               stddev = 0.00
               median = 40.00
                 75% <= 40.00
                 95% <= 40.00
                 98% <= 40.00
                 99% <= 40.00
               99.9% <= 40.00
                count = 1000
              (range <= count):
                  10 <= 0
                  50 <= 1000
      Row size (columns):
                  min = 1
                  max = 1
                 mean = 1.00
               stddev = 0.00
               median = 1.00
                 75% <= 1.00
                 95% <= 1.00
                 98% <= 1.00
                 99% <= 1.00
               99.9% <= 1.00
                count = 1000
              (range <= count):
                   1 <= 1000
   
   
   Key of biggest row: row_00000000
   ```
   
   Note that the new `(range <= count)` sections only show up if you pass the new `-d` arg.
   
   To your last question:
   > Lastly, this may be a question/solution chasing a problem, but will this show a bi-modal set of ranges clearly (e.g. if I have keys of 50 bytes and keys of 5,000 bytes only) or will the elided bounding ranges be needed to point that out? Or is that what [line 845](https://github.com/apache/hbase/pull/4638/commits/820104f375217e2cad2cfa0d3f3c4631a6ec6599#diff-35423d30ce8e7b499844db2cd71563dea12517322eb476239bc7867ccd96ac6dL840-L841) is doing already?
   
   Yes, line 845 attempts to provide some context without printing every zero-count range between two values. Printing all ranges would probably be the clearest option, but it would also be unnecessarily verbose in most cases. So I tried to print only the context that is needed; I think it should be pretty intuitive once someone has seen the output for a few files.
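
To make the bi-modal case concrete, here is a minimal sketch of one elision rule that is consistent with the sample output above. It is not the code from the PR. The class name `RangeCountSketch`, the bucket bounds in `UPPER_BOUNDS`, and the exact rule (print a range if its count is non-zero, or if it is a zero-count range directly before a non-zero one) are all assumptions made for illustration, as is treating each count as per-range rather than cumulative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the HFile tool implementation. */
final class RangeCountSketch {

  // Assumed bucket upper bounds, picked so the sample output above can be reproduced.
  static final long[] UPPER_BOUNDS = {1, 3, 10, 50, 100, 500, 1000, 5000, 10000};

  /** Counts values per range: a value lands in the first range whose upper bound is >= value. */
  static long[] countByRange(long[] values) {
    long[] counts = new long[UPPER_BOUNDS.length];
    for (long value : values) {
      for (int i = 0; i < UPPER_BOUNDS.length; i++) {
        if (value <= UPPER_BOUNDS[i]) {
          counts[i]++;
          break;
        }
      }
      // Values above the last bound are ignored in this sketch.
    }
    return counts;
  }

  /**
   * Formats "upperBound <= count" lines, keeping a range when it is populated
   * or when it is an empty range immediately before a populated one (context).
   */
  static List<String> format(long[] counts) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < counts.length; i++) {
      boolean populated = counts[i] > 0;
      boolean precedesPopulated = i + 1 < counts.length && counts[i + 1] > 0;
      if (populated || precedesPopulated) {
        lines.add(UPPER_BOUNDS[i] + " <= " + counts[i]);
      }
    }
    return lines;
  }

  public static void main(String[] args) {
    // Bi-modal input: 500 keys of 50 bytes and 500 keys of 5,000 bytes.
    long[] keyLengths = new long[1000];
    for (int i = 0; i < keyLengths.length; i++) {
      keyLengths[i] = (i < 500) ? 50 : 5000;
    }
    for (String line : format(countByRange(keyLengths))) {
      System.out.println(line);
    }
  }
}
```

Under these assumptions, the bi-modal input prints `10 <= 0`, `50 <= 500`, `1000 <= 0`, `5000 <= 500`: the empty ranges between 50 and 1000 are dropped, but the zero-count line before each populated range still shows that the two clusters are separate.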

