You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/07/26 15:00:25 UTC

[GitHub] [hbase] ndimiduk commented on pull request #4601: Backport "HBASE-27155 Improvements to low level scanner tracing" to branch-2.5

ndimiduk commented on PR #4601:
URL: https://github.com/apache/hbase/pull/4601#issuecomment-1195598276

Okay, after too long of a delay, I have collected some data in which I have some amount confidence. The raw data and summaries are in this [Google Sheet](https://docs.google.com/spreadsheets/d/1O6iJQXh3a5y_VN9ChsGICbyoZhsTDNazraqci7GDhuw/edit?usp=sharing) for your examination. The charts are pasted here for your reference.

![Runtime (mins)](https://user-images.githubusercontent.com/45484/181039394-8ac82790-1eb1-41eb-bfc2-37e40273b641.png)
![Read Latency, 95th Percentile (µs)](https://user-images.githubusercontent.com/45484/181039463-72032c40-dfe7-4c23-a1d6-be136b9abea7.png)
![Read Latency, 99th Percentile (µs)](https://user-images.githubusercontent.com/45484/181040550-c804b021-298e-40c1-a127-fcae57dac466.png)

## Test Methodology

I tested three different builds: a baseline (ecf758b9ab), that baseline (ecf758b9ab) + HBASE-27153, and that baseline (ecf758b9ab) + HBASE-27153 + HBASE-27155. In all cases, the test is run with tracing disabled -- all that's measured here is the impact of the code changes made to facilitate manual instrumentation by each patch. The test run was a YCSB workload that I happened to have handy, with 20% writes and 80% random reads. The test runs for a little over 30 minutes. The data collected is the total test runtime, and the read latencies reported by YCSB at p95 and p99.

The test methodology was to first prepare a dataset by populating a pre-split table using the YCSB load feature, flush and major compact the table, snapshot the table. Each test iteration involved dropping the table, cloning the snapshot back into place, and then applying the test workload.

I kept an eye on cluster metrics as things ran. Client and server generally agree on number of requests served/sec. Each test, it took about 20 minutes for the block cache hit rate to climb from a starting point of 50% to the steady state of around 70%. I made an effort to exclude as much as possible the impact of compactions on the test results -- ASYNC_WAL was used, and dropping the table after each test run dropped the pending compaction work that accumulated via the write portion of the workload.

## My Interpretation of results

The changes introduced with HBASE-27153 appear to have an overall positive impact on read throughput and latency, although in latency in particular, there is a disturbingly large amount of internal variance. HBASE-27155 appears to undo all of that improvement and then do additional harm.

It is my opinion that the regression of 2ms at p95 and 1ms at p99 is too expensive to accept for the inclusion of HBASE-27155.

## Next Steps

We can conclude analysis here and decide to commit one or both patches. Or, we can attempt further analysis. The next analysis step I would suggest is collection and comparison of flame graphs at several points during the test period.

Please advise.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@hbase.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org