Posted to issues@bookkeeper.apache.org by GitBox <gi...@apache.org> on 2018/06/06 16:33:59 UTC

[GitHub] nicmichael commented on issue #1489: Better Prevent Read Outliers during short-term Bookie Slow-Down

URL: https://github.com/apache/bookkeeper/issues/1489#issuecomment-395132917
 
 
   The two plots below show the behaviour of the current implementation with RackawareEnsemblePlacementPolicy for a scenario with 3 bookies and a speculativeReadTimeout of 30 ms. Ticks on the x axis are minutes.
   
   Plot #1:
   ![analysis_slowbookies_zoom2b](https://user-images.githubusercontent.com/27270046/41051957-66178628-696c-11e8-88ce-952ac49d04a0.png)
   
   Shortly before minute 01, a read request on bookie 3 reaches the speculativeReadTimeout, and bookie 3 is added to the slowBookie list. As a consequence, it receives no further reads for the next minute (cyan curve drops to 0), and bookie 1 (green curve) sees its read rate double. Before minute 02, the same happens to bookie 2 (magenta curve). From around minute 2:30 until 4:00, overall read load on the bookies increases (not shown in the graph). Shortly before minute 03, both bookie 2 and bookie 3 are on the slowBookie list, and bookie 1 (green) is handling all the requests, which coincides with a steep increase in client-side response times (red curve). After minute 03, it is bookie 2 that handles all requests. This behaviour toggles between bookies several times.
   
   Plot #2:
   ![analysis_slowbookies_zoom](https://user-images.githubusercontent.com/27270046/41051979-70fa6b46-696c-11e8-9cf2-e39d01117ced.png)
   
   Shortly before minute 12, both bookie 1 (green) and bookie 3 (cyan) hit the speculativeReadTimeout and are put on the slowBookie list, and all reads are redirected to bookie 2 (magenta). Neither bookie 1 nor bookie 3 had any longer-lasting performance problem, yet both receive no reads for ~20 seconds, until bookie 2 - overloaded as it has to handle 3x the regular number of reads - also hits the speculativeReadTimeout. With all bookies on the slowBookie list, it is now bookie 3 that gets selected based on the number of pending requests, then also bookie 1, while bookie 2 remains on the slowBookie list for a whole minute.
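   For reference, here is a minimal, self-contained sketch of the quarantine behaviour the plots show - class and method names are illustrative only, not the actual placement policy internals - assuming a fixed one-minute quarantine after a speculative read timeout:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

class SlowBookieList {
    private static final long QUARANTINE_MS = TimeUnit.MINUTES.toMillis(1);

    // bookie id -> wall-clock time until which it is considered slow
    private final ConcurrentMap<String, Long> quarantinedUntil = new ConcurrentHashMap<>();

    // Called when a read to this bookie hits the speculative read timeout.
    void markSlow(String bookie) {
        quarantinedUntil.put(bookie, System.currentTimeMillis() + QUARANTINE_MS);
    }

    // While quarantined, the bookie is moved to the end of the read set and so
    // receives no reads unless all other bookies are also quarantined.
    boolean isSlow(String bookie) {
        Long until = quarantinedUntil.get(bookie);
        return until != null && System.currentTimeMillis() < until;
    }
}
```

   The fixed interval is what produces the step-shaped curves above: a single timeout removes a bookie from the read path for a full minute, regardless of how quickly it recovers.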
   
   I have implemented a slightly different strategy to better deal with short latency spikes. When enabled, it deactivates the current slowBookie list implementation and instead reorders the read set based on the number of pending requests to each bookie, controlled by a configurable threshold. The idea is that on average, all bookies should receive a similar rate of requests and should (if they operate equally fast) have a similar number of outstanding requests. An increase in outstanding requests on one bookie relative to the others indicates that it might have an issue, e.g. it is responding slower or, for a brief period (e.g. during a Java GC), not responding at all. Once the primary bookie's number of outstanding requests grows too large compared to the other bookies in the ensemble, we direct the read request to the bookie with the shortest queue.
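   For illustration, a simplified, self-contained sketch of that reordering rule (names are made up and don't match the actual patch):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class PendingRequestReorderer {

    private final int threshold;

    PendingRequestReorderer(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Returns the read set in preference order. If the preferred (first)
     * bookie is more than `threshold` pending requests ahead of the
     * least-loaded bookie in the ensemble, the set is sorted by ascending
     * queue length; otherwise the original order (and data affinity) is kept.
     */
    List<String> reorder(List<String> readSet, Map<String, Integer> pending) {
        String preferred = readSet.get(0);
        int min = readSet.stream()
                .mapToInt(b -> pending.getOrDefault(b, 0))
                .min()
                .orElse(0);
        if (pending.getOrDefault(preferred, 0) - min <= threshold) {
            return readSet; // preferred bookie looks healthy: keep affinity
        }
        List<String> reordered = new ArrayList<>(readSet);
        reordered.sort(Comparator.comparingInt((String b) -> pending.getOrDefault(b, 0)));
        return reordered;
    }
}
```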
   
   Setting the threshold very low will reorder the read set more frequently and potentially result in better latency, but will also reduce the data affinity of reads. Reads sent to a bookie other than the preferred one have a low chance of being served from that bookie's file system cache and will likely result in a physical read. Small thresholds therefore shuffle read requests more among the bookies and may lead to a reduced file system cache hit rate and more physical reads on disk.
   
   A larger threshold will maintain data affinity and avoid the above problems, but it only kicks in once a bookie has built up a considerable queue of requests. It therefore masks only the larger outliers, but leads to better overall efficiency.
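   A hedged sketch of how the two knobs would be set together on the client - setSpeculativeReadTimeout exists in ClientConfiguration today, while setReorderThresholdPendingRequests is my assumed name for the new threshold setting:

```java
import org.apache.bookkeeper.conf.ClientConfiguration;

public class ReorderConfigExample {
    public static void main(String[] args) {
        ClientConfiguration conf = new ClientConfiguration();
        // existing setting: speculative reads fire after 30 ms, as in the plots above
        conf.setSpeculativeReadTimeout(30);
        // assumed name for the new knob: reorder once the preferred bookie is
        // 100 pending requests ahead of the least-loaded bookie in the ensemble
        conf.setReorderThresholdPendingRequests(100);
    }
}
```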
   
   The number of pending requests to a bookie is a good proxy for that bookie's current performance, as it reacts quickly to any change in behaviour. For example, if a bookie is blocked for ~25 ms by a Java GC, the number of outstanding requests will (depending on the request rate) increase quickly - typically within a few ms - and, after exceeding the threshold, prevent further requests from being sent to that bookie. Once the bookie resumes work (e.g. the Java GC has completed), its queue will quickly drain below the threshold and it can receive requests again. Note that alternative approaches such as a moving average of response times would not achieve this: during a Java GC a bookie doesn't respond at all, so we would only observe the slow responses after the problem - the Java GC - has already been resolved.
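   A minimal sketch of how such per-bookie accounting could look (hypothetical names, not the client's actual bookkeeping): increment when a request is sent, decrement when a response or timeout arrives, so the counter moves within one round trip:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

class PendingRequestCounter {
    private final ConcurrentMap<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    // Called just before a read request goes out on the wire.
    void onRequestSent(String bookie) {
        pending.computeIfAbsent(bookie, b -> new AtomicInteger()).incrementAndGet();
    }

    // Called on response or timeout. A bookie stalled in GC produces neither,
    // so its counter keeps rising until the stall ends - which is exactly why
    // this signal reacts faster than a moving average of response times.
    void onRequestCompleted(String bookie) {
        AtomicInteger c = pending.get(bookie);
        if (c != null) {
            c.decrementAndGet();
        }
    }

    int pendingRequests(String bookie) {
        AtomicInteger c = pending.get(bookie);
        return c == null ? 0 : c.get();
    }
}
```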
   
   Below are two plots illustrating how a threshold of 100 (with the slowBookie list disabled) behaves in comparison: it avoids redirecting all traffic to a single bookie for any extended period of time. In my experiments, this implementation improves client-side response times by 4% (avg) and 14% (99th percentile), and reduces proxy READ_ENTRY times by 5% (avg). A threshold of 10 reorders slightly more frequently, but does not lead to a measurable overall improvement.
   ![analysis_readreordering_zoom3](https://user-images.githubusercontent.com/27270046/41052043-a1b1c720-696c-11e8-96bf-896579899c3c.png)
   ![analysis_readreordering_zoom2](https://user-images.githubusercontent.com/27270046/41052050-a46ee7d6-696c-11e8-855d-0321150f7967.png)
   
