Posted to dev@bookkeeper.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/08 07:05:40 UTC

[jira] [Commented] (BOOKKEEPER-889) BookKeeper client should try not to use bookies with errors/timeouts when forming a new ensemble

    [ https://issues.apache.org/jira/browse/BOOKKEEPER-889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184477#comment-15184477 ] 

ASF GitHub Bot commented on BOOKKEEPER-889:
-------------------------------------------

Github user sijie commented on the pull request:

    https://github.com/apache/bookkeeper/pull/11#issuecomment-193620922
  
    the new change looks good to me. +1


> BookKeeper client should try not to use bookies with errors/timeouts when forming a new ensemble
> ------------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-889
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-889
>             Project: Bookkeeper
>          Issue Type: Improvement
>          Components: bookkeeper-client
>    Affects Versions: 4.3.2
>            Reporter: Siddharth Sunil Boobna
>            Assignee: Siddharth Sunil Boobna
>             Fix For: 4.4.0
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> Due to various issues (slow disks, network problems, bugs, etc.), a bookie can be slow or unresponsive for extended periods of time. During this time, read/write operations against it will fail or time out, and the affected ledgers will create a new segment and form a new ensemble that replaces this bookie. New ledgers, however, might still pick this bookie, and if several bookies are faulty, the replacement might itself be another faulty bookie.
> The BK client should keep stats on failure rates for all bookies and "quarantine" failing bookies for a certain amount of time. Once a bookie is quarantined, it will not be picked when forming a new ensemble unless no other "healthy" bookies are available.
> Solution:
> Keep a counter of errors per bookie in the bookie client pool, periodically check the number of errors in a given time span, and mark the failing bookies as "quarantined" in the BookieWatcher.
> In the BookieWatcher, try to create an ensemble list that excludes the quarantined bookies and, if that fails, fall back to an empty exclusion list.
> We will also remove the bookies from the quarantined list after a configurable period of time.
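
The proposed solution can be sketched roughly as follows. This is only an illustration of the idea in the description, not the code from the pull request: the class QuarantineSketch, its methods, the fixed error threshold, and the use of plain strings as bookie addresses are all hypothetical. Resetting the counters on every periodic check is a simplification of "errors in a given time span".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; these are not the BookKeeper client's actual classes or API. */
class QuarantineSketch {
    private final int errorThreshold;     // assumed: errors per check interval that trigger quarantine
    private final long quarantineMillis;  // configurable quarantine period
    private final Map<String, Integer> errorCounts = new HashMap<>();
    private final Map<String, Long> quarantinedUntil = new HashMap<>();

    QuarantineSketch(int errorThreshold, long quarantineMillis) {
        this.errorThreshold = errorThreshold;
        this.quarantineMillis = quarantineMillis;
    }

    /** Called on each failed or timed-out request to a bookie. */
    synchronized void recordError(String bookie) {
        errorCounts.merge(bookie, 1, Integer::sum);
    }

    /** Periodic check: quarantine bookies at or over the threshold, then reset the counters. */
    synchronized void checkErrors(long nowMillis) {
        for (Map.Entry<String, Integer> e : errorCounts.entrySet()) {
            if (e.getValue() >= errorThreshold) {
                quarantinedUntil.put(e.getKey(), nowMillis + quarantineMillis);
            }
        }
        errorCounts.clear();
    }

    /** Bookies still quarantined at nowMillis; expired entries are removed. */
    synchronized Set<String> quarantined(long nowMillis) {
        quarantinedUntil.values().removeIf(until -> until <= nowMillis);
        return new HashSet<>(quarantinedUntil.keySet());
    }

    /** Form an ensemble without quarantined bookies; if that fails, retry with an empty exclusion list. */
    synchronized List<String> newEnsemble(List<String> available, int ensembleSize, long nowMillis) {
        List<String> picked = pick(available, ensembleSize, quarantined(nowMillis));
        if (picked == null) {
            picked = pick(available, ensembleSize, new HashSet<>());
        }
        if (picked == null) {
            throw new IllegalStateException("not enough bookies for an ensemble of size " + ensembleSize);
        }
        return picked;
    }

    /** Takes the first `size` non-excluded bookies, or returns null if there are too few. */
    private static List<String> pick(List<String> available, int size, Set<String> excluded) {
        List<String> result = new ArrayList<>();
        for (String bookie : available) {
            if (result.size() == size) {
                break;
            }
            if (!excluded.contains(bookie)) {
                result.add(bookie);
            }
        }
        return result.size() == size ? result : null;
    }
}
```

With a threshold of 2 and three bookies b1, b2, b3, two recorded errors on b1 followed by a check would leave b1 out of a two-bookie ensemble, while a request for three bookies would fall back to the empty exclusion list and include b1 again. Timestamps are passed in explicitly here only to keep the sketch easy to exercise; a real client would more likely read a clock and pick bookies according to its placement policy.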



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)