Posted to dev@cassandra.apache.org by David Capwell <dc...@gmail.com> on 2021/08/26 00:46:15 UTC

[DISCUSS] Repair Improvement Proposal

Now that 4.0 is out, I want to bring up improving repair again (earlier
thread
http://mail-archives.apache.org/mod_mbox/cassandra-commits/201911.mbox/%3CJIRA.13266448.1572997299000.99567.1572997440168@Atlassian.JIRA%3E),
specifically the following two JIRAs:


CASSANDRA-15566 - Repair coordinator can hang under some cases

CASSANDRA-15399 - Add ability to track state in repair


Right now repair has an issue if any message is lost, which leads to hung
or timed-out repairs; in addition, there is very little visibility into
what is going on, and correlating coordinator state with participant
state is even harder.


I propose the following changes to improve our current repair subsystem:



   1. New tracking system for coordinator and participants (covered by
   CASSANDRA-15399).  This system will expose progress on each instance and
   expose this information for internal access as well as external users
   2. Add retries to specific stages of coordination, such as prepare and
   validate.  In order to do these retries we first need to know the
   state of the participant which has yet to reply; this will leverage
   CASSANDRA-15399 to see what's going on (has the prepare been seen?  Is
   validation running?  Did it complete?).  In addition to checking the
   state, we will need to store the validation MerkleTree; this allows the
   coordinator to fetch it if it goes missing (it can be dropped en route
   to the coordinator, or even on the coordinator).
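To make item 1 concrete, here is a minimal sketch of the kind of participant-side state tracking being proposed. The state names and the `ParticipantTracker` API are purely illustrative assumptions for this sketch, not the actual CASSANDRA-15399 design:

```python
from enum import Enum, auto

# Illustrative participant-side repair states; the real state machine in
# CASSANDRA-15399 may differ.
class ParticipantState(Enum):
    PREPARE_RECEIVED = auto()
    VALIDATING = auto()
    VALIDATION_COMPLETE = auto()
    SYNCING = auto()
    DONE = auto()
    FAILED = auto()

# Legal forward transitions; anything else indicates a lost message or a bug.
TRANSITIONS = {
    ParticipantState.PREPARE_RECEIVED: {ParticipantState.VALIDATING,
                                        ParticipantState.FAILED},
    ParticipantState.VALIDATING: {ParticipantState.VALIDATION_COMPLETE,
                                  ParticipantState.FAILED},
    ParticipantState.VALIDATION_COMPLETE: {ParticipantState.SYNCING,
                                           ParticipantState.DONE,
                                           ParticipantState.FAILED},
    ParticipantState.SYNCING: {ParticipantState.DONE, ParticipantState.FAILED},
}

class ParticipantTracker:
    """Tracks per-session repair state so a coordinator can ask: has the
    prepare been seen?  Is validation running?  Did it complete?"""
    def __init__(self):
        self.sessions = {}

    def advance(self, session_id, new_state):
        cur = self.sessions.get(session_id)
        if cur is not None and new_state not in TRANSITIONS.get(cur, set()):
            raise ValueError(f"illegal transition {cur} -> {new_state}")
        self.sessions[session_id] = new_state

    def state(self, session_id):
        # Returning None tells the coordinator the prepare was never seen,
        # so it can safely resend it.
        return self.sessions.get(session_id)
```

With this, a coordinator whose prepare went unanswered can distinguish "prepare never arrived" (state is None, resend) from "validation still running" (wait) without guessing.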


What is not in scope?

   - Rewriting all of Repair; the idea is that specific "small" changes can
   fix 80% of the issues
   - Handling coordinator node failure.  Being able to recover from a failed
   coordinator should be possible after the above work is done, so it is
   seen as tangential and can be done later
   - Recovery from a downed participant.  Similar to the previous bullet:
   with the state being tracked, it acts as a kind of checkpoint, so future
   work can come in to handle recovery
   - Handling "too large" ranges.  Ideally we should add an ability to split
   the coordination into sub-repairs, but this is not the goal of this work
   - Overstreaming.  This is a byproduct of the previous "not in scope"
   bullet and/or large partitions, so it is tangential to this work


Wanted to share here before starting this work again; let me know if there
are any concerns or feedback!

Re: [DISCUSS] Repair Improvement Proposal

Posted by David Capwell <dc...@apple.com.INVALID>.
Cool, moving this from the dev list to JIRA; I will start breaking down tasks and documenting my progress there:

https://issues.apache.org/jira/browse/CASSANDRA-16909

> On Aug 27, 2021, at 1:21 PM, David Capwell <dc...@apple.com.INVALID> wrote:
> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: [DISCUSS] Repair Improvement Proposal

Posted by David Capwell <dc...@apple.com.INVALID>.
Push vs pull isn’t too critical, but there is one edge case to consider: if we don’t realize the participant got restarted, triggering validation again could be a problem (the restart may have caused the original validation to end).
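The restart edge case can be made concrete with an incarnation id that changes on process restart, so the coordinator knows when cached participant state is stale. All names here are illustrative assumptions, not actual Cassandra classes:

```python
import uuid

class Participant:
    """Each process start gets a fresh incarnation id; any state the
    coordinator cached under an older incarnation is known to be stale."""
    def __init__(self):
        self.incarnation = uuid.uuid4()

    def restart(self):
        # Simulates a process restart: in-memory repair state is gone and
        # the incarnation changes.
        self.incarnation = uuid.uuid4()

class Coordinator:
    def __init__(self):
        self.seen = {}  # participant name -> incarnation observed at prepare

    def observe(self, name, participant):
        self.seen[name] = participant.incarnation

    def restarted_since_prepare(self, name, participant):
        # A mismatch means the participant restarted and any in-flight
        # validation is gone; blindly retriggering validation without
        # knowing this is exactly the edge case above.
        return self.seen.get(name) != participant.incarnation
```

Before retrying a stage, the coordinator checks the incarnation; a mismatch means it must restart the session (or fail cleanly) rather than retry into a process that has lost its state.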

> On Aug 26, 2021, at 9:50 AM, Yifan Cai <yc...@gmail.com> wrote:
> [...]




Re: [DISCUSS] Repair Improvement Proposal

Posted by Yifan Cai <yc...@gmail.com>.
> 2. Add retries to specific stages of coordination, such as prepare and
>    validate. In order to do these retries we first need to know what the
>    state is for the participant which has yet to reply...


If I understand it correctly, does this mean retries only happen in the
coordinator, and the coordinator pulls the states of the participants
periodically?
If the handling of requests in the participant is made idempotent
(which I think is required for retries anyway), pulling the state is
unnecessary. For example, the coordinator can just send the PrepareRequest
at regular intervals until it receives the PrepareResponse.
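The retry loop described above can be sketched as follows. `Participant`, `handle_prepare`, and `coordinate_prepare` are hypothetical names for illustration (with the sleep between attempts omitted), not the real Cassandra message handlers:

```python
class Participant:
    def __init__(self):
        # session_id -> response; caching makes prepare idempotent.
        self.prepared = {}

    def handle_prepare(self, session_id):
        # Idempotent: a repeated request returns the cached result instead
        # of redoing the work, so the coordinator can retry blindly.
        if session_id not in self.prepared:
            self.prepared[session_id] = f"prepared:{session_id}"
        return self.prepared[session_id]

def coordinate_prepare(participant, session_id, send, max_attempts=5):
    """Resend the PrepareRequest (here via `send`, which may return None
    for a dropped message) until a PrepareResponse arrives; a real
    implementation would sleep between attempts."""
    for _ in range(max_attempts):
        response = send(participant, session_id)
        if response is not None:
            return response
    raise TimeoutError("participant never acknowledged prepare")
```

Because `handle_prepare` is idempotent, the coordinator never needs to ask "did my earlier prepare arrive?"; it just keeps sending until it hears back.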

- Yifan

On Thu, Aug 26, 2021 at 8:56 AM Blake Eggleston
<be...@apple.com.invalid> wrote:

> [...]

Re: [DISCUSS] Repair Improvement Proposal

Posted by Blake Eggleston <be...@apple.com.INVALID>.
+1 from me, any improvement in this area would be great.

It would be nice if this could include visibility into repair streams, but just exposing the repair state will be a big improvement.

> On Aug 25, 2021, at 5:46 PM, David Capwell <dc...@gmail.com> wrote:
> [...]

