Posted to dev@aurora.apache.org by Jordan Ly <jo...@gmail.com> on 2017/08/31 02:18:51 UTC
[Design Doc] Hot Standby in Replicas to Reduce Failover Time
Hi everyone,
Following up on the discussion here:
https://lists.apache.org/thread.html/e31d7dbcb054ed570f969ae2043eadfc090383edfe0751cec59b29d3@%3Cdev.aurora.apache.org%3E
I've created a design document detailing the implementation of a "hot
standby" mechanism in which scheduler followers eagerly read and
apply entries from the replicated log. The goal of this change is
that, in the event of a failover, the newly elected leader will not
have to replay as many entries to rebuild its state and can thus start
serving traffic faster.
https://docs.google.com/document/d/1DOtKA4-vrtxat1MaUYMQ6Y1iXhA8ob6Mfztzt-R1Oss/edit?usp=sharing
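
To make the idea concrete, here is a minimal Python sketch of the difference
between a cold and a hot standby. It is an illustration only, not code from
the design doc or the prototype: the Replica class, catch_up, become_leader,
simulate, and the tail_interval parameter are all hypothetical names, and the
replicated log is modeled as a plain in-memory list rather than the Mesos
replicated log.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical scheduler replica; not Aurora's actual storage classes."""

    state: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already applied to `state`

    def catch_up(self, log):
        """Apply every entry past our position; return how many were applied."""
        pending = log[self.applied:]
        for key, value in pending:
            self.state[key] = value
        self.applied += len(pending)
        return len(pending)

    def become_leader(self, log):
        """On election, replay whatever is still missing before serving."""
        return self.catch_up(log)


def simulate(total_entries, hot_standby, tail_interval=10):
    """Return (entries replayed at failover, final state) for one follower."""
    log = []  # stand-in for the replicated log (assumption: a simple list)
    follower = Replica()
    for i in range(total_entries):
        log.append((f"task-{i % 5}", i))  # the current leader appends a write
        if hot_standby and (i + 1) % tail_interval == 0:
            follower.catch_up(log)  # hot standby: follower eagerly tails the log
    replayed = follower.become_leader(log)  # failover happens here
    return replayed, follower.state


if __name__ == "__main__":
    cold_replayed, cold_state = simulate(1003, hot_standby=False)
    hot_replayed, hot_state = simulate(1003, hot_standby=True)
    print("cold standby replayed:", cold_replayed)
    print("hot standby replayed:", hot_replayed)
    print("same final state:", cold_state == hot_state)
```

In this toy model both followers end up with the same state, but the hot
standby only has to replay the entries written since its last catch-up, so the
work left at failover is bounded by how far it lags behind rather than by the
total length of the log.
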
I have a working prototype of the above design running on a test
cluster. Please feel free to comment on the doc!
This document references a current proposal in Mesos by Ilya Pronin
here: https://lists.apache.org/thread.html/1b8fd10e151054a85c9ea3dc808f7fecb9a87fe5f5e87b10caa46e2a@%3Cdev.mesos.apache.org%3E
Cheers,
Jordan Ly
Re: [Design Doc] Hot Standby in Replicas to Reduce Failover Time
Posted by David McLaughlin <dm...@apache.org>.
+1, this proposal looks sound to me. I'll leave some minor feedback on the
doc, but none of it will be blocking.
On Mon, Sep 4, 2017 at 10:31 AM, Erb, Stephan <St...@blue-yonder.com>
wrote:
> Thanks for the detailed design document and the in-depth walkthrough [1]!
> Your proposal seems to be sound. (But be warned, I don’t have much
> experience in this part of Aurora or Mesos :-))
>
> [1] https://docs.google.com/presentation/d/1fQMfNLaRex9rJyq3h08HIujtpULoYnpFFV7-P6p6Zt0/edit#slide=id.p4
Re: [Design Doc] Hot Standby in Replicas to Reduce Failover Time
Posted by "Erb, Stephan" <St...@blue-yonder.com>.
Thanks for the detailed design document and the in-depth walkthrough [1]!
Your proposal seems to be sound. (But be warned, I don’t have much experience in this part of Aurora or Mesos :-))
[1] https://docs.google.com/presentation/d/1fQMfNLaRex9rJyq3h08HIujtpULoYnpFFV7-P6p6Zt0/edit#slide=id.p4