Posted to dev@aurora.apache.org by Renan DelValle <re...@gmail.com> on 2018/06/02 22:54:18 UTC

Recovery instructions updates

Hi all,

We tried following the recovery instructions from
http://aurora.apache.org/documentation/latest/operations/backup-restore/

After our change from the Twitter commons ZK library to Apache Curator,
these instructions are no longer valid.

In order for Aurora to carry out a leader election in the current state,
Aurora has to first connect to a Mesos master. What we ended up doing was
connecting to a Mesos master that had nothing on it to bypass this new
requirement.
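
A minimal sketch of that workaround, assuming a scratch single-node
Mesos master; the address, work dir, and scheduler launcher below are
illustrative assumptions, not our actual deployment:

# Stand up an empty Mesos master for the scheduler to register with.
mesos-master --work_dir=/tmp/mesos-scratch --ip=127.0.0.1 --port=5050 &

# Point the scheduler at the scratch master so leader election can
# proceed; -mesos_master_address is a stock scheduler flag, the value
# here is made up. All other production flags stay unchanged.
aurora-scheduler -mesos_master_address=127.0.0.1:5050 ...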

Next, wiping away the contents of -native_log_file_path did not seem to
be enough to recover from a corrupted mesos replicated log. We had to
manually wipe away entries in ZK and move the snapshot backup directory
so that the leader would not fall back on either a snapshot or the
mesos-log to rehydrate itself.
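
A sketch of the extra wipe steps we ended up running; every path and
znode below is an assumption standing in for our -native_log_file_path,
-backup_dir, and ZK settings:

# On every scheduler node, with the schedulers stopped:
rm -rf /var/lib/aurora/scheduler/db                 # -native_log_file_path
mv /var/lib/aurora/backups /var/lib/aurora/backups.old    # -backup_dir

# Remove the scheduler's znodes so there is nothing left to rehydrate
# from; `deleteall` is the current zkCli verb, older releases use `rmr`.
zkCli.sh -server zk1:2181 deleteall /aurora/scheduler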

Finally, somehow triggering a manual snapshot generated a snapshot with an
invalid entry which then caused the scheduler to fail after a failover
while trying to catch up on current state.

We are trying to investigate why this took place (it could be that we
didn't give the system enough time to finish hydrating the snapshot),
but the invalid entry, which looked something like a Task with all null
or 0 values, caused our leaders to fail and necessitated restoring from
an earlier snapshot. Note that this was only after we triggered the
manual snapshot and BEFORE we tried to restore.
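
For reference, a hedged sketch of how the snapshot and the failover
were driven; the cluster name is illustrative:

# Force the leading scheduler to write a snapshot of the replicated log.
aurora_admin scheduler_snapshot devcluster

# Fail over; the scheduler that wins the next election replays the
# fresh snapshot while catching up, which is where the failure surfaced.
sudo systemctl restart aurora-scheduler    # on the current leader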

Will report more details as they become available and will provide some doc
updates based on our experience.

-Renan

Re: Recovery instructions updates

Posted by meghdoot bhattacharya <me...@yahoo.com.INVALID>.
From a cluster setup perspective, I am guessing the sequence would be
(reading the e2e test):
1. Bring all schedulers down.
2. Clean up the replicated log dir and reinitialize the mesos log on all nodes.
3. Run the recovery on one node and start up that scheduler (see the sketch below).
4. Start the scheduler on the other nodes.
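
Hedged sketch of the above; host names, paths, and the backup id are
assumptions, and step 3 is shown via the classic aurora_admin flow since
the new offline tool's exact invocation lives in the e2e test:

# 1. Stop every scheduler.
for h in sched1 sched2 sched3; do
  ssh "$h" sudo systemctl stop aurora-scheduler
done

# 2. Wipe and reinitialize the replicated log on every node.
for h in sched1 sched2 sched3; do
  ssh "$h" 'rm -rf /var/lib/aurora/scheduler/db &&
            mesos-log initialize --path=/var/lib/aurora/scheduler/db'
done

# 3. Recover the backup on one node and start its scheduler.
ssh sched1 sudo systemctl start aurora-scheduler
aurora_admin scheduler_stage_recovery devcluster scheduler-backup-2018-06-02
aurora_admin scheduler_commit_recovery devcluster

# 4. Start the schedulers on the remaining nodes.
for h in sched2 sched3; do
  ssh "$h" sudo systemctl start aurora-scheduler
done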
We will test it as well and update the docs.
Thx


Re: Recovery instructions updates

Posted by Meghdoot bhattacharya <me...@yahoo.com.INVALID>.
Thx Bill. We will take it up.


Re: Recovery instructions updates

Posted by Bill Farner <wf...@apache.org>.
> How does the site get updated? Is it auto-generated when we build releases?


The source lives in the project SVN repo:

$ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn

Here <https://svn.apache.org/repos/asf/aurora/site/README.md> are
instructions for updating it.  It's a pretty mechanical process, but not
automated.
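
For the docs change itself, a hedged sketch of the mechanical flow; the
actual build and publish steps live in that README, not here:

$ svn co https://svn.apache.org/repos/asf/aurora/ aurora-svn
$ cd aurora-svn/site
$ # ...edit the backup/restore page...
$ svn status
$ svn commit -m 'Update backup/restore docs for Curator-based discovery'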

> So, do we plan to add the patch for next release?


Meghdoot - I suspect David would appreciate an incoming patch for the docs,
assuming that's what you're referring to.

He mentioned this step
<https://github.com/apache/aurora/blob/34be631589ebf899e663b698dc76511eb1b9ad8a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L523-L530>
in the end-to-end tests, which is (hopefully) straightforward enough to try
without assistance.




Re: Recovery instructions updates

Posted by meghdoot bhattacharya <me...@yahoo.com.INVALID>.
Thx David. So, do we plan to add the patch for next release? We will be happy to validate it as part of rc validation.




Re: Recovery instructions updates

Posted by David McLaughlin <dm...@apache.org>.
We should definitely update that doc; Bill's patch makes this much easier
(as can be seen by the e2e test), and we've been using it in our scale test
environment. How does the site get updated? Is it auto-generated when we
build releases?

Having corrupted logs that frequently is concerning too; we haven't seen
anything like this, and we do explicit snapshots/backups as part of every
Scheduler deploy. If there's a bug lurking, it would be good to get in
front of it.
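
For what it's worth, a hedged sketch of the explicit backup step in a
deploy; the cluster name is illustrative:

# Ask the leading scheduler to write a backup before rolling schedulers.
aurora_admin scheduler_backup_now devcluster
aurora_admin scheduler_list_backups devcluster    # confirm it landed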


Re: Recovery instructions updates

Posted by Meghdoot bhattacharya <me...@yahoo.com.INVALID>.
We will try to recover the log files on the snapshot loading error.

+1 to Bill's approach of making recovery offline. We will try the patch on our side.

Renan, I would ask you to prepare a PR for the restoration docs proposing the 2 additional steps required in the current world, as we look into maybe using a different mechanism. The prep steps to get the scheduler ready for backup can hopefully be eliminated with the alternative approach.

On the side, let's see if we can recover the logs from the corrupted snapshot loading.


Thx



Re: Recovery instructions updates

Posted by Stephan Erb <st...@blue-yonder.com>.
That indeed sounds concerning. It would be great if you could file an issue and attach the related log files and tracebacks.

Bill recently added a potential replacement for the existing restore mechanism: https://github.com/apache/aurora/commit/2e1ca42887bc8ea1e8c6cddebe9d1cf29268c714. Given the set of issues you have bumped into with the current restore, this new approach might be worth exploring further.



Re: Recovery instructions updates

Posted by Meghdoot bhattacharya <me...@yahoo.com.INVALID>.
Thx Renan for sharing the details. This backup restore happened under not-so-easy circumstances, so I would encourage the leads to keep the docs updated as much as possible and to include this in release validation.

The other issue, snapshots having Task and other objects as nil, which causes the schedulers to fail, we have now seen 2 times in the past year. Other than finding the root cause of why that entry appears during snapshot creation, there needs to be defensive code either to ignore that entry on loading or a way to fix the snapshot. Otherwise we might have to go through a day's worth of snapshots to find the one that does not have that entry and recover from there. Mean time to recover gets impacted under those circumstances. One extra piece of info, not sure whether it is relevant: the corrupted snapshot was created by the admin CLI (the assumption being that it should not matter whether the scheduler triggers it or it is forced via the CLI), which reported success, as did the aurora logs, but loading it exposed the issue.
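
A hedged sketch of walking back through a day's worth of backups with
the admin commands from the backup-restore doc; the cluster name and
backup id are assumptions:

aurora_admin scheduler_list_backups devcluster

# Stage the newest candidate and eyeball its tasks for the nil/zeroed
# entry.
aurora_admin scheduler_stage_recovery devcluster scheduler-backup-2018-06-02-22
aurora_admin scheduler_print_recovery_tasks devcluster

# If it is dirty, unload and repeat with the previous backup...
aurora_admin scheduler_unload_recovery devcluster
# ...and once a clean one is staged, commit it.
aurora_admin scheduler_commit_recovery devcluster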

Thx
