Posted to dev@accumulo.apache.org by Adam Lerman <al...@gmail.com> on 2022/05/09 10:37:19 UTC

Rolling Update for patches

Accumulo Devs --

I wanted to put out a feeler to see whether there is interest in adding a
method for rolling updates to Accumulo, especially for patch updates. I would
love to see this adopted in the future so that patch updates could be applied
with no downtime for the cluster.

My general thought would be:

1) Put the system in upgrade mode (shell command)
           - suspend migrations, increase tserver.suspend.duration
2) Manually update and roll tservers -- slowly so as not to cause too much
churn
3) Bounce the common services (manager, gc, etc.)
4) Verify all looks good
5) Take the system out of upgrade mode
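
As a rough illustration only, the five steps above could be driven by a small orchestration function like the sketch below. None of this is an existing Accumulo API: `rolling_update` and every callback it takes are hypothetical placeholders that an operator would have to back with their own shell commands or service manager. The sketch deliberately stays in upgrade mode if a tserver or the final verification fails, on the assumption that an operator would want to investigate before migrations resume.

```python
from typing import Callable, Iterable, List


def rolling_update(
    tservers: Iterable[str],
    enter_upgrade_mode: Callable[[], None],
    restart_tserver: Callable[[str], None],
    tserver_is_healthy: Callable[[str], bool],
    bounce_common_services: Callable[[], None],
    cluster_looks_good: Callable[[], bool],
    leave_upgrade_mode: Callable[[], None],
) -> List[str]:
    """Run the five proposed steps in order; all callbacks are hypothetical hooks."""
    rolled: List[str] = []

    # Step 1: enter upgrade mode (e.g. suspend migrations, raise suspend duration).
    enter_upgrade_mode()

    # Step 2: update and roll tservers one at a time to limit churn.
    for host in tservers:
        restart_tserver(host)
        if not tserver_is_healthy(host):
            # Stop the roll and stay in upgrade mode so an operator can look.
            raise RuntimeError(f"{host} did not come back healthy; roll stopped")
        rolled.append(host)

    # Step 3: bounce the common services (manager, gc, etc.).
    bounce_common_services()

    # Step 4: verify all looks good before leaving upgrade mode.
    if not cluster_looks_good():
        raise RuntimeError("cluster verification failed; still in upgrade mode")

    # Step 5: take the system out of upgrade mode.
    leave_upgrade_mode()
    return rolled
```

Passing the steps in as callbacks keeps the restart mechanism outside the function, so the same ordering works whether tservers are managed by systemd, a container platform, or something else.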

Does anyone have any thoughts about adding something like this?

Thanks!

Adam

Re: Rolling Update for patches

Posted by Mike Miller <mm...@apache.org>.
There is already a SAFE_MODE for the Manager. Were you thinking of adding
another state for the Manager?


RE: Rolling Update for patches

Posted by dev1 <de...@etcoleman.com>.
Why are you specifically calling out pausing migrations?

Other than suspending migrations, what you have outlined is essentially the procedure that can be followed today. I'm not sure how much impact pausing migrations would have, as long as the suspend duration is increased.

Overall, I think we should be striving to have services that are designed to be crash-only. That way a rolling restart is not a special case, just a series of planned crashes. The suspend duration may be a special provision to limit churn, but it does slow recovery, so there is a trade-off there. Another trade-off would be flushing the tablets of a tserver before the restart: this goes against the crash-only philosophy, but flushing can minimize the recovery necessary. Working towards minimizing recovery would have benefits for the system in general, not just to support rolling restarts.

Potential issues that may need to be addressed:

- What do you do about ingest? You'd need to account for both bulk and continuous ingest. Stopping ingest for the entirety of the procedure might not be desirable, but allowing it to continue would likely have similar impacts to allowing migrations to continue. Systems that perform a lot of continuous ingest would also likely benefit from flushing if ingest is not paused.
- What about compactions?
- The restarting of the tservers likely needs to be handled outside of Accumulo. There are too many ways that services are managed to account for every variation. We could provide examples, but ultimately cluster users would need to tailor systemd, or whatever they happen to use, to their needs.
- The duration of the restart is very user dependent. Some could decide that a very slow walk would be "best" to minimize possible impacts to user scans, while others could opt to just rip off the band-aid, where user scans would be more likely to be impacted but the impact would occur over a smaller, defined window. Some may decide that it should be completed within an hour, others might decide that completion within a single shift is acceptable, and others, well, let's really stretch this out.
- Do you want to make special provisions for tservers that are hosting the root and metadata tablet(s)? If you identify those servers, you can elect to do them first so that they are out of the way, or do them last, or maybe it does not matter? These tablets are the ones most likely to benefit from flushing before the restart, to keep recovery to the minimum practical. Depending on settings, flushing the metadata table may really help, for example on a very active system with long periods between gc runs, and depending on the gc flush / compaction settings. The metadata should recover without any special provisions, but there are opportunities to speed up the process.
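
To make the pacing and ordering questions in the last two points concrete, here is a minimal planning sketch. It is an illustration under stated assumptions, not anything Accumulo provides: `plan_roll`, its parameters, and the idea that the operator already knows which hosts serve the root and metadata tablets are all hypothetical. It spreads restarts evenly over a chosen window and puts the metadata-hosting servers either first or last.

```python
from typing import Iterable, List, Set, Tuple


def plan_roll(
    tservers: Iterable[str],
    metadata_hosts: Set[str],
    window_seconds: float,
    metadata_first: bool = True,
) -> List[Tuple[float, str]]:
    """Return (start_offset_seconds, host) pairs for a rolling restart.

    Hypothetical helper: the caller supplies the host list and the set of
    hosts currently serving root/metadata tablets.
    """
    hosts = list(tservers)
    special = [h for h in hosts if h in metadata_hosts]
    ordinary = [h for h in hosts if h not in metadata_hosts]

    # Operator's choice: get the root/metadata hosts out of the way, or save them for last.
    ordered = special + ordinary if metadata_first else ordinary + special
    if not ordered:
        return []

    # Spread restarts evenly across the window; a longer window is the
    # "slow walk", a shorter one is "rip off the band-aid".
    interval = window_seconds / len(ordered)
    return [(i * interval, host) for i, host in enumerate(ordered)]
```

For example, `plan_roll(["a", "b", "c", "d"], {"c"}, 3600)` schedules host `c` at offset 0 and the remaining hosts at 15-minute intervals. Whether an even spacing is the right policy, and whether flushing should happen before each restart, is left to the operator.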

Ed Coleman
