Posted to user@accumulo.apache.org by "Ligade, Shailesh [USA]" <Li...@bah.com> on 2021/12/02 14:43:31 UTC

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Mike,

If I set the value to 0s (the default) or to 5m, when I restart the tserver through systemd (it is pretty quick, on the order of a second), I still get unassigned tablets on the monitor page. My understanding is that with a setting of 5m (or 200s, etc.), the master will wait that much time before it starts moving unassigned tablets. In my situation, the unassigned tablet count goes back to zero only after a long time, and hence rolling restarts take a lot longer (hours in most cases, depending on how many tablets there are per tserver).

This setting appears to be working on Accumulo 2.0.1, but since that is not my prod version I have not tested it completely.

Thanks

-S
From: Mike Miller <mm...@apache.org>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0" do you mean that the feature is not helping to solve your problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] <Li...@bah.com> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious: since that setting (table.suspend.duration) is not working for me in Accumulo 1.10.0, can I just stop both masters and then restart the tservers one at a time (or all at once)? Will that speed up the restart without getting into this offline-tablet situation or a data-loss situation? I can stop the ingest, flush the tables, and then bring down the master…

We can take a short downtime, and my understanding is that the master is the one keeping track of the tservers and the offline-tablet situation. So just curious…

Thanks again

-S

From: dev1 <de...@etcoleman.com>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration as you have in this email) – but the shell should have thrown an error saying the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to come up)

I’m assuming the requirement is to keep the cluster up and serving users without major disruption – not to rip through the restart as fast as possible.  With 6 – 8 nodes you should still be able to do this in under an hour.  If you had a much larger cluster then the concept is the same but you would want to use some number of tservers that is a fraction of the total available that would be cycled at any given point in time.
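As a rough illustration of cycling only a fraction of the tservers at any given time, here is a minimal Python sketch. The function name restart_batches, the fraction parameter, and the host names are assumptions made up for this example; none of them come from Accumulo itself.

```python
import math


def restart_batches(tservers, fraction=0.1):
    """Split a list of tserver hostnames into consecutive restart batches.

    Each batch holds at most `fraction` of the cluster (always at least
    one host), so only that share of tservers is offline at once.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the range (0, 1]")
    size = max(1, math.floor(len(tservers) * fraction))
    return [tservers[i:i + size] for i in range(0, len(tservers), size)]


if __name__ == "__main__":
    # Hypothetical 8-node cluster, restarting a quarter of it at a time.
    hosts = [f"tserver{n}" for n in range(1, 9)]
    for batch in restart_batches(hosts, fraction=0.25):
        print(batch)
```

For a small cluster of 6 to 8 nodes the default fraction gives batches of one host, which matches the one-at-a-time approach described in this message.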

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option – this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)
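The steps above can be sketched as a dry-run plan in Python. This is only an outline under stated assumptions: the systemd unit name accumulo-tserver, the use of ssh, the pause lengths, and the (where, command) tuple layout are all hypothetical. The two Accumulo shell commands are the config -s form used in this thread and the shell's flush command with its wait flag. The sketch only builds a list of steps and executes nothing.

```python
def rolling_restart_plan(tservers, tables, suspend="5m",
                         zk_pause_s=30, recovery_pause_s=60,
                         unit="accumulo-tserver"):
    """Return an ordered list of (where, command) steps for a rolling restart.

    Nothing is executed; the caller decides how to run each step.
    The unit name and the pause lengths are assumptions, not Accumulo defaults.
    """
    plan = []
    # Step 2: flush each continuously ingested table and wait for completion.
    for table in tables:
        plan.append(("accumulo-shell", f"flush -t {table} -w"))
    # Step 3: delay reassignment while a tserver is restarting.
    plan.append(("accumulo-shell", f"config -s table.suspend.duration={suspend}"))
    # Step 4: stop, pause, restart, pause for every tserver in turn.
    for host in tservers:
        plan.append((f"ssh {host}", f"systemctl stop {unit}"))
        plan.append(("sleep", str(zk_pause_s)))        # let ZooKeeper notice
        plan.append((f"ssh {host}", f"systemctl start {unit}"))
        plan.append(("sleep", str(recovery_pause_s)))  # allow any recovery
    # Step 5: put the property back to its default.
    plan.append(("accumulo-shell", "config -s table.suspend.duration=0s"))
    return plan


if __name__ == "__main__":
    for where, command in rolling_restart_plan(["ts1", "ts2"], ["mytable"]):
        print(f"{where}: {command}")
```

Pausing ingest (step 1) is left out because it depends entirely on how ingest is driven at a given site.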

If you tail the master / manager debug log you should get a good idea of what is going on: there should be messages showing the tserver leaving and then rejoining, and any other activity related to recovery.  With a rolling restart the idea is to keep the cluster up and serving tables: only one (or a few) tservers go offline, for a short duration (generally less than a minute), and between each tserver restart time is allowed for things to stabilize.


From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm, I updated the setting tablet.suspended.duration to 5m:

config -s tablet.suspended.duration=5m

but when I issued restart tserver (one at a time without waiting for first to come up), I still get all tablets unassigned 🙁 Maybe I need to bring the masters down first?

By the way, this is for Accumulo 1.10.0.

Am I missing anything?

-S
________________________________
From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

Stop the cluster using admin stop? The issue is that, since we are using systemd with restart=always, it interferes with any of those stop commands/scripts (stop-all, stop-here, etc.). So either we have to modify the systemd settings or maybe just do a shut-down-the-VM type of operation (I think that is a little brutal).

-S
________________________________
From: Michael Wall <mj...@gmail.com>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 <de...@etcoleman.com> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect immediately?



Thanks



-S



From: dev1 <de...@etcoleman.com>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo tserver rolling restart



See https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node, a note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the reassignment while a tserver is restarting. The trade-off is that the data is not available during that time, so try to minimize the time the tserver is offline.



From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say after I updated the tserver heap size. Since we are using systemd, I can issue a restart command on a tserver. This causes all sorts of tablet movement even though Accumulo is down for maybe a second. If I wait for the unassigned tablet count to drop to 0 before restarting the next tserver, then completely restarting a small cluster (6-8 nodes) takes hours (roughly 4k+ tablets per tserver).



What would be the right way to perform such a routine maintenance operation? Is there a delay setting we can change so that it will not move tablets around? What would be a safe delay value?



-S

Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by Mike Miller <mm...@apache.org>.

"we will always see unassigned tablets on the monitor page, when tserver
goes down." Not necessarily. From your previous email, it sounded as if you
were expecting unassigned tablets to become suspended. I was trying to make
it clear that once a tablet becomes unassigned, it can't be suspended. Only
tablets that are currently hosted on a tserver can be suspended.
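To make the state transitions described in this thread concrete, here is a small Python model. It is a simplified sketch of the behaviour as explained in these messages, not Accumulo's implementation: the function name tablet_state and its parameters are assumptions, and it only tracks one tablet relative to the tserver that originally hosted it.

```python
from enum import Enum


class TabletState(Enum):
    HOSTED = "HOSTED"
    SUSPENDED = "SUSPENDED"
    UNASSIGNED = "UNASSIGNED"


def tablet_state(suspend_s, now_s, returned_at_s=None):
    """State of a tablet `now_s` seconds after its hosting tserver stopped.

    suspend_s     -- table.suspend.duration in seconds (0 is the default)
    returned_at_s -- seconds after the stop at which the old tserver came
                     back, or None if it has not come back
    """
    if suspend_s <= 0:
        # No suspension: the tablet is unassigned right away and the
        # master is free to hand it to another tserver.
        return TabletState.UNASSIGNED
    if returned_at_s is not None and returned_at_s <= min(now_s, suspend_s):
        # The old tserver returned inside the window (assumed here to
        # mean the tablet is hosted on it again).
        return TabletState.HOSTED
    if now_s < suspend_s:
        return TabletState.SUSPENDED
    # The window passed first; an unassigned tablet is never suspended.
    return TabletState.UNASSIGNED


if __name__ == "__main__":
    print(tablet_state(300, 10))                    # still inside the window
    print(tablet_state(300, 20, returned_at_s=15))  # tserver came back in time
    print(tablet_state(300, 400))                   # window expired
```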

On Thu, Dec 2, 2021 at 10:49 AM Ligade, Shailesh [USA] <
Ligade_Shailesh@bah.com> wrote:

> Thanks Mike,
>
>
>
> So let me see if I understood this,
>
>
>
> Doesn’t matter what this suspend.duration setting is, we will always see
> unassigned tablets on the monitor page, when tserver goes down.
>
>
>
> If the setting is high enough then master is basically assigning the same
> old tablets to that tserver, when it is back online, and thus will not move
> any tablets around.
>
>
>
> If the duration is the default (0s) or short, then the master will start
> reassigning tablets to other tservers. And when the original tserver comes
> back up, the master will try to rebalance tablets (it may not get the old
> tablets back).
>
>
>
> And thus having that setting high enough will make things faster to
> recover.
>
>
>
> Thanks
>
>
>
> -S
>
>
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* Thursday, December 2, 2021 10:39 AM
> *To:* user@accumulo.apache.org
> *Subject:* Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver
> rolling restart
>
>
>
> Some things to keep in mind... The Master will wait the
> table.suspend.duration before reassigning the SUSPENDED tablets to new
> tservers. With the table.suspend.duration set > 0, a tablet will go from
> HOSTED to SUSPENDED if its tserver is shut down. It will then stay
> SUSPENDED until its old tserver is available or table.suspend.duration has
> passed. If table.suspend.duration has passed before its tserver has
> returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't
> enter the SUSPENDED state.

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks Mike,

So let me see if I understood this,

Doesn’t matter what this suspend.duration setting is, we will always see unassigned tablets on the monitor page, when tserver goes down.

If the setting is high enough then master is basically assigning the same old tablets to that tserver, when it is back online, and thus will not move any tablets around.

If the duration is the default (0s) or short, then the master will start reassigning tablets to other tservers. And when the original tserver comes back up, the master will try to rebalance tablets (it may not get the old tablets back).

And thus having that setting high enough will make things faster to recover.

Thanks

-S

From: Mike Miller <mm...@apache.org>
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Some things to keep in mind... The Master will wait the table.suspend.duration before reassigning the SUSPENDED tablets to new tservers. With the table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if its tserver is shut down. It will then stay SUSPENDED until its old tserver is available or table.suspend.duration has passed. If table.suspend.duration has passed before its tserver has returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't enter the SUSPENDED state.

Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by Shailesh Ligade <SL...@FBI.GOV>.

Thanks,

I will leave it alone. The replication messages are chatty, so I was not able to see how table.suspend.duration works in a cluster that is the primary for replication.

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Monday, December 27, 2021 9:35 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Can you specify the messages?  It may be that replication is working as designed.  The current replication is based on the WALs, so it would seem normal that if the WAL is closed when the tserver stops, that would then trigger replication; it might just be expected activity.  The messages might look scary (unexpected file closed, improperly closed file, …), which would be more of a concern if they were happening in a stable system (and not associated with something like a tserver dying for some reason).



Do I need to turn off replication while I am rolling restart?



First, are you detecting errors / missing data in the replication destination? If not, then you might just want to leave it alone.



If you wanted to stop replication, you may need to stop ingest and then take steps so that data that is pending for replication is also sent before proceeding. I do not know if replication flushes changes when it is stopped, or if it would pick back up where it left off on the restart.  If it does not, then any data that was “pending replication” could be lost.



Ed



From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Monday, December 27, 2021 8:45 AM
To: user@accumulo.apache.org
Subject: RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



Thanks,



Just a quick question. The steps identified worked. However, I noticed that if replication is turned on, and I set table.suspend.duration=5m and stop and reboot a tserver, I get a lot of replication messages in the master log. Since ingest is turned off, I thought I would not see much replication. Do I need to turn off replication while I am doing the rolling restart? Will it have any adverse effects?



-S



From: Mike Miller <mm...@apache.org>
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



Some things to keep in mind... The Master will wait the table.suspend.duration before reassigning the SUSPENDED tablets to new tservers. With the table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if its tserver is shut down. It will then stay SUSPENDED until its old tserver is available or table.suspend.duration has passed. If table.suspend.duration has passed before its tserver has returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't enter the SUSPENDED state.



On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:

Thanks Mike,



If I set the value to 0s (default) or if set it to 5m, when I restart tserver system (it is pretty quick in the order of second), I still get unassigned tablets on monitor page. My understand is that with that setting of 5m (or 200s etc), master will wait for that mush time before start moving unassigned tablets. In my situation, unassigned tablet counts goes back to zero after long time, and hence rolling restarts take lot longer (hours in most cases – depends on how many tablets/tserver)



This setting appears to be working on accumulo 2.0.1, but since that is not my prod version I have not tested it completely.



Thanks



-S

From: Mike Miller <mm...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



When you say "since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0" do you mean that the feature is not helping to solve your problem? Or that the feature is not working and there could be a bug?



On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:

Thanks for the detailed steps! Really appreciated.



Just curious, since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0, can I just stop both the masters and then restart the tservers one at a time (or all at once)? Will that speed up the restart without getting into this offline tablet situation and/or a data loss situation? I can stop the ingest, flush the tables and then bring down the master…



We can take a short downtime, and my understanding is that the master is the one keeping track of the tservers and the offline tablets. So just curious…



Thanks again



-S



From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



I believe the property is table.suspend.duration (not tablet.suspended.duration as you have in this email) – but the shell should have thrown an error saying the property cannot be set in zookeeper if you had it wrong.



What do you mean by:



but when i issued restart tserver (one at a time without waiting for first to come up)



I’m assuming the requirement is to keep the cluster up and serving users without major disruption – not to rip through the restart as fast as possible.  With 6 – 8 nodes you should still be able to do this in under an hour.  If you had a much larger cluster then the concept is the same but you would want to use some number of tservers that is a fraction of the total available that would be cycled at any given point in time.



In general the way that I would do a conservative, rolling restart:



  1.  [optional] pause ingest – or be prepared for recovering any failed ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option – this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)
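
The core of the procedure above (setting the property, cycling the tservers, resetting the property) can be sketched roughly as follows. This is only an illustration, not an Accumulo tool: the function rolling_restart, its parameters, and the default pause lengths are hypothetical, and the three callables stand in for whatever your site uses to set the property and to stop and start a tserver (for example the Accumulo shell and systemd). The total time a tserver is down should stay below the suspend duration.

```python
import time
from typing import Callable, List, Sequence


def rolling_restart(
    tservers: Sequence[str],
    set_suspend_duration: Callable[[str], None],  # hypothetical hook, e.g. wraps "config -s"
    stop_tserver: Callable[[str], None],          # hypothetical hook, e.g. wraps systemd
    start_tserver: Callable[[str], None],         # hypothetical hook, e.g. wraps systemd
    suspend_duration: str = "5m",
    batch_size: int = 1,
    zk_pause_s: float = 30.0,        # assumed value: time for ZooKeeper to notice the loss
    recovery_pause_s: float = 60.0,  # assumed value: time to let recovery settle
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Cycle tservers in small batches and return a log of the actions taken."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    log: List[str] = []
    # Set table.suspend.duration before touching any tserver.
    set_suspend_duration(suspend_duration)
    log.append(f"set table.suspend.duration={suspend_duration}")
    try:
        # One tserver, or a small group, at a time.
        for i in range(0, len(tservers), batch_size):
            batch = tservers[i:i + batch_size]
            for host in batch:
                stop_tserver(host)
                log.append(f"stop {host}")
            sleep(zk_pause_s)
            for host in batch:
                start_tserver(host)
                log.append(f"start {host}")
            sleep(recovery_pause_s)
    finally:
        # Always reset the property to the default, even if a step fails.
        set_suspend_duration("0s")
        log.append("set table.suspend.duration=0s")
    return log
```

The try/finally is a design choice so that the property is put back to 0s even when a stop or start step raises; the optional pause-ingest and flush steps are left out and would run before this function is called.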



If you tail the master / manager debug log you should get a good idea of what is going on – there should be messages showing the tserver leaving and then rejoining and any other activity related to recovery.  With a rolling restart the idea is to keep the cluster up and serving tables – only one (or a few) tservers go offline, for a short duration (generally less than a minute), and between each tserver restart, time is allowed for things to stabilize.





From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart





Uhmm updated the setting tablet.suspended.duration to 5m



config -s tablet.suspended.duration=5m



but when i issued restart tserver (one at a time without waiting for first to come up), i still get all tablets unassigned 🙁 Maybe I need to bring the masters down first?



btw this is for accumulo 1.10.0



am I missing anything?



-S

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



Thanks Michael,



stop cluster using admin stop? The issue is that, since we are using systemd with restart=always, it interferes with any of those stop (stop-all, stop-here etc.) commands/scripts. So either we have to modify the systemd settings or maybe just do a shutdown-the-VM type of operation (I think that is a little brutal)



-S

________________________________

From: Michael Wall <mj...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart



Is there a reason to not just stop the cluster, reset the heap and restart the cluster?  That is simpler.



On Mon, Nov 29, 2021 at 9:37 AM dev1 <de...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect immediately?



Thanks



-S



From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the reassignment while a tserver is restarting – there is a trade-off in that the data is not available, so try to minimize the time the tserver is off-line.



From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say after I updated the tserver heap size. Since we are using systemd, I can issue a restart command on a tserver. This causes all sorts of tablet movements even though accumulo is down for maybe a second. If I wait for the unassigned tablets count to become 0 before restarting the next tserver, then completely restarting a small cluster (6-8 nodes) takes hours (roughly 4k+ tablets per tserver)



What may be the right way to perform such a routine maintenance operation? Is there a delay setting we can change so that it will not move tablets around? What may be a safe delay value?



-S

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by dev1 <de...@etcoleman.com>.

Can you specify the messages?  It may be that replication is working as designed.  The current replication is based on the WALs – it would seem normal that, if the WAL is closed when the tserver stops, it would then trigger replication, so it might just be expected activity.  The messages might look scary – unexpected file closed, improperly closed file,… – which would be more of a concern if they were happening in a stable system (and not associated with something like a tserver dying for some reason).

Do I need to turn off replication while I am doing a rolling restart?

First, are you detecting errors / missing data in the replication destination? If not, then you might just want to leave it alone.

If you wanted to stop replication, you may need to stop ingest and then take steps so that data that is pending for replication is also sent before proceeding. I do not know if replication flushes changes when it is stopped, or if it would pick back up where it left off on the restart.  If it does not, then any data that was “pending replication” could be lost.

Ed

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Monday, December 27, 2021 8:45 AM
To: user@accumulo.apache.org
Subject: RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks,

Just a quick question. The steps identified worked. However, I noticed that if replication is turned on, and I set table.suspend.duration=5m and then stop and reboot a tserver, I get a lot of replication messages in the master log. Since ingest is turned off, I thought I would not see much replication. Do I need to turn off replication while I am doing a rolling restart? Will it have any adverse effects?

-S

From: Mike Miller <mm...@apache.org>>
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Some things to keep in mind... The Master will wait the table.suspend.duration before reassigning the SUSPENDED tablets to new tservers. With the table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if its tserver is shut down. It will then stay SUSPENDED until its old tserver is available or table.suspend.duration has passed. If table.suspend.duration has passed before its tserver has returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't enter the SUSPENDED state.

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:
Thanks Mike,

If I set the value to 0s (default) or if I set it to 5m, when I restart the tserver with systemd (it is pretty quick, on the order of a second), I still get unassigned tablets on the monitor page. My understanding is that with a setting of 5m (or 200s etc.), the master will wait that much time before it starts moving unassigned tablets. In my situation, the unassigned tablet count goes back to zero after a long time, and hence rolling restarts take a lot longer (hours in most cases – depends on how many tablets/tserver)

This setting appears to be working on accumulo 2.0.1, but since that is not my prod version I have not tested it completely.

Thanks

-S
From: Mike Miller <mm...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0" do you mean that the feature is not helping to solve your problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious, since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0, can I just stop both the masters and then restart the tservers one at a time (or all at once)? Will that speed up the restart without getting into this offline tablet situation and/or a data loss situation? I can stop the ingest, flush the tables and then bring down the master…

We can take a short downtime, and my understanding is that the master is the one keeping track of the tservers and the offline tablets. So just curious…

Thanks again

-S

From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration as you have in this email) – but the shell should have thrown an error saying the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to come up)

I’m assuming the requirement is to keep the cluster up and serving users without major disruption – not to rip through the restart as fast as possible.  With 6 – 8 nodes you should still be able to do this in under an hour.  If you had a much larger cluster then the concept is the same but you would want to use some number of tservers that is a fraction of the total available that would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:


  1.  [optional] pause ingest – or be prepared for recovering any failed ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option – this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what is going on – there should be messages showing the tserver leaving and then rejoining and any other activity related to recovery.  With a rolling restart the idea is to keep the cluster up and serving tables – only one (or a few) tservers go offline, for a short duration (generally less than a minute), and between each tserver restart, time is allowed for things to stabilize.


From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart


Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when i issued restart tserver (one at a time without waiting for first to come up), i still get all tablets unassigned 🙁 Maybe I need to bring the masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S
________________________________
From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

stop cluster using admin stop? The issue is that, since we are using systemd with restart=always, it interferes with any of those stop (stop-all, stop-here etc.) commands/scripts. So either we have to modify the systemd settings or maybe just do a shutdown-the-VM type of operation (I think that is a little brutal)

-S
________________________________
From: Michael Wall <mj...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 <de...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.



From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart



Thanks,



I am assuming I can set that property using shell and it will take effect immediately?



Thanks



-S



From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart



See https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node – A note on rolling restarts.



There is a property that can be set (table.suspend.duration) that will delay the reassignment while a tserver is restarting – there is a trade-off in that the data is not available, so try to minimize the time the tserver is off-line.



From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: accumulo tserver rolling restart



Hello,



I want to restart all the tservers, say after I updated the tserver heap size. Since we are using systemd, I can issue a restart command on a tserver. This causes all sorts of tablet movements even though accumulo is down for maybe a second. If I wait for the unassigned tablets count to become 0 before restarting the next tserver, then completely restarting a small cluster (6-8 nodes) takes hours (roughly 4k+ tablets per tserver)



What may be the right way to perform such a routine maintenance operation? Is there a delay setting we can change so that it will not move tablets around? What may be a safe delay value?



-S

RE: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks,

Just a quick question. The steps identified worked. However, I noticed that if replication is turned on, and I set table.suspend.duration=5m and then stop and reboot a tserver, I get a lot of replication messages in the master log. Since ingest is turned off, I thought I would not see much replication. Do I need to turn off replication while I am doing a rolling restart? Will it have any adverse effects?

-S

From: Mike Miller <mm...@apache.org>
Sent: Thursday, December 2, 2021 10:39 AM
To: user@accumulo.apache.org
Subject: Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Some things to keep in mind... The Master will wait the table.suspend.duration before reassigning the SUSPENDED tablets to new tservers. With the table.suspend.duration set > 0, a tablet will go from HOSTED to SUSPENDED if its tserver is shut down. It will then stay SUSPENDED until its old tserver is available or table.suspend.duration has passed. If table.suspend.duration has passed before its tserver has returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't enter the SUSPENDED state.

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:
Thanks Mike,

If I set the value to 0s (default) or if I set it to 5m, when I restart the tserver with systemd (it is pretty quick, on the order of a second), I still get unassigned tablets on the monitor page. My understanding is that with a setting of 5m (or 200s etc.), the master will wait that much time before it starts moving unassigned tablets. In my situation, the unassigned tablet count goes back to zero after a long time, and hence rolling restarts take a lot longer (hours in most cases – depends on how many tablets/tserver)

This setting appears to be working on accumulo 2.0.1, but since that is not my prod version I have not tested it completely.

Thanks

-S
From: Mike Miller <mm...@apache.org>>
Sent: Thursday, December 2, 2021 9:38 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

When you say "since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0" do you mean that the feature is not helping to solve your problem? Or that the feature is not working and there could be a bug?

On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] <Li...@bah.com>> wrote:
Thanks for the detailed steps! Really appreciated.

Just curious, since that setting (table.suspend.duration) is not working for me in accumulo 1.10.0, can I just stop both the masters and then restart the tservers one at a time (or all at once)? Will that speed up the restart without getting into this offline tablet situation and/or a data loss situation? I can stop the ingest, flush the tables and then bring down the master…

We can take a short downtime, and my understanding is that the master is the one keeping track of the tservers and the offline tablets. So just curious…

Thanks again

-S

From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 2:56 PM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

I believe the property is table.suspend.duration (not tablet.suspended.duration as you have in this email) – but the shell should have thrown an error saying the property cannot be set in zookeeper if you had it wrong.

What do you mean by:

but when i issued restart tserver (one at a time without waiting for first to come up)

I’m assuming the requirement is to keep the cluster up and serving users without major disruption – not to rip through the restart as fast as possible.  With 6 – 8 nodes you should still be able to do this in under an hour.  If you had a much larger cluster then the concept is the same but you would want to use some number of tservers that is a fraction of the total available that would be cycled at any given point in time.

In general the way that I would do a conservative, rolling restart:

  1.  [optional] pause ingest – or be prepared for recovering any failed ingests if they occur.
  2.  [optional] Flush tables that have continuous ingest using the wait option – this should help minimize recovery.
  3.  Set the table.suspend.duration
  4.  For each tserver – one (or a small group for large cluster) at a time

     *   Stop the tserver
     *   Pause long enough that ZooKeeper recognizes the lost connection
     *   Restart the tserver
     *   Pause to allow for any recovery

  5.  Reset the table.suspend.duration back to 0s (the default)

If you tail the master / manager debug log you should get a good idea of what is going on – there should be messages showing the tserver leaving and then rejoining and any other activity related to recovery.  With a rolling restart the idea is to keep the cluster up and serving tables – only one (or a few) tservers go offline, for a short duration (generally less than a minute), and between each tserver restart, time is allowed for things to stabilize.

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 11:17 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Uhmm updated the setting tablet.suspended.duration to 5m

config -s tablet.suspended.duration=5m

but when i issued restart tserver (one at a time without waiting for first to come up), i still get all tablets unassigned 🙁 Maybe I need to bring the masters down first?

btw this is for accumulo 1.10.0

am I missing anything?

-S
________________________________
From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Monday, November 29, 2021 10:35 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Thanks Michael,

stop cluster using admin stop? The issue is that, since we are using systemd with restart=always, it interferes with any of those stop (stop-all, stop-here etc.) commands/scripts. So either we have to modify the systemd settings or maybe just do a shutdown-the-VM type of operation (I think that is a little brutal)

-S
________________________________
From: Michael Wall <mj...@gmail.com>>
Sent: Monday, November 29, 2021 9:54 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Is there a reason to not just stop the cluster, reset the heap and restart the cluster?  That is simpler.

On Mon, Nov 29, 2021 at 9:37 AM dev1 <de...@etcoleman.com>> wrote:

Yes – and don’t forget to reset it back when you are done.

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:36 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: RE: accumulo tserver rolling restart

Thanks,

I am assuming I can set that property using shell and it will take effect immediately?

Thanks

-S

From: dev1 <de...@etcoleman.com>>
Sent: Monday, November 29, 2021 9:25 AM
To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>' <us...@accumulo.apache.org>>
Subject: [External] RE: accumulo tserver rolling restart

See https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node – A note on rolling restarts.

There is a property that can be set (table.suspend.duration) that will delay the reassignment while a tserver is restarting – there is a trade-off in that the data is not available, so try to minimize the time the tserver is off-line.

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Monday, November 29, 2021 9:19 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: accumulo tserver rolling restart

Hello,

I want to restart all the tservers, say after I updated the tserver heap size. Since we are using systemd, I can issue a restart command on a tserver. This causes all sorts of tablet movements even though accumulo is down for maybe a second. If I wait for the unassigned tablets count to become 0 before restarting the next tserver, then completely restarting a small cluster (6-8 nodes) takes hours (roughly 4k+ tablets per tserver)

What may be the right way to perform such a routine maintenance operation? Is there a delay setting we can change so that it will not move tablets around? What may be a safe delay value?

-S

Re: [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart

Posted by Mike Miller <mm...@apache.org>.

Some things to keep in mind... The Master will wait the
table.suspend.duration before reassigning the SUSPENDED tablets to new
tservers. With the table.suspend.duration set > 0, a tablet will go from
HOSTED to SUSPENDED if its tserver is shut down. It will then stay
SUSPENDED until its old tserver is available or table.suspend.duration has
passed. If table.suspend.duration has passed before its tserver has
returned, it will then be UNASSIGNED. Once a tablet is UNASSIGNED it won't
enter the SUSPENDED state.
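
As a rough illustration of the transitions described above, here is a small Python model. It is a sketch, not Accumulo code: the names TabletState and tablet_states are made up for this example, and it assumes that with the default of 0s a tablet skips SUSPENDED and goes straight to UNASSIGNED.

```python
from enum import Enum
from typing import List


class TabletState(Enum):
    HOSTED = "HOSTED"
    SUSPENDED = "SUSPENDED"
    UNASSIGNED = "UNASSIGNED"


def tablet_states(suspend_duration_s: float, tserver_down_s: float) -> List[TabletState]:
    """Return the states a tablet passes through when its tserver is shut down.

    suspend_duration_s: table.suspend.duration in seconds (0 is the default).
    tserver_down_s: how long the old tserver stays away, in seconds.
    """
    states = [TabletState.HOSTED]
    if suspend_duration_s <= 0:
        # Assumption: no suspension configured, so the tablet is unassigned at once.
        states.append(TabletState.UNASSIGNED)
        return states
    states.append(TabletState.SUSPENDED)
    if tserver_down_s < suspend_duration_s:
        # The old tserver came back in time; the tablet is hosted there again.
        states.append(TabletState.HOSTED)
    else:
        # The duration passed first; the tablet will not re-enter SUSPENDED.
        states.append(TabletState.UNASSIGNED)
    return states


# A 10 s restart with a 5m setting versus a restart that takes 10 minutes.
print([s.name for s in tablet_states(300, 10)])
print([s.name for s in tablet_states(300, 600)])
```

In this model a restart only avoids reassignment when the tserver is back before the duration runs out, which is why the downtime per tserver should be kept well under the configured value.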

On Thu, Dec 2, 2021 at 9:43 AM Ligade, Shailesh [USA] <
Ligade_Shailesh@bah.com> wrote:

> Thanks Mike,
>
>
>
> If I set the value to 0s (default) or if I set it to 5m, when I restart
> the tserver with systemd (it is pretty quick, on the order of a second), I
> still get unassigned tablets on the monitor page. My understanding is that
> with a setting of 5m (or 200s etc.), the master will wait that much time
> before it starts moving unassigned tablets. In my situation, the unassigned
> tablet count goes back to zero after a long time, and hence rolling
> restarts take a lot longer (hours in most cases – depends on how many
> tablets/tserver)
>
>
>
> This setting appears to be working on accumulo 2.0.1, but since that is
> not my prod version I have not tested it completely.
>
>
>
> Thanks
>
>
>
> -S
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* Thursday, December 2, 2021 9:38 AM
> *To:* user@accumulo.apache.org
> *Subject:* [External] Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling
> restart
>
>
>
> When you say "since that setting (table.suspend.duration) is not working
> for me in accumulo 1.10.0" do you mean that the feature is not helping to
> solve your problem? Or that the feature is not working and there could be a
> bug?
>
>
>
> On Thu, Dec 2, 2021 at 8:00 AM Ligade, Shailesh [USA] <
> Ligade_Shailesh@bah.com> wrote:
>
> Thanks for the detailed steps! Really appreciated.
>
>
>
> Just curious, since that setting (table.suspend.duration) is not working
> for me in accumulo 1.10.0, can I just stop both the masters and then
> restart the tservers one at a time (or all at once)? Will that speed up the
> restart without getting into this offline tablet situation and/or a data
> loss situation? I can stop the ingest, flush the tables and then bring down
> the master…
>
>
>
> We can take a short downtime, and my understanding is that the master is
> the one keeping track of the tservers and the offline tablets. So just
> curious…
>
>
>
> Thanks again
>
>
>
> -S
>
>
>
> *From:* dev1 <de...@etcoleman.com>
> *Sent:* Monday, November 29, 2021 2:56 PM
> *To:* 'user@accumulo.apache.org' <us...@accumulo.apache.org>
> *Subject:* [External] RE: [EXTERNAL EMAIL] - Re: accumulo tserver rolling
> restart
>
>
>
> I believe the property is table.suspend.duration (not tablet.suspended.duration
> as you have in this email) – but the shell should have thrown an error
> saying the property cannot be set in zookeeper if you had it wrong.
>
>
>
> What do you mean by:
>
>
>
> *but when i issued restart tserver (one at a time without waiting for
> first to come up)*
>
>
>
> I’m assuming the requirement is to keep the cluster up and serving users
> without major disruption – not to rip through the restart as fast as
> possible.  With 6 – 8 nodes you should still be able to do this in under an
> hour.  If you had a much larger cluster then the concept is the same but
> you would want to use some number of tservers that is a fraction of the
> total available that would be cycled at any given point in time.
>
>
>
> In general the way that I would do a conservative, rolling restart:
>
>
>
>    1. [optional] pause ingest – or be prepared for recovering any failed
>    ingests if they occur.
>    2. [optional] Flush tables that have continuous ingest using the wait
>    option – this should help minimize recovery.
>    3. Set the table.suspend.duration
>    4. For each tserver – one (or a small group for large cluster) at a
>    time
>       1. Stop the tserver
>       2. Pause long enough that ZooKeeper recognizes the lost connection
>       3. Restart the tserver
>       4. Pause to allow for any recovery
>    5. Reset the table.suspend.duration back to 0s (the default)
>
>
>
> If you tail the master / manager debug log you should get a good idea of
> what is going on – there should be messages showing the tserver leaving and
> then rejoining and any other activity related to recovery.  With a rolling
> restart the idea is to keep the cluster up and serving tables – only one
> (or a few) tservers go offline and for a short duration (generally less than
> a minute) and between each tserver restart, time is allowed for things to
> stabilize.
>
>
>
>
>
> *From:* Shailesh Ligade <SL...@FBI.GOV>
> *Sent:* Monday, November 29, 2021 11:17 AM
> *To:* user@accumulo.apache.org
> *Subject:* Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart
>
>
>
>
>
> Uhmm updated the setting tablet.suspended.duration to 5m
>
>
>
> config -s tablet.suspended.duration=5m
>
>
>
> but when I issued restart tserver (one at a time without waiting for the first
> to come up), I still get all tablets unassigned 🙁 Maybe I need to
> bring the masters down first?
>
>
>
> btw this is for accumulo 1.10.0
>
>
>
> am I missing anything?
>
>
>
> -S
> ------------------------------
>
> *From:* Shailesh Ligade <SL...@FBI.GOV>
> *Sent:* Monday, November 29, 2021 10:35 AM
> *To:* user@accumulo.apache.org <us...@accumulo.apache.org>
> *Subject:* Re: [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart
>
>
>
> Thanks Michael,
>
>
>
> Stop the cluster using admin stop? The issue is that, since we are using
> systemd with Restart=always, it interferes with any of those stop
> (stop-all, stop-here, etc.) commands/scripts. So either we have to modify
> the systemd settings or maybe just do a shut-down-the-VM type of operation
> (I think that is a little brutal)
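
One possible way to modify the systemd settings, offered as a sketch rather than something proposed in the thread: systemd does not apply its restart policy to a unit stopped with `systemctl stop`, so stopping through systemd avoids the conflict. Alternatively, a drop-in file can relax the policy to Restart=on-failure. The unit name accumulo-tserver and the helper name below are assumptions, and whether a tserver stopped by the Accumulo scripts counts as a clean exit would need to be checked on the actual system.

```shell
# Hypothetical helper: write a systemd drop-in that changes the restart policy.
# $1 = unit name (assumed here: accumulo-tserver)
# $2 = systemd config root (defaults to /etc/systemd/system; a scratch
#      directory can be passed for a trial run)
write_restart_dropin() {
  unit="$1"
  root="${2:-/etc/systemd/system}"
  dir="$root/$unit.service.d"
  mkdir -p "$dir" || return 1
  # Restart only after an unclean exit instead of after every exit.
  printf '[Service]\nRestart=on-failure\n' > "$dir/override.conf"
  # Report the path of the file that was written.
  echo "$dir/override.conf"
}
```

After writing the drop-in under the real systemd directory (as root), `systemctl daemon-reload` is needed for the change to take effect.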
>
>
>
> -S
> ------------------------------
>
> *From:* Michael Wall <mj...@gmail.com>
> *Sent:* Monday, November 29, 2021 9:54 AM
> *To:* user@accumulo.apache.org <us...@accumulo.apache.org>
> *Subject:* [EXTERNAL EMAIL] - Re: accumulo tserver rolling restart
>
>
>
> Is there a reason to not just stop the cluster, reset the heap and restart
> the cluster?  That is simpler.
>
>
>
> On Mon, Nov 29, 2021 at 9:37 AM dev1 <de...@etcoleman.com> wrote:
>
> Yes – and don’t forget to reset it back when you are done.
>
>
>
> *From:* Ligade, Shailesh [USA] <Li...@bah.com>
> *Sent:* Monday, November 29, 2021 9:36 AM
> *To:* user@accumulo.apache.org
> *Subject:* RE: accumulo tserver rolling restart
>
>
>
> Thanks,
>
>
>
> I am assuming I can set that property using shell and it will take effect
> immediately?
>
>
>
> Thanks
>
>
>
> -S
>
>
>
> *From:* dev1 <de...@etcoleman.com>
> *Sent:* Monday, November 29, 2021 9:25 AM
> *To:* 'user@accumulo.apache.org' <us...@accumulo.apache.org>
> *Subject:* [External] RE: accumulo tserver rolling restart
>
>
>
> See
> https://accumulo.apache.org/1.10/accumulo_user_manual.html#_restarting_process_on_a_node
> – A note on rolling restarts.
>
>
>
> There is a property that can be set (table.suspend.duration) that will delay
> the reassignment while a tserver is restarting – there is a trade-off on
> the data not being available so try to minimize the time the tserver is
> off-line.
>
>
>
> *From:* Ligade, Shailesh [USA] <Li...@bah.com>
> *Sent:* Monday, November 29, 2021 9:19 AM
> *To:* user@accumulo.apache.org
> *Subject:* accumulo tserver rolling restart
>
>
>
> Hello,
>
>
>
> I want to restart all the tservers, say after updating the tserver heap size.
> Since we are using systemd, I can issue a restart command on a tserver. This
> causes all sorts of tablet movements even though Accumulo is down for maybe
> a second. If I wait for the unassigned tablets to drop to 0 before
> restarting the next tserver, then completely restarting a small cluster (6-8
> nodes) takes hours (roughly 4k+ tablets per tserver)
>
>
>
> What would be the right way to perform such a routine maintenance operation? Is
> there a delay setting we can change so that it will not move tablets
> around? What would be a safe delay value?
>
>
>
> -S
>
>