Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/12/20 18:09:43 UTC

[GitHub] [accumulo] ddanielr opened a new issue, #3138: Add shell command for marking tservers as "dead"

ddanielr opened a new issue, #3138:
URL: https://github.com/apache/accumulo/issues/3138

   **Is your feature request related to a problem? Please describe.**
   
   ***Problem***
   There is an operational situation where a tserver cannot be terminated locally (via either ssh or physical access) and is causing issues in accumulo. 
   The tserver needs to be marked as "dead", but accumulo will not automatically change the tserver state because operational health checks are still passing.
   
   ***Current workaround***
   In situations where an admin cannot ssh into or remotely terminate a failed node running tservers, an admin needs to be able to mark the tservers on that node as "dead" so accumulo starts reassigning tablets to other tservers. 
    
   Currently this has been handled by getting the active zlock for that specific tserver from zookeeper and then running a delete zlock command via the zookeeper shell. 
   
   This seems risky and complicated for what should be a simple operation. 
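
The workaround described above could be sketched roughly as follows. This is a minimal illustration, not Accumulo tooling: the `/accumulo/<instance-id>/tservers/<host:port>` layout, the `zlock` child-name prefix, and both helper names are assumptions. `zk` is any client object that exposes `get_children(path)` and `delete(path)` (the kazoo `KazooClient` has methods with these names). The sketch defaults to a dry run, so nothing is deleted unless explicitly requested.

```python
import re

# Assumed layout: /accumulo/<instance-id>/tservers/<host:port>/zlock...<sequence>
LOCK_CHILD = re.compile(r"^zlock.*?(\d+)$")


def tserver_lock_path(zk, instance_id, tserver):
    """Return the path of the lock node held for `tserver`, or None."""
    base = f"/accumulo/{instance_id}/tservers/{tserver}"
    candidates = []
    for child in zk.get_children(base):
        match = LOCK_CHILD.match(child)
        if match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Standard ZooKeeper lock recipe (assumed to apply here):
    # the child with the lowest sequence number holds the lock.
    _, holder = min(candidates)
    return f"{base}/{holder}"


def release_tserver_lock(zk, instance_id, tserver, dry_run=True):
    """Delete the lock node unless dry_run is set; return the path acted on."""
    path = tserver_lock_path(zk, instance_id, tserver)
    if path is not None and not dry_run:
        zk.delete(path)
    return path
```

Calling `release_tserver_lock(zk, instance_id, "host:9997")` only reports the node that would be removed; passing `dry_run=False` performs the delete.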
   
   **Describe the solution you'd like**
   As an Admin, I would like the ability to mark tservers as "dead" via the accumulo shell without requiring ssh access to the specific tservers. 
   
   I should be able to do this via a manual accumulo shell command, either on a per-tserver basis or via a tserver regex that matches all tservers on a specific node.
   This command should prompt the Admin with a confirmation action and a preview list of the tservers to mark "dead". 
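
The preview-and-confirm flow requested above might look something like the sketch below. It is an illustration of the requested behavior only, not an existing accumulo shell command; the function names, the prompt wording, and the `host:port` address format are all assumptions.

```python
import re


def preview_matches(tservers, pattern):
    """Return the sorted tserver addresses fully matched by `pattern`."""
    regex = re.compile(pattern)
    return sorted(t for t in tservers if regex.fullmatch(t))


def confirm_mark_dead(tservers, pattern, ask=input):
    """Show the preview list and return it only if the admin answers 'yes'."""
    matches = preview_matches(tservers, pattern)
    if not matches:
        print("No tservers match", repr(pattern))
        return []
    print("The following tservers would be marked dead:")
    for address in matches:
        print("  " + address)
    answer = ask(f"Mark {len(matches)} tserver(s) as dead? (yes/no): ")
    # Anything other than an explicit "yes" is treated as a refusal.
    return matches if answer.strip().lower() == "yes" else []
```

For example, `confirm_mark_dead(all_tservers, r"node1:\d+")` would list every tserver on the hypothetical host `node1` and return that list only after the admin types `yes`.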
   
   **Describe alternatives you've considered**
   An alternative solution would be improving accumulo's ability to detect issues with the tservers.
   However, this doesn't help with node maintenance actions that may require taking tservers offline without ssh access.
   
   **Additional context**
   Add any other context or screenshots about the feature request here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] dlmarion commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1359957846

   In 1.10.x and 2.x there is an Admin StopCommand. I haven't looked deeply, but that might work. Have they tried that?




[GitHub] [accumulo] EdColeman commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1359958127

   Removing the zlock is an effective way of killing the tserver.  This could be implemented independent of the shell (only ZooKeeper access is required) and could be a stand-alone utility or even a bash script.
   
   One issue could be actually defining "dead". Removing the lock will kill the tserver, but something else (as defined by the user) could simply try to restart the process, and that is probably the usual default action, taking flapping and other considerations into account. Preventing external processes from restarting the process would be outside the scope of the shell.




[GitHub] [accumulo] EdColeman commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1410935644

   Another dimension to this might occur if the dead server has just enough functionality to keep the ZooKeeper connection from timing out but is otherwise unable to fully receive / respond to ZooKeeper events.
   
   What would happen is that the ZooKeeper lock is deleted, which should force the tserver to stop hosting its tablets. The manager sees the tablets as unassigned and assigns them to another tserver. If the original tserver does not realize that it should no longer be hosting the tablets, then both the original and the new server are serving the same tablets, which we assume will never happen.
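
The sequence above can be illustrated with a toy model. It is purely a sketch of the reasoning and does not reflect how Accumulo's manager or tservers are implemented; the class, its `sees_zk_events` flag, and the function name are all hypothetical.

```python
class ToyTServer:
    def __init__(self, name):
        self.name = name
        self.hosted = set()
        self.sees_zk_events = True  # False models the "half-dead" server

    def on_lock_lost(self):
        # A healthy server unloads its tablets when it learns the lock is gone.
        if self.sees_zk_events:
            self.hosted.clear()


def delete_lock_and_reassign(old, new):
    """Delete old's lock, then let the manager reassign old's tablets to new."""
    tablets = set(old.hosted)
    old.on_lock_lost()
    new.hosted |= tablets           # the manager treats them as unassigned
    return old.hosted & new.hosted  # tablets now served by both servers
```

With a healthy `old` server the returned set is empty. With `sees_zk_events = False` it contains every tablet the old server was hosting, which is the double-hosting case described above.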
   
   There is an integration test, HalfDeadTServerIT, that tries to test some of this, but I am not sure how much it actually covers. I also recall past attempts to mock / wrap a ZooKeeper client to inject various errors, but I am unsure how far they progressed. 
   
   Most of this may be outside of this issue (if the Fate command is insufficient) - but there may be other issues that should be looked at.




[GitHub] [accumulo] EdColeman commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
EdColeman commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1360129408

   I was thinking specifically of systemd restart, but the same caution may hold for puppet and similar tools. There are different schools of thought. Someone could be "strict" and never automatically restart a node without some verification, while others could decide that restarting has low enough risk that intervention is not required.
   
   Some classes of errors that kill a tserver, such as loss of the ZooKeeper lock or an OOM, can likely be handled by a restart, but they should also be trended so that underlying problems are not hidden because things "seem to work". Repeatedly failing and then restarting a node can cause a lot of tablet migrations and recovery work.
   
   One particularly "fun" class of problems is caused by "bad data": maybe it's an improper row, or it's an iterator configuration issue. For example, a bulk-imported file may have unprocessable row(s) that trigger a failure. Accumulo recovers, the tablet / row migrates, and the cycle repeats...
   
   In terms of this issue, killing the tserver via admin stop command or otherwise removing the ZooKeeper lock will kill the tserver - but that is different from being marked dead.  
   




[GitHub] [accumulo] ddanielr commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
ddanielr commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1360077412

   @EdColeman Correct, there is nothing stopping an external action from starting up the problematic node again.
   What you're describing sounds more like an instance state tracking mechanism akin to the k8s cordon functionality. 
   https://kubernetes.io/docs/concepts/architecture/nodes/#manual-node-administration




[GitHub] [accumulo] ddanielr commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
ddanielr commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1360101411

   @dlmarion Thanks, I'll get a response, but I suspect it hasn't been tried. 
   I'm wondering how likely it is that the FATE created by the `admin stopCommand --force` command will be processed if a large number of FATEs are currently "IN_PROGRESS" due to the problematic node.
   
   In general, how are FATEs processed by the accumulo manager?
   Is it just FIFO with the max number of FATEs "IN_PROGRESS" being determined by threadpool size? Or is there a concept of higher-priority FATEs for "system actions"?




[GitHub] [accumulo] dlmarion commented on issue #3138: Add shell command for marking tservers as "dead"

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #3138:
URL: https://github.com/apache/accumulo/issues/3138#issuecomment-1359961853

   With the Admin StopCommand, if force is set to true, then it does not try to connect to the tserver. It creates a FaTE operation to remove the ZK lock.

