You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by "cshannon (via GitHub)" <gi...@apache.org> on 2023/04/29 16:44:13 UTC

[GitHub] [accumulo] cshannon opened a new pull request, #3366: Handle exceptions during tablet metadata task

cshannon opened a new pull request, #3366:
URL: https://github.com/apache/accumulo/pull/3366

   Make sure to catch and log any errors when the tablet metadata verification task runs to check tablet metadata so that errors do not cause the task and server to halt.
   
   This closes #3346


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3366: Handle exceptions during tablet metadata task

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#discussion_r1181565688


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java:
##########
@@ -841,7 +841,6 @@ public void run() {
         try (TabletsMetadata tabletsMetadata =

Review Comment:
   I believe this will absolutely fix the problem.  I am wondering if changing the "watchCriticalFixedDelay" on line 826 can simply be changed to "watchNonCriticalScheduledTask" instead ?



##########
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java:
##########
@@ -841,7 +841,6 @@ public void run() {
         try (TabletsMetadata tabletsMetadata =

Review Comment:
   I believe this will absolutely fix the problem.  I am wondering if changing the "watchCriticalFixedDelay" on line 826 can simply be changed to "watchNonCriticalScheduledTask" instead ?  Maybe a combination of this and that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1530107739

   Note that the validation was not throwing an exception and hence was not actually halting the server.  It was halting because the tserver had a failure when attempting to scan the metadata.  Hence catching Exception is OK here (or moving it to the non-critical thread pool) UNLESS you want to have the validation halt the server.  In any case a scan failure should NOT halt the server.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] EdColeman commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531814869

   I had been thinking that a metric for the elapsed time in this check would make a good metric - it might be a good proxy for health over time as well as possibly being useful to compare across servers.  But with the upcoming changes I thought it may be better to hold off until those settle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] cshannon commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1533341127

   After some discussion we decided to keep it as a critical task as it already was but to just keep the try/catch change so we don't fall over for normal exceptions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531615672

   Also, if there are any bugs in the check itself we would not want that to kill tservers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] ivakegg commented on a diff in pull request #3366: Handle exceptions during tablet metadata task

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on code in PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#discussion_r1181565688


##########
server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java:
##########
@@ -841,7 +841,6 @@ public void run() {
         try (TabletsMetadata tabletsMetadata =

Review Comment:
   I believe this will absolutely fix the problem.  I am wondering if changing the "watchCriticalFixedDelay" on line 826 can simply be changed to "watchNonCriticalScheduledTask" ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] EdColeman commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1529943361

   I am not sure that just logging the error and then waiting for the next check is appropriate, and a retry and fail after X attempts may be a better approach.
   
   For the case that triggered this, it seems the error is transient, but what if it was not?  I think the check was to validate that the tserver can read the entire metadata table - if it cannot, it may not be "safe" to continue to run with incomplete metadata information.  Even if those operations "fail" because the metadata cannot be known to be consistent, if we allow the tserver to keep running, then it seems like work would be continued to be assigned / attempted even though it will not work with the current tserver state.
   
   This would be similar the the issue where tservers could not host tablets, but the manager keeps seeing that tserver as having 0 tablets and kept trying to make assignments to that "under-utilized" tserver. (the solution was to detect and remove the tserver lock to stop assignment attempts.
   
   This change shifts the priority to "keeping the tserver up" rather than killing the process is it cannot fill a basic requirement of being able to maintain a consistent view of the metadata.  A retry may provide a balance of having a better chance to ride out transient errors, but still provide a hard fail if it cannot.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] cshannon commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1532804340

   I moved the check to a non critical thread based on the feedback here. Is there anything else to do here or is this good to merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] EdColeman commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531553674

   My point is that the entire check seems to be trying to validate that the tserver can read and has a consistent view of the metadata tablet.   If this check cannot complete then I think the intention was to "hard" fail the tserver so that it was not trying to run with possibly incomplete / incorrect view of the metadata.
   
   Because it seems a transient (but somehow repeating) failure is stopping the tserver - a more conservative change would be to retry the scan (or the entire check) to guard against transient failures - but if the check cannot complete within a reasonable period, then it would be safer to continue to stop the process.
   
   Changing the check so that it only logs the exception seems that it could lead to cases where the tserver would continue to run when the intention is that it would have been halted.
   
   We may want to add resiliency to the check, but the underlying issue seems to be something in the environment is repeatably causing a scan failure while reading the metadata - nurffing the check to accommodate that case my just be hiding an underlying issue that needs to be corrected.  This change seems that it could allow other issues to also be hidden.
   
   Again, I am addressing the overall goal of the check - not the issue that a transient scan failure seems to be triggering more aggressively that necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531807723

   > Are there things that we should be checking for a tserver to validate that it meets some minimal level of healthy?
   
   We also have some sort of check for stuck compactions, I think it just logs something also.  Does it make sense to emit metrics for these things in addition a log message?  Maybe each server could have a unhealthy count metric.  The impl of the metric could survey known things and add one if they are currently unhealthy.  If a compaction is currently stuck that would contribute to the count when surveyed. If the metadata validation with memory is having problems that would contribute to the counter when surveyed.  If things are unhealthy and then healthy the survey could count nothing and emit zero.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] cshannon commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1530279093

   > Note that the validation was not throwing an exception and hence was not actually halting the server (see Tablet.compareTabletInfo). It was halting because the tserver had a failure when attempting to scan the metadata. Hence catching Exception is OK here (or moving it to the non-critical thread pool) UNLESS you want to have the validation halt the server. In any case a scan failure should NOT halt the server.
   
   Right I wouldn't think we should halt either on scan failure so that's what this fix does, is prevent that. In terms of making it a non critical task I guess that is fine too, if we are catching the exception and continuing I'm not sure it much matters at that point. Either way it seems like catching an exception and logging it there vs killing the task makes sense so it can be retried normally so the task and server don't fall over.
   
   In terms of whether not we retry faster or some other logic on scan failure maybe that's a follow on issue if that is going to be more complex and maybe could be something for 2.1.2. This at least fixes the immediately issue that is a regression in 2.1 for 2.1.1 so the server doesn't fall over on a scan failure which just may be transient. Halting on metadata failure itself (not a scan failure) is another issue entirely of course and maybe that is valid but maybe shouldn't be a bug fix release and 3.0 instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] keith-turner commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "keith-turner (via GitHub)" <gi...@apache.org>.
keith-turner commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531606988

   I think it makes sense for the check not to terminate the tablet server and log an error.  The original intention of the check was to ensure what tablets have in memory matches the metadata table, it was not to test scanning the metadata table.  Not being able to scan the metadata table is not great, but it was not the intent of the check.  So the check can find problems that it was not designed to, that is a good thing.  However the check finding a new unforeseen problem and not giving anyone a chance to work through it (because it nuked all the tservers) is not a good thing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] EdColeman commented on pull request #3366: Handle exceptions during tablet metadata task

Posted by "EdColeman (via GitHub)" <gi...@apache.org>.
EdColeman commented on PR #3366:
URL: https://github.com/apache/accumulo/pull/3366#issuecomment-1531740334

   If I had the wrong impression of the check, then I am okay with the changes - maybe log the error and also move it to a non-critical thread check?
   
   Are there things that we *should* be checking for a tserver to validate that it meets some minimal level of healthy?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] cshannon merged pull request #3366: Handle exceptions during tablet metadata task

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon merged PR #3366:
URL: https://github.com/apache/accumulo/pull/3366


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org