Posted to notifications@accumulo.apache.org by "ivakegg (via GitHub)" <gi...@apache.org> on 2023/04/26 17:30:00 UTC

[GitHub] [accumulo] ivakegg opened a new issue, #3346: remote scan exception halts tserver

ivakegg opened a new issue, #3346:
URL: https://github.com/apache/accumulo/issues/3346

   We are seeing cases where a scan on a remote tserver (possibly for data from accumulo.root) fails, and this results in the local tserver being halted.  This happened on 3 tservers at the same time, each running a scan against the same remote tserver.  The exception looks something like this:
   ```
   org.apache.accumulo.core.util.threads.ThreadPools$ExecutionError: Critical scheduled background task failed.
         at org.apache.accumulo.core.util.threads.ThreadPools.checkTaskFailed(ThreadPools.java:139)
   ...
   Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.accumulo.core.clientImpl.AccumuloServerException: Error on server RemoteHost:9997
   ...
   Caused by: java.lang.RuntimeException: org.apache.accumulo.core.clientImpl.AccumuloServerException: Error on server RemoteHost:9997
        at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator.hasNext(TabletServerBatchReaderIterator.java:194)
   ...
   Caused by: org.apache.accumulo.core.clientImpl.AccumuloServerException: Error on server RemoteHost:9997
       at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:911)
   ...
   ```
   I do not see any exceptions in the RemoteHost logs that correlate with these failures.  I see the accumulo.audit INFO message "operation: Permitted; user: <user>; client: <ip>:<port>; action: authenticate"  but I do not see any exceptions or other messages.
   
   openjdk version "11.0.15" 2022-04-19 LTS (Corretto)
   CentOS 7.3.1611
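   
   The `ExecutionException` in the cause chain above is what plain `java.util.concurrent` reports when a scheduled task throws: the task's future completes exceptionally and `get()` wraps the original exception. The sketch below reproduces only that generic JDK behavior. The class and method names are hypothetical and it is not Accumulo's `ThreadPools` code.
   
   ```java
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.ScheduledFuture;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;

   // Hypothetical illustration; not part of Accumulo.
   class FailedTaskSketch {

     // Schedules a periodic task that throws on its first run and returns
     // whatever ScheduledFuture.get() reports for it.
     static Throwable failureSeenByWatcher() {
       ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
       try {
         ScheduledFuture<?> future = pool.scheduleWithFixedDelay(() -> {
           // Stand-in for the remote scan failing inside the task.
           throw new RuntimeException("Error on server RemoteHost:9997 (simulated)");
         }, 0, 10, TimeUnit.MILLISECONDS);
         // A periodic future only completes by failing or being cancelled.
         future.get(5, TimeUnit.SECONDS);
         return null;
       } catch (ExecutionException | InterruptedException | TimeoutException e) {
         return e;
       } finally {
         pool.shutdownNow();
       }
     }

     public static void main(String[] args) {
       Throwable t = failureSeenByWatcher();
       System.out.println("watcher saw: " + t);
       System.out.println("caused by:   " + (t == null ? null : t.getCause()));
     }
   }
   ```
   
   Any code that polls such a future and treats a failure as fatal will see the task's own `RuntimeException` as the cause, which matches the shape of the trace.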
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [accumulo] cshannon closed issue #3346: remote scan exception halts tserver

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon closed issue #3346: remote scan exception halts tserver
URL: https://github.com/apache/accumulo/issues/3346




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1525636031

   This same situation happened 5 times last night.  This time a different RemoteHost was involved, one that was not hosting accumulo.root.  The only thing I saw in this case were "Slow sync cost" messages on the RemoteHost around the time the hosts fell over.  I am guessing the tservers must have been doing accumulo.metadata scans, since accumulo.root was not hosted there.




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1523816632

   This type of exception happens frequently on the RemoteHost, and only one of them may have been correlated with one of the tservers halting (occurred 17 seconds prior):
   ```
   [scan.NextBatchTask] WARN : exception while scanning tablet !0;... for <ip>:<port>
   java.lang.IllegalStateException: Tried to use scanner after it was closed.
       at org.apache.accumulo.tserver.tablet.Scanner.read(Scanner.java:87)
       at org.apache.accumulo.tserver.scan.NextBatchTask.run(NextBatchTask.java:78)
       at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:62)
       ...  
   ```




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1525667597

   Now I think I am indeed seeing some exceptions in the RemoteHost logs; however, they appear in the RemoteHost log many seconds or even a few minutes after the tservers received the exception via thrift.  They are all EOFExceptions while reading an rfile (potential hdfs issues).  So I am no longer concerned about whether the exceptions are being logged.  However, I am concerned that this is toppling tservers when they could do some sort of retry or exponential backoff.  If that is already the case and we are missing configuration, then please let me know.
   




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1523888328

   What's probably more disturbing than anything else is that I am not seeing the exceptions in the RemoteHost logs.




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1526189043

   It appears that I did not add enough of the stack trace.  The critical task that failed is the lambda in TabletServer at line 826.




[GitHub] [accumulo] ctubbsii commented on issue #3346: remote scan exception halts tserver

Posted by "ctubbsii (via GitHub)" <gi...@apache.org>.
ctubbsii commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1523811246

   What Accumulo version?




[GitHub] [accumulo] ivakegg commented on issue #3346: remote scan exception halts tserver

Posted by "ivakegg (via GitHub)" <gi...@apache.org>.
ivakegg commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1523817528

   accumulo 2.1.1-SNAPSHOT




[GitHub] [accumulo] cshannon commented on issue #3346: remote scan exception halts tserver

Posted by "cshannon (via GitHub)" <gi...@apache.org>.
cshannon commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1528823989

   It looks like the original version of this was first merged in #2320 to fix #2301.  There were a couple more modifications in #2524 and #2583.
   
   So there is of course the question of why the scanner is closed (as @ivakegg said, possibly HDFS issues reading RFiles), but at the very least we need to catch exceptions here because, as shown, uncaught runtime exceptions will bubble up and kill the task and the server.
   
   I think that simply catching exceptions and logging an error is probably fine here, and I don't think we need to do anything else: if it's something like the HDFS issues alluded to with the scans, then we really can't handle that other than to catch the errors and not fall over. I also don't think we need to worry about an exponential backoff or anything, and can just let the task retry normally on the next run. The default health check period is every 30 minutes, so it's certainly not a rapid check that needs to be backed off (at least unless someone decided to speed it up with the property [here](https://github.com/apache/accumulo/blob/ba472d6e24daa8f0014a22cabace3061f5d46413/core/src/main/java/org/apache/accumulo/core/conf/Property.java#L625)).
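   
   A minimal sketch of the catch-and-log idea, for illustration only. It uses plain `java.util.concurrent` with hypothetical names (`GuardedTaskSketch`, `guarded`, `runFailingCheck`) and a 20 ms delay standing in for the real check period; it is not the actual TabletServer code. It relies on the documented `ScheduledExecutorService` behavior that once a periodic task throws, its later executions are suppressed.
   
   ```java
   import java.util.concurrent.Executors;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical illustration; not part of Accumulo.
   class GuardedTaskSketch {

     // Wraps a periodic check so a runtime failure is logged instead of
     // escaping and completing the scheduled future exceptionally.
     static Runnable guarded(Runnable check) {
       return () -> {
         try {
           check.run();
         } catch (RuntimeException e) {
           // Log and move on; the next scheduled run acts as the retry.
           System.err.println("Periodic check failed, will retry next run: " + e);
         }
       };
     }

     // Schedules a check that always fails and returns how many times it ran.
     static int runFailingCheck(boolean guard, long runMillis) {
       AtomicInteger attempts = new AtomicInteger();
       Runnable failing = () -> {
         attempts.incrementAndGet();
         throw new IllegalStateException("simulated remote scan failure");
       };
       ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
       try {
         pool.scheduleWithFixedDelay(guard ? guarded(failing) : failing, 0, 20, TimeUnit.MILLISECONDS);
         Thread.sleep(runMillis);
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
       } finally {
         pool.shutdownNow();
       }
       return attempts.get();
     }

     public static void main(String[] args) {
       // Unguarded: the first exception suppresses all later executions.
       System.out.println("unguarded attempts: " + runFailingCheck(false, 300));
       // Guarded: the check keeps being rescheduled despite failing each time.
       System.out.println("guarded attempts:   " + runFailingCheck(true, 300));
     }
   }
   ```
   
   With the guard in place the task keeps running on its normal schedule, so no separate backoff logic is involved; the schedule itself provides the retry.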




[GitHub] [accumulo] ddanielr commented on issue #3346: remote scan exception halts tserver

Posted by "ddanielr (via GitHub)" <gi...@apache.org>.
ddanielr commented on issue #3346:
URL: https://github.com/apache/accumulo/issues/3346#issuecomment-1526368453

   > It appears that I did not add enough of the stack trace. The critical task that failed is the lambda in TabletServer at line 826
   
   So this is the code in question?
   https://github.com/apache/accumulo/blob/74009a505667c5a181d74e84cdc6d190d6cda6a9/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L826-L858
   
   Is there any more relevant information in the stack trace after the critical task failed in TabletServer:826?
   
   Specifically anything related to TabletsMetadata?

