Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/03/29 16:14:56 UTC

[GitHub] [accumulo] keith-ratcliffe opened a new issue #2598: Apparent thread leak causing OOME in tservers

keith-ratcliffe opened a new issue #2598:
URL: https://github.com/apache/accumulo/issues/2598


   **Describe the bug**
   While testing 2.1 in AWS, we've observed a consistent pattern of OOMEs resulting in dead tservers. The OOME occurs relatively quickly when the tservers are under heavy query load, but it still seems to occur under any amount of load given enough time.
   
   The OOMEs consistently present this stack trace:
   ```
   [rpc.CustomNonBlockingServer$CustomFrameBuffer] ERROR: Unexpected throwable while invoking!
   java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
           at java.lang.Thread.start0(Native Method) ~[?:?]
           at java.lang.Thread.start(Thread.java:798) ~[?:?]
           at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
           at java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1583) ~[?:?]
           at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:346) ~[?:?]
           at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]
           at org.apache.accumulo.core.util.threads.ThreadPools$3.schedule(ThreadPools.java:529) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.accumulo.tserver.session.SessionManager.removeIfNotAccessed(SessionManager.java:283) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.accumulo.tserver.ThriftClientHandler.continueMultiScan(ThriftClientHandler.java:581) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.accumulo.tserver.ThriftClientHandler.startMultiScan(ThriftClientHandler.java:532) ~[accumulo-tserver-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) ~[?:?]
           at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
           at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
           at org.apache.accumulo.core.trace.TraceUtil.lambda$wrapService$1(TraceUtil.java:221) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at com.sun.proxy.$Proxy35.startMultiScan(Unknown Source) ~[?:?]
           at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3038) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:3017) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.15.0.jar:0.15.0]
           at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.15.0.jar:0.15.0]
           at org.apache.accumulo.server.rpc.TimedProcessor.process(TimedProcessor.java:54) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.15.0.jar:0.15.0]
           at org.apache.accumulo.server.rpc.CustomNonBlockingServer$CustomFrameBuffer.invoke(CustomNonBlockingServer.java:129) ~[accumulo-server-base-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.15.0.jar:0.15.0]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52) ~[accumulo-core-2.1.0-SNAPSHOT.jar:2.1.0-SNAPSHOT]
           at java.lang.Thread.run(Thread.java:829) ~[?:?]
   ```
   
   **Versions (OS, Maven, Java, and others, as appropriate):**
    - Affected version(s) of this project: 2.1-SNAPSHOT. So far, I've been able to replicate the issue on the following commits: `918bb92` , `2ca070b`, `4b66b96`, `9451dd0`
    - OS: CentOS 7.5
    - Others: Hadoop 3.3.1, ZK 3.5.9, Java 11, Maven 3.6.3
   
   **To Reproduce**
   1. Put Accumulo under reasonably heavy query load, and observe thread counts steadily increasing in tserver JVMs until OOME occurs
   
   **Expected behavior**
   No OOME
   
   **Additional context**
   It appears that `ThriftClientHandler` is throwing many TimeoutExceptions because the hardcoded 1-second timeout is being hit:
   https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L581
   
   Timeout duration is defined here:
   https://github.com/apache/accumulo/blob/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L168
   
   As a result, we seem to get tons of new threads spun up in `SessionManager.removeIfNotAccessed`:
   https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L283
   
   ...and those threads seem to linger in the JVM indefinitely. In jstacks that I've captured on tservers just before they die, there are typically 30,000 or more threads in the JVM when the OOME is about to strike. The time it takes to hit the OOME varies with the query load we're putting on Accumulo, but all our tservers seem to die this way eventually.
   
   FWIW, I'm currently running Accumulo with `ThriftClientHandler.MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS` hardcoded to 60 seconds, and that seems to resolve this issue entirely. I know that's not the ideal solution, though. A configurable timeout would be better, or perhaps there's more going on here than meets the eye.
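
To illustrate the timeout behavior described above, here is a minimal Java sketch of waiting on a `Future` with a bounded timeout. It is not Accumulo's code: the class `ScanTimeoutSketch`, the method `waitForResult`, and the returned strings are hypothetical names chosen for illustration. It only shows that a task slower than the wait takes the `TimeoutException` path, while a longer wait lets the same task return its result.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ScanTimeoutSketch {

  /**
   * Runs a task that takes roughly taskMillis and waits at most timeoutMillis for it.
   * Returns "batch" if the task finished in time, "timeout" if the wait expired.
   * (Hypothetical helper; the real tserver code is more involved.)
   */
  public static String waitForResult(long taskMillis, long timeoutMillis) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      // The lambda returns a value, so it is submitted as a Callable<String>.
      Future<String> future = pool.submit(() -> {
        Thread.sleep(taskMillis);
        return "batch";
      });
      try {
        return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (TimeoutException e) {
        // The task is still running; the caller would have to come back later.
        return "timeout";
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return "interrupted";
      } catch (ExecutionException e) {
        return "error";
      }
    } finally {
      // Stop the worker so the sketch does not leave a thread behind.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(waitForResult(300, 50));   // short wait, slow task
    System.out.println(waitForResult(10, 5000));  // long wait, fast task
  }
}
```

In this sketch, every call that ends in the timeout branch corresponds to one more round trip in which follow-up work would be scheduled, which is consistent with the observation that a longer wait reduces how often that path is taken.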
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-ratcliffe commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
keith-ratcliffe commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082115471


   For additional context:
   I was kind of surprised that we hadn't seen this in our 2.1 testing until recently, given that the ThriftClientHandler code in question is by no means new. Coincidentally, we've also observed a general decline in performance recently on our EC2 VMs, which we can attribute to a variety of reasons having nothing to do with Accumulo.





[GitHub] [accumulo] dlmarion commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082153249


   > I think this should be calling ServerContext.getScheduledExecutor() instead of ThreadPools.getServerThreadPools().createGeneralScheduledExecutorService() so that it adds a Runnable to the shared general ScheduledThreadPoolExecutor instead of creating a new ThreadPoolExecutor.
   
   This is in #2593 





[GitHub] [accumulo] ctubbsii closed issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
ctubbsii closed issue #2598:
URL: https://github.com/apache/accumulo/issues/2598


   





[GitHub] [accumulo] dlmarion commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1083399888


   I don't see a reason why we shouldn't make the timeout configurable.





[GitHub] [accumulo] keith-ratcliffe edited a comment on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
keith-ratcliffe edited a comment on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082097539


   > Do you see the same thing with Hadoop 3.3.0? I had issues with 3.3.1. Might not be related to this, though.
   
   I hadn't tried 3.3.0 yet, but I can if you think that might be helpful





[GitHub] [accumulo] ctubbsii commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082086875


   Do you see the same thing with Hadoop 3.3.0? I had issues with 3.3.1. Might not be related to this, though.





[GitHub] [accumulo] milleruntime commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082107727


   From a quick look at the code, I don't see why that can't be configurable. I am just not sure how/why that timeout is also used to get summaries. I opened a PR to make it configurable at least: https://github.com/apache/accumulo/pull/2599





[GitHub] [accumulo] milleruntime commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1083391728


   Do we still want to make the timeout configurable in https://github.com/apache/accumulo/pull/2599 ? I guess we could wait and see whether @dlmarion can improve the code so that we may not need to configure the timeout?





[GitHub] [accumulo] dlmarion edited a comment on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
dlmarion edited a comment on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082112884


   > As a result, we seem to get tons of new threads spun up in SessionManager.removeIfNotAccessed:
   https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/session/SessionManager.java#L283
   
   I think this should be calling `ServerContext.getScheduledExecutor()` instead of `ThreadPools.getServerThreadPools().createGeneralScheduledExecutorService()` so that it adds a Runnable to the shared general ScheduledThreadPoolExecutor instead of creating a new ThreadPoolExecutor.
   
   Also, you will likely want to increase the timeout for the property that @milleruntime is adding and also increase the number of threads for the shared general ScheduledThreadPoolExecutor (Property.GENERAL_SIMPLETIMER_THREADPOOL_SIZE or general.server.simpletimer.threadpool.size) from the default of 1 to something larger. Instead of 30,000 threads, you will end up with 30,000 runnables on a queue.
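
The difference between creating an executor per call and scheduling onto one shared executor can be sketched in plain Java. This is an illustration only and not Accumulo's code: `SchedulerLeakSketch` and its methods are hypothetical names, and a counting `ThreadFactory` stands in for inspecting a jstack. It assumes a core pool size of 1, matching the default mentioned above for the shared pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulerLeakSketch {

  /** Builds a single-core scheduled pool whose thread factory counts created threads. */
  static ScheduledThreadPoolExecutor countingPool(AtomicInteger created) {
    return new ScheduledThreadPoolExecutor(1, r -> {
      created.incrementAndGet();
      Thread t = new Thread(r, "sketch-timer");
      t.setDaemon(true);
      return t;
    });
  }

  /** Leaky pattern: a brand-new executor for every scheduled task. */
  public static int threadsWithPoolPerCall(int calls) {
    AtomicInteger created = new AtomicInteger();
    List<ScheduledThreadPoolExecutor> pools = new ArrayList<>();
    for (int i = 0; i < calls; i++) {
      ScheduledThreadPoolExecutor pool = countingPool(created);
      pool.schedule(() -> {}, 1, TimeUnit.HOURS);
      // In the leaky pattern nothing shuts the pool down, so its core thread lingers.
      // The reference is kept here only so the sketch can clean up after counting.
      pools.add(pool);
    }
    int count = created.get();
    pools.forEach(ScheduledThreadPoolExecutor::shutdownNow);
    return count;
  }

  /** Shared pattern: every task is queued on the same executor. */
  public static int threadsWithSharedPool(int calls) {
    AtomicInteger created = new AtomicInteger();
    ScheduledThreadPoolExecutor shared = countingPool(created);
    for (int i = 0; i < calls; i++) {
      shared.schedule(() -> {}, 1, TimeUnit.HOURS);
    }
    int count = created.get();
    shared.shutdownNow();
    return count;
  }

  public static void main(String[] args) {
    System.out.println("pool per call: " + threadsWithPoolPerCall(25) + " threads");
    System.out.println("shared pool:   " + threadsWithSharedPool(25) + " threads");
  }
}
```

With the per-call pattern the thread count grows with the number of calls, while the shared pool keeps a single worker and lets the pending tasks accumulate on its queue instead.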





[GitHub] [accumulo] dlmarion commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082108697


   Also, 3.3.2 is out.
   





[GitHub] [accumulo] dlmarion commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
dlmarion commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1082146876


   Looking a little further at this, I think there is a case where we can create a large number of Threads (or Runnables in the suggested fix) for a single scan when ThriftClientHandler.MAX_TIME_TO_WAIT_FOR_SCAN_RESULT_MILLIS is low and the scan does not return any data. When a TimeoutException is raised, `ThriftClientHandler.continueMultiScan` returns a `MultiScanResult` object (https://github.com/apache/accumulo/blame/main/server/tserver/src/main/java/org/apache/accumulo/tserver/ThriftClientHandler.java#L585) with the `more` attribute (last constructor parameter) set to `true`. In TabletServerBatchReaderIterator (https://github.com/apache/accumulo/blob/main/core/src/main/java/org/apache/accumulo/core/clientImpl/TabletServerBatchReaderIterator.java#L721), `continueMultiScan` is called in a loop while `more` is `true`. In this case, `scanResult.results` is an empty collection and `more` is `true`, so the client immediately calls again. I'm wondering if we should back off.
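
One way such a back-off could look is sketched below. This is a hedged illustration, not a proposed patch: `EmptyBatchBackoffSketch`, `Batch`, `nextDelay`, and `drain` are hypothetical names, and `Batch` is only a stand-in for a scan result that carries a result list and a `more` flag. The loop sleeps with a capped, doubling delay whenever a batch is empty but `more` is `true`, and resets the delay once data arrives.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class EmptyBatchBackoffSketch {

  /** Hypothetical stand-in for a scan batch: some results plus a "more" flag. */
  public static final class Batch {
    final List<String> results;
    final boolean more;

    public Batch(List<String> results, boolean more) {
      this.results = results;
      this.more = more;
    }
  }

  /** Starts at initialMillis, then doubles up to maxMillis. */
  public static long nextDelay(long currentMillis, long initialMillis, long maxMillis) {
    if (currentMillis <= 0) {
      return initialMillis;
    }
    return Math.min(currentMillis * 2, maxMillis);
  }

  /**
   * Pulls batches until "more" is false, appending results to out.
   * Returns the delays (in ms) slept after empty batches, for inspection.
   */
  public static List<Long> drain(Supplier<Batch> server, List<String> out,
      long initialMillis, long maxMillis) {
    List<Long> delays = new ArrayList<>();
    long delay = 0;
    Batch batch;
    do {
      batch = server.get();
      if (batch.results.isEmpty() && batch.more) {
        // Nothing came back yet: wait before asking again instead of spinning.
        delay = nextDelay(delay, initialMillis, maxMillis);
        delays.add(delay);
        try {
          Thread.sleep(delay);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return delays;
        }
      } else {
        // Data arrived (or the scan ended): reset the back-off.
        delay = 0;
        out.addAll(batch.results);
      }
    } while (batch.more);
    return delays;
  }

  public static void main(String[] args) {
    Iterator<Batch> fake = List.of(
        new Batch(List.of(), true),
        new Batch(List.of(), true),
        new Batch(List.of("row1"), false)).iterator();
    List<String> out = new ArrayList<>();
    System.out.println(drain(fake::next, out, 1, 8) + " " + out);
  }
}
```

The cap keeps the added latency bounded, and resetting after a non-empty batch means a scan that is actively returning data is not slowed down.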
   
   





[GitHub] [accumulo] ctubbsii commented on issue #2598: Apparent thread leak causing OOME in tservers

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #2598:
URL: https://github.com/apache/accumulo/issues/2598#issuecomment-1083365728


   It looks like this issue is fixed in #2593, so this can be closed. @dlmarion also proposed trying to reduce the number of Runnables added for a single scan, such as by having the client back off. If this is still desired, please create a new issue or PR for that. I'm closing this one, since the immediate issue causing this bug was fixed in #2593.

