Posted to dev@hive.apache.org by "manoj (Jira)" <ji...@apache.org> on 2020/07/01 14:07:01 UTC
[jira] [Created] (HIVE-23792) [LLAP] Long continuous running job
degrade performance of LLAP because of leaked shuffle manager threads
manoj created HIVE-23792:
----------------------------
Summary: [LLAP] Long continuous running job degrade performance of LLAP because of leaked shuffle manager threads
Key: HIVE-23792
URL: https://issues.apache.org/jira/browse/HIVE-23792
Project: Hive
Issue Type: Bug
Components: llap, Query Processor, Tez
Affects Versions: 3.1.0
Environment: Ubuntu 18.04
Hadoop 3.1.1
Tez: 0.9.1
Hive: 3.1.0
JDK: 1.8
Reporter: manoj
Attachments: Screenshot from 2020-07-01 17-43-57.png, t3.dump, tdump.pdf
*[Test Case/Reproduction]*
Run TPC-H Q19 against a 10 GB dataset in an infinite loop, with result caching disabled.
*[Observation]*
On the LLAP server, the number of threads increases continuously. Queries keep running, but performance degrades over time.
*[Analysis]*
I took multiple thread dumps at different intervals to determine which category of threads is causing this issue. The culprit is the *tez-shuffle manager* thread.
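For anyone repeating this comparison, here is a minimal sketch of bucketing the threads in a dump by name so that the growing category stands out between two dumps. It assumes a jstack-style text dump in which each thread header starts with the quoted thread name; the class name ThreadDumpSummary and the "family" normalization (digits replaced by #) are made up for illustration and are not part of Hive or Tez.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive or Tez.
public class ThreadDumpSummary {
    // Assumption: jstack-style headers, e.g. "some-thread-12" #34 daemon prio=5 ...
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\"");

    /** Counts threads per name family, where a family is the name with digit runs replaced by '#'. */
    public static Map<String, Integer> countByFamily(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                String family = m.group(1).replaceAll("\\d+", "#");
                counts.merge(family, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ThreadDumpSummary <dump-file>");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        // Print one line per family: count, then family name.
        countByFamily(dump).forEach((family, n) -> System.out.println(n + "\t" + family));
    }
}
```

Running it against dumps taken some minutes apart and comparing the counts shows which family keeps growing.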
.m2/org/apache/tez/tez-runtime-library/0.9.1/tez-runtime-library-0.9.1-sources.jar!/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java:324
{quote}try {
  while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
      && numCompletedInputs.get() < numInputs) {
    inputContext.notifyProgress();
    boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
  }
} finally {
  lock.unlock();
}{quote}
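To illustrate the general pattern, below is a minimal standalone sketch of a timed-wait loop of the same shape. It is not the Tez implementation: the class TimedWaitLoopSketch, its fields, and the shutdown flag are all hypothetical. It only shows that a thread parked in such a loop stays alive until its exit condition becomes true, so a loop of this kind needs some condition (here an assumed shutdown flag) that is set when the owner goes away. Whether the leaked threads in this report fail to exit for that reason is not established here.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; names do not come from Tez.
public class TimedWaitLoopSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition wakeLoop = lock.newCondition();
    private final AtomicBoolean shutdown = new AtomicBoolean(false);
    private final AtomicInteger completed = new AtomicInteger(0);
    private final int expected;

    public TimedWaitLoopSketch(int expected) {
        this.expected = expected;
    }

    /**
     * Waits until all expected inputs have completed or shutdown() is called.
     * Returns true only if all inputs completed; false on shutdown or interrupt.
     */
    public boolean awaitWork(long pollMillis) {
        lock.lock();
        try {
            // Without the shutdown check, this loop would spin on the timed
            // wait for as long as completed < expected.
            while (!shutdown.get() && completed.get() < expected) {
                wakeLoop.await(pollMillis, TimeUnit.MILLISECONDS);
            }
            return completed.get() >= expected;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        } finally {
            lock.unlock();
        }
    }

    public void inputCompleted() {
        completed.incrementAndGet();
        signal();
    }

    public void shutdown() {
        shutdown.set(true);
        signal();
    }

    // Wakes any waiter so it re-checks the loop condition immediately.
    private void signal() {
        lock.lock();
        try {
            wakeLoop.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch, a thread blocked in awaitWork returns as soon as either inputCompleted has been called the expected number of times or shutdown is called; with neither, it remains parked and is counted in every subsequent thread dump.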
*[Stack Trace of culprit thread]*
{quote}threadId:Thread 16661 - state:BLOCKED
stackTrace:
- sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
- java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
- java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long, java.util.concurrent.TimeUnit) @bci=97, line=2163 (Compiled frame)
- org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=125, line=327 (Compiled frame)
- org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$RunShuffleCallable.callInternal() @bci=1, line=311 (Compiled frame)
- org.apache.tez.common.CallableWithNdc.call() @bci=8, line=36 (Compiled frame)
- com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly() @bci=18, line=108 (Compiled frame)
- com.google.common.util.concurrent.InterruptibleTask.run() @bci=16, line=41 (Compiled frame)
- com.google.common.util.concurrent.TrustedListenableFutureTask.run() @bci=10, line=77 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1149 (Compiled frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=748 (Compiled frame){quote}