Posted to user@drill.apache.org by Lalit Mishra <la...@mojonetworks.com> on 2018/01/23 14:04:42 UTC

Queries getting stuck in RUNNING state occasionally

Hello,

We are using Drill 1.11 (under YARN) on a 3-node cluster.
Occasionally a query remains stuck in the RUNNING state. The same
query has run successfully on multiple other occasions. I did not capture
any information the previous times this occurred, but I have collected the
following for the latest occurrence:

   - Full json profile
   - Thread dumps on all three nodes

I can provide these if needed.

In the thread dumps there are 107 threads tagged with the query ID.
105 of them are stuck with the following stack trace:

2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
    - waiting on <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x4a20ff6e> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
    at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141)
    at org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:406)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:357)
    at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:302)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
    at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@45083904


The other 2 are stuck with:

2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
    - waiting on <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    - locked <0x730eeaf1> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
    at java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
    at org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
    at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.getNext(MergingRecordBatch.java:147)
    at org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.innerNext(MergingRecordBatch.java:241)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
    at org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
    at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
    at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
    at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
    at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
    at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@378527f8
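
For anyone wanting to reproduce the per-query thread counts from a dump, here is a minimal Python sketch. It assumes the dump uses the same header format as the two traces in this message ("<query-id>:frag:<major>:<minor> id=<n> state=<STATE>"). The function name summarize and the choice to group by the innermost frame under org.apache.drill.exec.physical.impl are illustrative assumptions, not part of Drill or of any tool.

```python
import re
from collections import Counter

# Generic thread header in these dumps: "<thread name> id=<n> state=<STATE>".
THREAD_HEADER = re.compile(r"^(?P<name>\S.*?)\s+id=\d+\s+state=(?P<state>\w+)")
# Drill fragment threads are named "<query-id>:frag:<major>:<minor>".
FRAGMENT_NAME = re.compile(
    r"^(?P<query>[0-9a-f-]{36}):frag:(?P<major>\d+):(?P<minor>\d+)$"
)
# One stack frame: "    at package.Class.method(File.java:123)".
FRAME = re.compile(r"^\s+at\s+([^\s(]+)\(")
# Assumption: operator frames live under this package, as in the traces above.
OPERATOR_PREFIX = "org.apache.drill.exec.physical.impl."


def summarize(dump_text, query_id):
    """Count one query's fragment threads by (major fragment, state, innermost operator frame)."""
    # Undo mail-style wrapping that leaves "at" alone on a line, frame on the next.
    text = re.sub(r"(?m)^([ \t]+at)[ \t]*\n[ \t]*", r"\1 ", dump_text)
    counts = Counter()
    current = None  # (major, state) while inside a matching thread with no operator seen yet
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            name = FRAGMENT_NAME.match(header["name"])
            if name and name["query"] == query_id:
                current = (int(name["major"]), header["state"])
            else:
                current = None  # some other thread; ignore its frames
            continue
        frame = FRAME.match(line)
        if current is not None and frame and frame[1].startswith(OPERATOR_PREFIX):
            operator = ".".join(frame[1].split(".")[-2:])  # "Class.method"
            counts[current + (operator,)] += 1
            current = None  # only the innermost operator frame matters here
    return counts
```

Calling summarize(open(path).read(), "2598df8d-8573-5e29-292c-fb343c99d280") on each node's dump and adding the resulting counters gives a per-fragment view of which receiver each thread is parked in.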


Any help in figuring out what is going wrong would be appreciated.
Thanks in advance!

Thanks,
Lalit Mishra

Re: Queries getting stuck in RUNNING state occasionally

Posted by Lalit Mishra <la...@mojonetworks.com>.
Hi Kunal,

The minor fragments of major fragment 6 are distributed across all three
nodes. I'm attaching the thread dumps for all three nodes.



Thanks,
Lalit Mishra

On Fri, Jan 26, 2018 at 12:07 AM, Kunal Khatua <kk...@mapr.com> wrote:

> Hi Lalit
> Your profile hints that it is stuck in the Major Fragment 06-xx-xx, which
> is fed data from 16-xx-xx via 11-Exchange.
>
> Looking at the operators’ overview and the similarity with other major
> fragments, only this one seems to be stuck at completing the sort.
>
> Could you provide the jstack output from any of the nodes that are hosting
> fragments of 06-xx-xx?
>
> Thanks
> Kunal
>
> From: Lalit Mishra [mailto:lalit.mishra@mojonetworks.com]
> Sent: Thursday, January 25, 2018 4:03 AM
> To: user@drill.apache.org
> Subject: Re: Queries getting stuck in RUNNING state occasionally
>
> Hello Timothy,
>
> Please find attached the profile file (it exceeded the message size limit,
> so I had to gzip it). Please excuse the length of the query; it is a long
> query unioned 5 times. I have tried to reproduce the problem with a smaller
> query, but have failed so far.
>
> Yes, we are using MapR 6.0.
>
> Thanks,
> Lalit Mishra
>
> On Thu, Jan 25, 2018 at 2:37 AM, Timothy Farkas <timothyfarkas@apache.org> wrote:
>
>
> Hi Lalit,
>
> The stack traces you provided indicate that downstream operators are
> waiting for data to be sent by upstream operators, which are blocked. This
> could mean that a scan operator is blocked reading from a data source, or
> it could mean that an operator like Sort or HashAgg is getting stuck. Can
> you please provide the query you are using along with the JSON profile?
>
> Also please note that Apache Drill does not have YARN support yet; the PR
> is pending here: https://github.com/apache/drill/pull/1011
> So are you using MapR's proprietary distribution of Drill?
>
> Thanks,
> Tim
>
>
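
Tim's description of downstream operators waiting on blocked upstream operators can be illustrated with a small Python analogue. This is only a sketch of the pattern, not Drill code: the traces show Drill's receivers calling java.util.concurrent.LinkedBlockingDeque.take(), which waits without a timeout, whereas this example uses Python's queue.Queue with a timeout so that the demonstration terminates. The names receiver and run, and the None end-of-stream sentinel, are hypothetical.

```python
import queue
import threading


def receiver(batches, results, timeout_s):
    """Drain batches until the end-of-stream sentinel (None) arrives.

    An untimed get() would wait forever if the sender never finishes,
    which is the WAITING state seen in the dumps. The timeout here exists
    only so the sketch can report the stall instead of hanging.
    """
    while True:
        try:
            batch = batches.get(timeout=timeout_s)
        except queue.Empty:
            results.append("stalled")
            return
        if batch is None:
            results.append("done")
            return
        results.append(batch)


def run(sender_batches, send_end_of_stream, timeout_s=0.2):
    """Feed batches to a receiver thread and return what it observed."""
    batches = queue.Queue()
    results = []
    worker = threading.Thread(target=receiver, args=(batches, results, timeout_s))
    worker.start()
    for batch in sender_batches:
        batches.put(batch)
    if send_end_of_stream:
        batches.put(None)  # upstream signals that no more data is coming
    worker.join()
    return results
```

With run(["b1", "b2"], True) the receiver finishes normally; with run(["b1"], False) the sender never signals completion and the receiver reports a stall, which is the situation a thread parked in take() cannot detect on its own.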

RE: Queries getting stuck in RUNNING state occasionally

Posted by Kunal Khatua <kk...@mapr.com>.
Hi Lalit
Your profile hints that it is stuck in the Major Fragment 06-xx-xx, which is fed data from 16-xx-xx via 11-Exchange.

Looking at the operators’ overview and the similarity with other major fragments, only this one seems to be stuck at completing the sort.

Could you provide the jstack output from any of the nodes that are hosting fragments of 06-xx-xx?

Thanks
Kunal

From: Lalit Mishra [mailto:lalit.mishra@mojonetworks.com]
Sent: Thursday, January 25, 2018 4:03 AM
To: user@drill.apache.org
Subject: Re: Queries getting stuck in RUNNING state occasionally

Hello Timothy,

Please find attached the profile file (it exceeded the message size limit, so I had to gzip it). Please excuse the length of the query; it is a long query unioned 5 times. I have tried to reproduce the problem with a smaller query, but have failed so far.

Yes, we are using MapR 6.0.

Thanks,
Lalit Mishra

On Thu, Jan 25, 2018 at 2:37 AM, Timothy Farkas <ti...@apache.org> wrote:


On 2018/01/23 14:04:42, Lalit Mishra <la...@mojonetworks.com> wrote:
> Hello,
>
> We are using drill 1.11 (under yarn) on a 3 node cluster.
> Occasionally a query would remain stuck in the RUNNING state. The same
> query runs successfully on multiple occasions. I have not captured any
> information previous times this occurred, but have collected following on
> the latest occurrence -
>
>    - Full json profile
>    - Thread dumps on all three nodes
>
> I can provide these if needed.
>
> In the thread-dumps there are 107 threads tagged to the query id.
> 105 of them are stuck with following stack-trace -
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:6:3 id=266 state=WAITING
>     - waiting on <0x4a20ff6e> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x4a20ff6e> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at
> java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at
> java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at
> org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at
> org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at
> org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.getNextBatch(UnorderedReceiverBatch.java:141)
>     at
> org.apache.drill.exec.physical.impl.unorderedreceiver.UnorderedReceiverBatch.next(UnorderedReceiverBatch.java:159)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:406)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:357)
>     at
> org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:302)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at
> org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@45083904
>
>
> While 2 are stuck with -
>
> 2598df8d-8573-5e29-292c-fb343c99d280:frag:0:0 id=390 state=WAITING
>     - waiting on <0x730eeaf1> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     - locked <0x730eeaf1> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at sun.misc.Unsafe.park(Native Method)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at
> java.util.concurrent.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:492)
>     at
> java.util.concurrent.LinkedBlockingDeque.take(LinkedBlockingDeque.java:680)
>     at
> org.apache.drill.exec.work.batch.UnlimitedRawBatchBuffer$UnlimitedBufferQueue.take(UnlimitedRawBatchBuffer.java:61)
>     at
> org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)
>     at
> org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.getNext(MergingRecordBatch.java:147)
>     at
> org.apache.drill.exec.physical.impl.mergereceiver.MergingRecordBatch.innerNext(MergingRecordBatch.java:241)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext(LimitRecordBatch.java:115)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext(RemovingRecordBatch.java:93)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
>     at
> org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
>     at
> org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:134)
>     at
> org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:164)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
>     at
> org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
>     at
> org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
>     at
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
>     at
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
>
>     Locked synchronizers: count = 1
>       - java.util.concurrent.ThreadPoolExecutor$Worker@378527f8
>
>
> Any help with regards to figuring out what is going wrong will be
> appreciated. Thanks in advance!
>
> Thanks,
> Lalit Mishra
>
Hi Lalit,

The stack traces you provided indicate that downstream operators are waiting for data from upstream operators that are themselves blocked. This could mean that a scan operator is blocked reading from a data source, or that an operator like Sort or HashAgg is stuck. Can you please provide the query you are using along with the JSON profile?
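
To illustrate what a starved receiver looks like from the JVM's side, here is a minimal sketch. It is not Drill code; the class name, method name, and thread name are made up for illustration. A consumer thread calls `take()` on an empty `LinkedBlockingDeque`, the same JDK call that sits near the top of the traces, and parks until a producer hands it something.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class BlockedReceiverDemo {

    /** Starts a consumer on an empty deque and returns the thread state it settles in. */
    static Thread.State stateOfStarvedConsumer() {
        LinkedBlockingDeque<String> batches = new LinkedBlockingDeque<>();
        Thread consumer = new Thread(() -> {
            try {
                batches.take(); // parks until a producer offers an element
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo:frag:0:0"); // hypothetical name, shaped like the fragment thread names
        consumer.setDaemon(true);
        consumer.start();
        try {
            // Give the consumer up to five seconds to reach take() and park.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (consumer.getState() != Thread.State.WAITING
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Thread.State observed = consumer.getState();
            consumer.interrupt(); // release the parked thread so the demo can finish
            consumer.join(1000);
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing the consumer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("consumer state: " + stateOfStarvedConsumer());
    }
}
```

If no producer ever offers an element, the consumer stays parked indefinitely, which is consistent with a query that remains in the RUNNING state.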

Also, please note that Apache Drill does not have YARN support yet; the PR is pending here: https://github.com/apache/drill/pull/1011 . So are you using MapR's proprietary distribution of Drill?

Thanks,
Tim


Re: Queries getting stuck in RUNNING state occasionally

Posted by Lalit Mishra <la...@mojonetworks.com>.
Hello Timothy,

Please find attached the profile file (it exceeded the message size limit, so
I had to gzip it). Please excuse the length of the query; it is a long query
unioned 5 times. I have tried to reproduce the issue with a smaller query, but
have failed so far.

Yes, we are using MapR 6.0.

Thanks,
Lalit Mishra

On Thu, Jan 25, 2018 at 2:37 AM, Timothy Farkas <ti...@apache.org>
wrote:

> Hi Lalit,
>
> The stack traces you provided indicate that downstream operators are
> waiting for data from upstream operators that are themselves blocked. This
> could mean that a scan operator is blocked reading from a data source, or
> that an operator like Sort or HashAgg is stuck. Can you please provide the
> query you are using along with the JSON profile?
>
> Also, please note that Apache Drill does not have YARN support yet; the PR
> is pending here: https://github.com/apache/drill/pull/1011 . So are you
> using MapR's proprietary distribution of Drill?
>
> Thanks,
> Tim
>

Re: Queries getting stuck in RUNNING state occasionally

Posted by Timothy Farkas <ti...@apache.org>.

Hi Lalit,

The stack traces you provided indicate that downstream operators are waiting for data from upstream operators that are themselves blocked. This could mean that a scan operator is blocked reading from a data source, or that an operator like Sort or HashAgg is stuck. Can you please provide the query you are using along with the JSON profile?
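
One way to look for the blocked upstream operator is to filter the thread dumps for fragment threads of the query that are not parked in the receive buffer. Below is a rough sketch of that idea. It assumes the dump uses the header shape seen in the posted excerpt (`<thread name> id=<n> state=<STATE>`) and that a `BaseRawBatchBuffer.getNext` frame marks a receiver wait; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FragmentThreadFilter {

    // Assumed header shape, taken from the posted excerpt:
    // "<thread name> id=<n> state=<STATE>"
    private static final Pattern HEADER =
            Pattern.compile("^\\S.* id=\\d+ state=\\w+\\s*$");

    // Frame that the posted traces show when a fragment waits for incoming batches.
    private static final String RECEIVER_FRAME = "BaseRawBatchBuffer.getNext";

    /** Returns the header lines of the query's fragment threads that have no receiver-wait frame. */
    static List<String> fragmentsNotWaitingOnReceiver(String dump, String queryId) {
        List<String> suspects = new ArrayList<>();
        String header = null;            // header of the fragment thread being scanned
        boolean sawReceiverFrame = false;
        for (String line : dump.split("\\R")) {
            if (HEADER.matcher(line).matches()) {
                // A new thread starts here: settle the previous one first.
                if (header != null && !sawReceiverFrame) {
                    suspects.add(header);
                }
                header = line.startsWith(queryId + ":frag:") ? line.trim() : null;
                sawReceiverFrame = false;
            } else if (header != null && line.contains(RECEIVER_FRAME)) {
                sawReceiverFrame = true;
            }
        }
        if (header != null && !sawReceiverFrame) {
            suspects.add(header);
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Tiny made-up dump; a real run would pass the text of a full thread dump.
        String dump = "q1:frag:1:0 id=10 state=WAITING\n"
                + "    at org.apache.drill.exec.work.batch.BaseRawBatchBuffer.getNext(BaseRawBatchBuffer.java:170)\n"
                + "\n"
                + "q1:frag:2:0 id=11 state=RUNNABLE\n"
                + "    at com.example.Scan.read(Scan.java:1)\n";
        fragmentsNotWaitingOnReceiver(dump, "q1").forEach(System.out::println);
    }
}
```

Any thread this reports is a candidate for the operator that is holding everything else up. An empty result would suggest the stall is outside the fragment threads of that query on that node.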

Also, please note that Apache Drill does not have YARN support yet; the PR is pending here: https://github.com/apache/drill/pull/1011 . So are you using MapR's proprietary distribution of Drill?

Thanks,
Tim