Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2014/12/29 18:28:13 UTC

[jira] [Commented] (DRILL-1804) random failures while running large number of queries

    [ https://issues.apache.org/jira/browse/DRILL-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260244#comment-14260244 ] 

Rahul Challapalli commented on DRILL-1804:
------------------------------------------

For queries with very little setup time, we might have a race condition when writing to ZooKeeper: the query tries to add a node for status 'PENDING' and then for 'RUNNING' within a short span of time. This could explain the NodeExistsException in the logs.
If the write to ZooKeeper is asynchronous(?), the query would continue to execute in the meantime. If the query is still running when ZooKeeper comes back with the exception, we would see a failure. This would also explain the randomness of the issue.
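
To make the suspected failure mode concrete, here is a minimal sketch. It does not use Drill's or Curator's code: InMemoryStore, NodeExistsException (a local stand-in, not the ZooKeeper class), secondCreateFails and the path "/drill/running/example-query" are all hypothetical names made up for illustration. It only assumes that both state changes are written to the same path with create-only semantics, and shows that a create-or-update write would tolerate the second write.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class QueryStateStoreSketch {

    /** Local stand-in for ZooKeeper's NodeExists error (hypothetical, not the real class). */
    static class NodeExistsException extends RuntimeException {
        NodeExistsException(String path) {
            super("NodeExists for " + path);
        }
    }

    /** In-memory stand-in for a ZooKeeper-backed store (hypothetical, not Drill's ZkEStore). */
    static class InMemoryStore {
        private final ConcurrentMap<String, String> nodes = new ConcurrentHashMap<>();

        /** Create-only write: fails if a node already exists at the path. */
        void create(String path, String state) {
            if (nodes.putIfAbsent(path, state) != null) {
                throw new NodeExistsException(path);
            }
        }

        /** Create-or-update write: try to create, overwrite if the node is already there. */
        void put(String path, String state) {
            try {
                create(path, state);
            } catch (NodeExistsException e) {
                nodes.put(path, state);
            }
        }

        String get(String path) {
            return nodes.get(path);
        }
    }

    /** Writes PENDING and then RUNNING with create-only writes; true if the second write fails. */
    static boolean secondCreateFails(InMemoryStore store, String path) {
        store.create(path, "PENDING");
        try {
            store.create(path, "RUNNING");
            return false;
        } catch (NodeExistsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        String path = "/drill/running/example-query"; // hypothetical query id

        System.out.println("second create failed: " + secondCreateFails(store, path));

        // With create-or-update semantics the RUNNING transition goes through.
        store.put(path, "RUNNING");
        System.out.println("state after put: " + store.get(path));
    }
}
```

Whether the real code path behaves like create() or like put() here is exactly the open question; the sketch only shows why two quick state writes to one path would surface as NodeExists if the second one is a plain create.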

Let me know your thoughts.

> random failures while running large number of queries
> -----------------------------------------------------
>
>                 Key: DRILL-1804
>                 URL: https://issues.apache.org/jira/browse/DRILL-1804
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.7.0
>            Reporter: Chun Chang
>
> #Tue Dec 02 14:38:34 EST 2014
> git.commit.id.abbrev=757e9a2
> When running the Mondrian regression tests (over 6000 queries), I sometimes get one or two random failures. Here is the stack trace when it happens:
> 2014-12-02 17:49:32,271 [2b8193d3-f0ca-aa7c-094a-d8234d76d068:foreman] ERROR o.a.drill.exec.work.foreman.Foreman - Error aeae057b-ed0a-43aa-902d-fe3a41531511: Query failed: Unexpected exception during fragment initialization.
> org.apache.drill.exec.work.foreman.ForemanException: Unexpected exception during fragment initialization.
>   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:194) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.WorkManager$RunnableWrapper.run(WorkManager.java:254) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_45]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>   at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> Caused by: java.lang.RuntimeException: Failure while accessing Zookeeper. Failure while accessing Zookeeper
>   at org.apache.drill.exec.store.sys.zk.ZkAbstractStore.put(ZkAbstractStore.java:111) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.QueryStatus.updateQueryStateInStore(QueryStatus.java:132) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:502) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:396) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan(Foreman.java:311) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:510) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:185) [drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   ... 4 common frames omitted
> Caused by: java.lang.RuntimeException: Failure while accessing Zookeeper
>   at org.apache.drill.exec.store.sys.zk.ZkEStore.createNodeInZK(ZkEStore.java:53) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   at org.apache.drill.exec.store.sys.zk.ZkAbstractStore.put(ZkAbstractStore.java:106) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   ... 10 common frames omitted
> Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /drill/running/2b8193d3-f0ca-aa7c-094a-d8234d76d068
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:119) ~[zookeeper-3.4.5-mapr-1406.jar:3.4.5-mapr-1406--1]
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) ~[zookeeper-3.4.5-mapr-1406.jar:3.4.5-mapr-1406--1]
>   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783) ~[zookeeper-3.4.5-mapr-1406.jar:3.4.5-mapr-1406--1]
>   at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:676) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:660) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.5.0.jar:na]
>   at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:656) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:441) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:431) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44) ~[curator-framework-2.5.0.jar:na]
>   at org.apache.drill.exec.store.sys.zk.ZkEStore.createNodeInZK(ZkEStore.java:51) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]
>   ... 11 common frames omitted
> 2014-12-02 17:49:32,287 [2b8193d3-f0ca-aa7c-094a-d8234d76d068:frag:0:0] WARN  o.a.d.e.p.impl.SendingAccountor - Failure while waiting for send complete.
> java.lang.InterruptedException: null
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1301) ~[na:1.7.0_45]
>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:472) ~[na:1.7.0_45]
>   at org.apache.drill.exec.physical.impl.SendingAccountor.waitForSendComplete(SendingAccountor.java:44) ~[drill-java-exec-0.7.0-SNAPSHOT-rebuffed.jar:0.7.0-SNAPSHOT]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)