Posted to notifications@asterixdb.apache.org by "Murtadha Hubail (JIRA)" <ji...@apache.org> on 2018/03/15 22:47:00 UTC

[jira] [Commented] (ASTERIXDB-2185) Cluster becomes UNUSABLE status after a NC fails to send a job failure.

    [ https://issues.apache.org/jira/browse/ASTERIXDB-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401211#comment-16401211 ] 

Murtadha Hubail commented on ASTERIXDB-2185:
--------------------------------------------

[~wangsaeu],

There is no indication in the logs that the NC failed to send the task failure notification. As a matter of fact, the logs look normal. There are three tasks to be aborted, so it is expected to see these log entries repeated, each with a different task id. Also, at the end of the logs, the Joblet close is logged, which means the CC received the task failure notifications and instructed the NCs to do the clean up. Do you have the CC logs that show the cluster going to UNUSABLE? All of this could have been caused by an NC failing to send a heartbeat or losing its connection with the CC, which results in the cluster state becoming UNUSABLE and the job being aborted.
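
To make the heartbeat scenario concrete, here is a minimal sketch of how a controller could derive a cluster state from node heartbeats. This is not the Hyracks/AsterixDB implementation: the class HeartbeatMonitor, its methods, the ClusterState enum, and the timeout value are all hypothetical names chosen for illustration, and the real CC logic may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not AsterixDB/Hyracks code.
class HeartbeatMonitor {
    enum ClusterState { ACTIVE, UNUSABLE }

    // Assumed policy: a node is considered lost once it has been silent
    // for longer than this many milliseconds.
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record a heartbeat received from the given node at time nowMillis.
    void heartbeat(String nodeId, long nowMillis) {
        lastHeartbeat.put(nodeId, nowMillis);
    }

    // The cluster is ACTIVE only if every known node has reported within
    // the timeout; a single silent node makes the whole cluster UNUSABLE.
    ClusterState state(long nowMillis) {
        for (long last : lastHeartbeat.values()) {
            if (nowMillis - last > timeoutMillis) {
                return ClusterState.UNUSABLE;
            }
        }
        return ClusterState.ACTIVE;
    }
}
```

In this sketch, if two nodes report at time 0 and only one of them reports again, state(...) returns UNUSABLE once the silent node exceeds the timeout. That would match a job being aborted even though the reporting NC's own logs look normal.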

> Cluster becomes UNUSABLE status after a NC fails to send a job failure.
> -----------------------------------------------------------------------
>
>                 Key: ASTERIXDB-2185
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2185
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: IDX - Indexes, RT - Runtime
>            Reporter: Taewoo Kim
>            Assignee: Murtadha Hubail
>            Priority: Major
>              Labels: triaged
>
> A cluster became UNUSABLE status after a NC failed to send a job failure message. See the exception below.
> {code}
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.nc.work.StartTasksWork run
> INFO: Initializing TAID:TID:ANID:ODID:16:0:1:0 -> [Asterix {
>   ets;
>   assign [0, 1, 2] := [Constant, Constant, Constant];
> }, org.apache.hyracks.storage.am.lsm.invertedindex.dataflow.LSMInvertedIndexSearchOperatorDescriptor@23d902c1, org.apache.hyracks.dataflow.std.sort.ExternalSortOperatorDescriptor$1@2fc09944]
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.dataflow.std.sort.AbstractSorterOperatorDescriptor$SortActivity$1 close
> INFO: InitialNumberOfRuns:0
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:13:0:1:0
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:13:0:0:0
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.dataflow.std.sort.AbstractSorterOperatorDescriptor$SortActivity$1 close
> INFO: InitialNumberOfRuns:0
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:16:0:0:0
> Dec 03, 2017 6:47:13 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskCompleteWork:TAID:TID:ANID:ODID:16:0:1:0
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: AbortTasks
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.work.AbortTasksWork run
> INFO: Aborting Tasks: JID:0:[TAID:TID:ANID:ODID:0:0:0:0, TAID:TID:ANID:ODID:3:0:0:0, TAID:TID:ANID:ODID:3:0:1:0]
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Task run
> WARNING: Task TAID:TID:ANID:ODID:3:0:0:0 failed with exception
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
> 	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:325)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:744)
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Task run
> WARNING: Task TAID:TID:ANID:ODID:3:0:1:0 failed with exception
> java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
> 	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:325)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:744)
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskFailure
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.work.NotifyTaskFailureWork run
> WARNING: 1 is sending a notification to cc that task TAID:TID:ANID:ODID:3:0:0:0 has failed
> org.apache.hyracks.api.exceptions.HyracksDataException: HYR0003: java.lang.InterruptedException
> 	at org.apache.hyracks.control.common.utils.ExceptionUtils.setNodeIds(ExceptionUtils.java:68)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:367)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
> 	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:325)
> 	... 3 more
> 	
> 	
> ...... Same exception was repeated for several times ......
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.common.work.WorkQueue$WorkerThread run
> INFO: Executing: NotifyTaskFailure
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.work.NotifyTaskFailureWork run
> WARNING: 1 is sending a notification to cc that task TAID:TID:ANID:ODID:3:0:0:0 has failed
> org.apache.hyracks.api.exceptions.HyracksDataException: HYR0003: java.lang.InterruptedException
> 	at org.apache.hyracks.control.common.utils.ExceptionUtils.setNodeIds(ExceptionUtils.java:68)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:367)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.InterruptedException
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
> 	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
> 	at org.apache.hyracks.control.nc.Task.run(Task.java:325)
> 	... 3 more
> Dec 03, 2017 6:48:02 PM org.apache.hyracks.control.nc.Joblet close
> WARNING: Freeing leaked 458752 bytes	
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)