Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2017/02/23 18:59:44 UTC

[jira] [Resolved] (HIVE-15255) LLAP: service_busy error should not be retried so fast

     [ https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin resolved HIVE-15255.
-------------------------------------
    Resolution: Cannot Reproduce

Will reopen if I see it again.

> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
>                 Key: HIVE-15255
>                 URL: https://issues.apache.org/jira/browse/HIVE-15255
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As you can see from the attempt numbers, this had been going on for a while. In fact, I think other tasks could have been scheduled in that time (not sure), but the thread just kept retrying this one task until it was finally scheduled.
> There should be some fallback after the initial failures; we should also make sure such retries do not take over all scheduling (not sure if they do, need to check).
> LLAP on the node was alive, just busy with other tasks. The task did eventually get scheduled.
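
One way to read the request above is as a capped exponential backoff keyed on the number of consecutive SERVICE_BUSY rejections, with deferred tasks parked in a queue so that other work can be scheduled in the meantime. The following is only a minimal sketch of that idea, not the LLAP or Tez scheduler code: backoff_delay_ms, RetryQueue, and the 10 ms base / 2000 ms cap are all hypothetical names and values chosen for illustration.

```python
import heapq
import random


def backoff_delay_ms(consecutive_busy, base_ms=10, cap_ms=2000, jitter=0.0, rng=random):
    """Delay before resubmitting a task rejected `consecutive_busy` times in a row.

    Hypothetical helper: the delay doubles with each rejection and is capped
    at `cap_ms`. `jitter` (0.0 to 1.0) optionally shortens the delay by a
    random fraction so that many rejected tasks do not retry in lockstep.
    """
    if consecutive_busy <= 0:
        return 0
    # Bound the exponent so the intermediate value stays small.
    exponent = min(consecutive_busy - 1, 30)
    delay = min(cap_ms, base_ms * (2 ** exponent))
    if jitter:
        delay -= rng.uniform(0, jitter) * delay
    return int(delay)


class RetryQueue:
    """Holds rejected tasks until their backoff expires (illustrative only).

    The scheduling loop asks for tasks that are ready *now* instead of
    spinning on a single busy task, so other tasks can be considered while
    this one waits.
    """

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal ready times pop in FIFO order

    def defer(self, task_id, now_ms, consecutive_busy):
        """Park a task and return the time (ms) at which it becomes eligible."""
        ready_at = now_ms + backoff_delay_ms(consecutive_busy)
        heapq.heappush(self._heap, (ready_at, self._seq, task_id))
        self._seq += 1
        return ready_at

    def pop_ready(self, now_ms):
        """Return every parked task whose backoff has expired by `now_ms`."""
        ready = []
        while self._heap and self._heap[0][0] <= now_ms:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

With these assumed defaults the delays grow as 10, 20, 40, ... ms up to the 2000 ms cap, compared with the 6 to 7 ms between each FINISHED line and the next STARTED line in the log above. The actual values would need tuning against real LLAP queue behaviour.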



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)