Posted to dev@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2016/11/22 00:58:58 UTC

[jira] [Created] (HIVE-15255) LLAP: service_busy error should not be retried so fast

Sergey Shelukhin created HIVE-15255:
---------------------------------------

             Summary: LLAP: service_busy error should not be retried so fast
                 Key: HIVE-15255
                 URL: https://issues.apache.org/jira/browse/HIVE-15255
             Project: Hive
          Issue Type: Bug
            Reporter: Sergey Shelukhin


{noformat}
2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328, timeTaken=5, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329, timeTaken=16, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330, timeTaken=117, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331, timeTaken=14, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332, timeTaken=6, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3), counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
{noformat}

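As you can see from the attempt numbers (1328 through 1332 in this excerpt), this has been going on for a while, with each rejected attempt followed by a new one within a few milliseconds. In fact, I think other tasks could have been scheduled in that time (not sure), but the thread just kept retrying this one task until it was finally scheduled.
There should be some fallback after the initial failures, so that SERVICE_BUSY rejections are not retried this fast; we should also make sure such retries do not take over all scheduling.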
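
One possible shape for such a fallback is a capped exponential delay between retries of the same task. The sketch below is only an illustration under that assumption: the class ServiceBusyBackoff, its method delayMs, and the delay values are hypothetical and are not taken from the Hive or Tez code base, and the issue itself does not prescribe this approach.

```java
/**
 * Hypothetical helper (not part of Hive or Tez): computes how long to wait
 * before resubmitting a task that was rejected with SERVICE_BUSY.
 */
public class ServiceBusyBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;

    public ServiceBusyBackoff(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /**
     * Returns the delay before the next retry, given how many consecutive
     * SERVICE_BUSY rejections the task has seen. The delay doubles with each
     * rejection and is capped at maxDelayMs.
     */
    public long delayMs(int consecutiveRejections) {
        if (consecutiveRejections <= 0) {
            return 0L; // no rejection yet, so no delay
        }
        // Limit the shift so the doubling cannot overflow a long.
        int shift = Math.min(consecutiveRejections - 1, 30);
        // If doubling would exceed the cap, return the cap directly.
        if (baseDelayMs > (maxDelayMs >> shift)) {
            return maxDelayMs;
        }
        return baseDelayMs << shift;
    }

    public static void main(String[] args) {
        ServiceBusyBackoff backoff = new ServiceBusyBackoff(10, 1000);
        for (int rejections = 1; rejections <= 8; rejections++) {
            System.out.println(rejections + " rejection(s): wait "
                + backoff.delayMs(rejections) + " ms");
        }
    }
}
```

With a delay like this, a scheduler could put the rejected task aside until its delay expires and use the time to schedule other tasks, which would also address the concern about retries taking over all scheduling.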


