Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2016/09/02 21:16:20 UTC

[jira] [Comment Edited] (HIVE-14608) LLAP: slow scheduling due to LlapTaskScheduler not removing nodes on kill

    [ https://issues.apache.org/jira/browse/HIVE-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15459610#comment-15459610 ] 

Sergey Shelukhin edited comment on HIVE-14608 at 9/2/16 9:16 PM:
-----------------------------------------------------------------

[~sseth] I can actually see problems because of this. Easy repro: start LLAP (e.g. 7 nodes), start the session (with AM), flex LLAP down (e.g. to 4), then run some query. There can be a large delay in scheduling, and the whole job can slow down a lot, because the killed nodes are not removed from instanceToNodeMap:
{noformat}
2016-09-02 16:51:41,428 [INFO] [ServiceThread:org.apache.tez.dag.app.rm.TaskSchedulerManager] |tezplugins.LlapTaskSchedulerService|: Setting up node: DynamicServiceInstance [alive=true, host=cn109... with resources=<memory:83968, vCores:16>, shufflePort=15551, servicesAddress=..., mgmtPort=15004] with available capacity=16, pendingQueueSize=null, memory=83968
...
(of course nothing is actually removed)
2016-09-02 16:52:01,490 [INFO] [StateChangeNotificationHandler] |tezplugins.LlapTaskSchedulerService$NodeStateChangeListener|: Removed node with identity: f9b37b46-f629-4460-862f-f34183ba0a24
2016-09-02 16:52:01,567 [INFO] [StateChangeNotificationHandler] |tezplugins.LlapTaskSchedulerService$NodeStateChangeListener|: Removed node with identity: 12399334-c743-4a9b-8224-8c0cbc21dea7
2016-09-02 16:52:01,776 [INFO] [StateChangeNotificationHandler] |tezplugins.LlapTaskSchedulerService$NodeStateChangeListener|: Removed node with identity: c7b50156-b4f9-4353-89a4-3d1a1ccea604
...
2016-09-02 16:53:39,511 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned task TaskInfo{task=attempt_1466700718395_1343_2_07_000000_1, priority=140, startTime=0, containerId=null, assignedInstance=null, uniqueId=24, localityDelayTimeout=0} to container container_222212222_1343_01_000025 on node=DynamicServiceInstance [alive=true, host=cn109... with resources=<memory:83968, vCores:16>, shufflePort=15551, servicesAddress=..., mgmtPort=15004]
{noformat}

Here, two attempts of the last reducer of the job failed with network errors, causing the query runtime to triple.
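
For illustration only, here is a minimal sketch of what removing a node from the map on kill might look like. Only instanceToNodeMap and workerNodeRemoved are names taken from this issue; NodeRemovalSketch, NodeInfo, workerNodeAdded, selectHost, trackedNodeCount and the use of plain strings for identities and hosts are hypothetical stand-ins, not the actual LlapTaskSchedulerService code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the scheduler's node bookkeeping; NOT Hive's real classes.
final class NodeRemovalSketch {

    /** Minimal per-node state: identity, host, and a flag the scheduler checks. */
    static final class NodeInfo {
        final String workerIdentity;
        final String host;
        private volatile boolean enabled = true;

        NodeInfo(String workerIdentity, String host) {
            this.workerIdentity = workerIdentity;
            this.host = host;
        }

        boolean isEnabled() { return enabled; }

        void disable() { enabled = false; }
    }

    // Keyed by worker identity, mirroring the map named in the issue.
    private final Map<String, NodeInfo> instanceToNodeMap = new ConcurrentHashMap<>();

    void workerNodeAdded(String workerIdentity, String host) {
        instanceToNodeMap.put(workerIdentity, new NodeInfo(workerIdentity, host));
    }

    /**
     * Drops the node so it is no longer a scheduling candidate.
     * Returns true if the node was tracked, false if it was unknown.
     */
    boolean workerNodeRemoved(String workerIdentity) {
        NodeInfo removed = instanceToNodeMap.remove(workerIdentity);
        if (removed == null) {
            return false;
        }
        // Also disable it, in case another thread still holds a reference.
        removed.disable();
        return true;
    }

    /** Returns the host of any tracked, enabled node, if one exists. */
    Optional<String> selectHost() {
        return instanceToNodeMap.values().stream()
                .filter(NodeInfo::isEnabled)
                .map(node -> node.host)
                .findFirst();
    }

    int trackedNodeCount() {
        return instanceToNodeMap.size();
    }
}
```

With something along these lines, a node that has been removed stops being returned by selectHost, so tasks would not be assigned to it and then fail. Whether the real scheduler can drop the entry this simply depends on why the removal was disabled in the first place, which this sketch does not address.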


> LLAP: slow scheduling due to LlapTaskScheduler not removing nodes on kill 
> --------------------------------------------------------------------------
>
>                 Key: HIVE-14608
>                 URL: https://issues.apache.org/jira/browse/HIVE-14608
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth
>
> ...and presumably doesn't disable them for scheduling. I haven't looked in detail, though; I just see some harmless killed tasks in queries after I manually kill some LLAP nodes between queries.
> {noformat}
>   public void workerNodeRemoved(ServiceInstance serviceInstance) {
>      // FIXME: disabling this for now
> // instanceToNodeMap.remove(serviceInstance.getWorkerIdentity());
> {noformat}


