Posted to issues@hive.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/02/18 20:09:00 UTC
[jira] [Work logged] (HIVE-22359) LLAP: when a node restarts with the exact same host/port in kubernetes it is not detected as a task failure

     [ https://issues.apache.org/jira/browse/HIVE-22359?focusedWorklogId=389041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-389041 ]

ASF GitHub Bot logged work on HIVE-22359:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Feb/20 20:08
            Start Date: 18/Feb/20 20:08
    Worklog Time Spent: 10m 
      Work Description: prasanthj commented on pull request #917: HIVE-22359: LLAP: when a node restarts with the exact same host/port in kubernetes it is not detected as a task failure
URL: https://github.com/apache/hive/pull/917
 
 
   In Kubernetes environments, the hostname and port of an LLAP service stay the same, but the IP address of its pod can change. Some code in LLAP assumes that hostname:port identifies a node and caches connections under that key. In addition, the AM believes that a given host is running certain task attempts, but when the LLAP pod restarts, all tasks on that node are killed or replaced with new tasks, so LLAP heartbeats with task attempts that the AM does not expect.
   
   This PR fixes 2 issues:
   - Includes the IP address in the hostId that is used for caching RPC connections.
   - When the AM expects some task attempts to be on a node and they do not exist there, it kills those task attempts so that they get rescheduled.
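
   The two fixes above can be illustrated with a minimal sketch. This is not the code from the pull request: the class `LlapRestartSketch`, the methods `hostId` and `missingAttempts`, and the sample host name and IP addresses are all hypothetical, chosen only to show the idea. The attempt IDs are taken from the log excerpt in the issue description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; names and key format are assumptions,
// not the actual Hive/LLAP implementation.
class LlapRestartSketch {

    // Fix 1 (sketch): build the connection-cache key from hostname, IP and port,
    // so a restarted pod with the same hostname:port but a new IP gets a new key.
    static String hostId(String hostname, String ipAddress, int port) {
        return hostname + "/" + ipAddress + ":" + port;
    }

    // Fix 2 (sketch): attempts the AM expects on a node that the node's
    // heartbeat did not report. These are the candidates to kill and reschedule.
    static Set<String> missingAttempts(Set<String> expectedByAm, Set<String> reportedByNode) {
        Set<String> missing = new TreeSet<>(expectedByAm);
        missing.removeAll(reportedByNode);
        return missing;
    }

    public static void main(String[] args) {
        // Same hostname and port, different pod IP after a restart (sample values).
        String before = hostId("llap-0.example.svc.cluster.local", "10.0.0.5", 15001);
        String after = hostId("llap-0.example.svc.cluster.local", "10.0.0.9", 15001);
        System.out.println("cache key changed after restart: " + !before.equals(after));

        Set<String> expected = new HashSet<>(Arrays.asList(
            "attempt_1569601631911_0000_1_04_000034_0",
            "attempt_1569601631911_0000_1_04_000071_0"));
        Set<String> reported = new HashSet<>(Arrays.asList(
            "attempt_1569601631911_0000_1_04_000034_0"));
        System.out.println("attempts to kill: " + missingAttempts(expected, reported));
    }
}
```

   Keying the cache on the IP as well as hostname:port means a stale connection to the old pod is never reused for the new one, and the set difference gives the AM an explicit list of attempts to fail instead of waiting for them to time out.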
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 389041)
    Remaining Estimate: 0h
            Time Spent: 10m

> LLAP: when a node restarts with the exact same host/port in kubernetes it is not detected as a task failure
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-22359
>                 URL: https://issues.apache.org/jira/browse/HIVE-22359
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22359.1.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> <14>1 2019-10-16T22:16:39.233Z query-coordinator-0-5.query-coordinator-0-service.compute-1569601454-l2x9.svc.cluster.local query-coordinator 1 461e5ad9-f05f-11e9-85f7-06e84765763e [mdc@18060 class="tezplugins.LlapTaskCommunicator" level="INFO" thread="IPC Server handler 4 on 33333"] The tasks we expected to be on the node are not there: attempt_1569601631911_0000_1_04_000034_0, attempt_1569601631911_0000_1_04_000071_0, attempt_1569601631911_0000_1_04_000191_0, attempt_1569601631911_0000_1_04_000211_0, attempt_1569601631911_0000_1_04_000229_0, attempt_1569601631911_0000_1_04_000231_0, attempt_1569601631911_0000_1_04_000235_0, attempt_1569601631911_0000_1_04_000242_0, attempt_1569601631911_0000_1_04_000160_1, attempt_1569601631911_0000_1_04_000012_2, attempt_1569601631911_0000_1_04_000003_2, attempt_1569601631911_0000_1_04_000056_2,
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)