Posted to issues@hive.apache.org by "Barna Zsombor Klara (JIRA)" <ji...@apache.org> on 2017/12/12 18:04:00 UTC

[jira] [Comment Edited] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

    [ https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287976#comment-16287976 ] 

Barna Zsombor Klara edited comment on HIVE-18263 at 12/12/17 6:03 PM:
----------------------------------------------------------------------

Thank you for the patch, Adam, and for the detailed analysis. Just one minor question:
Is there a reason we gather all the addresses into one set and then iterate over that set, instead of iterating over the nodes and, in a nested loop, over each node's addresses to remove them from the failed hosts collection?
Not a big issue, I'm just curious.
Otherwise +1.

[~spena], based on the linked Jira it seems you came to a different conclusion: that the IPs cannot clash between the killed and the live hosts. Could you please help clarify what is, or could be, going on here? I'm confused.
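
For readers following the review question above, the two approaches it contrasts can be sketched roughly as below. This is an illustration only, not the code from HIVE-18263.0.patch: the Node class, the method names, and the Map<String, Long> shape of the terminated-hosts collection are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TerminatedHostCleanup {

  /** Hypothetical stand-in for a cloud node; not the real ptest or jclouds type. */
  static final class Node {
    final List<String> addresses;

    Node(String... addresses) {
      this.addresses = Arrays.asList(addresses);
    }
  }

  /** Variant 1: gather every address into one set, then remove them in a second pass. */
  static void removeViaSet(Map<String, Long> terminatedHosts, List<Node> newNodes) {
    Set<String> addresses = new HashSet<>();
    for (Node node : newNodes) {
      addresses.addAll(node.addresses);
    }
    for (String address : addresses) {
      terminatedHosts.remove(address);
    }
  }

  /** Variant 2: nested iteration, removing each address as it is encountered. */
  static void removeNested(Map<String, Long> terminatedHosts, List<Node> newNodes) {
    for (Node node : newNodes) {
      for (String address : node.addresses) {
        terminatedHosts.remove(address);
      }
    }
  }
}
```

Under these assumptions both variants leave the map in the same state. The set-based one deduplicates addresses before touching the map, while the nested one avoids the intermediate collection.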


was (Author: zsombor.klara):
Thank you for the patch, Adam, and for the detailed analysis. Just one minor question:
Is there a reason we gather all the addresses into one set and then iterate over that set, instead of iterating over the nodes and, in a nested loop, over each node's addresses to remove them from the failed hosts collection?
Not a big issue, I'm just curious.
Otherwise +1.

> Ptest execution are multiple times slower sometimes due to dying executor slaves
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-18263
>                 URL: https://issues.apache.org/jira/browse/HIVE-18263
>             Project: Hive
>          Issue Type: Bug
>          Components: Testing Infrastructure
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>         Attachments: HIVE-18263.0.patch
>
>
> The PreCommit-HIVE-Build job has been seen running for a very long time on occasion. It should usually take about 1.5 hours, but in some cases it took over 4-5 hours.
> Looking at the logs of one such execution, I saw that some commands sent to the test-executing slaves returned 255. Here this typically means that the return code of the remote call is unknown because hiveptest-server can no longer reach these slaves.
> The hiveptest-server logs show that some slaves were killed while running the job normally, and here is why:
> * Hive's ptest-server checks the status of the slaves periodically, every 60 minutes. It also keeps track of slaves that were terminated.
> ** If such a check finds that a slave that was already killed ([mTerminatedHosts map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93] contains its IP) is still running, the server will try to terminate it again.
> * The server also maintains a file on its local FS that contains the IPs of hosts that were used before. (This is probably for resilience reasons.)
> ** This file is read when the Tomcat server starts, and if any of the IPs in the file are seen as running slaves, ptest will terminate these first so it can begin with a fresh start.
> ** The IPs of these terminated instances already make their way into {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 minutes and might see a running host that has the same IP as an old slave that was terminated at startup. Because that IP is in {{mTerminatedHosts}}, the thread will conclude that it already tried to terminate this host 60 minutes ago and that the host should be terminated.
> We have to fix this by making sure that when a new slave is created, we check the contents of {{mTerminatedHosts}} and remove the new slave's IP from it if it is there.
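
The failure mode and the proposed fix described above can be modelled with a small, self-contained simulation. This is only a sketch under stated assumptions: CaretakerSketch, its fields, and its methods are hypothetical names invented for illustration, and the real logic lives in CloudExecutionContextProvider and talks to an actual cloud provider.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CaretakerSketch {
  // Hypothetical stand-in for mTerminatedHosts: IP -> time the termination was requested.
  final Map<String, Long> terminatedHosts = new HashMap<>();
  // IPs of hosts the simulated cloud currently reports as running.
  final List<String> runningHosts = new ArrayList<>();
  // When true, apply the fix proposed in the issue description.
  final boolean clearOnCreate;

  CaretakerSketch(boolean clearOnCreate) {
    this.clearOnCreate = clearOnCreate;
  }

  /** Startup: terminate a host left over from a previous run and remember its IP. */
  void terminateAtStartup(String ip) {
    runningHosts.remove(ip);
    terminatedHosts.put(ip, System.currentTimeMillis());
  }

  /** A new slave is created; the cloud provider may hand out a recycled IP. */
  void createSlave(String ip) {
    runningHosts.add(ip);
    if (clearOnCreate) {
      // Proposed fix: a freshly created slave must not be treated as terminated.
      terminatedHosts.remove(ip);
    }
  }

  /** Periodic check: kill any running host whose IP is recorded as terminated. */
  List<String> caretakerPass() {
    List<String> killed = new ArrayList<>();
    // Iterate over a copy so hosts can be removed from runningHosts during the pass.
    for (String ip : new ArrayList<>(runningHosts)) {
      if (terminatedHosts.containsKey(ip)) {
        runningHosts.remove(ip);
        killed.add(ip);
      }
    }
    return killed;
  }
}
```

In this model, with clearOnCreate set to false, a slave created on a recycled IP is killed by the next caretaker pass. With it set to true, the slave survives because its IP is no longer in the terminated-hosts map.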


