Posted to yarn-issues@hadoop.apache.org by "Qihong Wu (Jira)" <ji...@apache.org> on 2021/07/07 22:25:00 UTC
[jira] [Created] (YARN-10851) Tez session close does not interrupt yarn's async thread

Qihong Wu created YARN-10851:
--------------------------------

             Summary: Tez session close does not interrupt yarn's async thread
                 Key: YARN-10851
                 URL: https://issues.apache.org/jira/browse/YARN-10851
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.10.1, 2.8.5
         Environment: On an HA cluster, where RM1 is not the active RM
YARN version 2.8.5, configured with Tez
            Reporter: Qihong Wu
         Attachments: hive.log

Hi, I would like to ask for expert guidance on how YARN behaves when handling `InterruptedIOException`.

The issue occurs on an HA cluster where RM1 is NOT the active RM. If a YARN request made to RM1 fails, RM failover should therefore happen. However, if an interrupted exception is thrown while connecting to RM1, the thread should try to [bail out|https://dzone.com/articles/how-to-handle-the-interruptedexception] as soon as possible to [respect the interrupt request|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdownNow--], rather than moving on to another RM.

But I found that my application (Hive), after throwing `InterruptedIOException` while trying to connect to RM1, continued on to RM2. I want to know how YARN handles `InterruptedIOException`: shouldn't the async thread be interrupted and shut down when Tez's close() triggers the interrupt request?
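The behavior I expected can be sketched in plain Java as follows. This is only an illustration, not YARN's actual retry code: `FailoverSketch`, `RmCall`, and `callWithFailover` are hypothetical names, and the RPC is replaced by a lambda. An ordinary `IOException` moves on to the next RM, while an `InterruptedIOException` restores the interrupt flag and stops the loop.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names come from Hadoop. */
final class FailoverSketch {

    /** Stand-in for one RPC attempt against a named ResourceManager. */
    interface RmCall {
        String invoke(String rmId) throws IOException;
    }

    /**
     * Tries each RM in order and records every RM contacted in `attempted`.
     * An ordinary IOException triggers failover to the next RM; an
     * InterruptedIOException (or an already-set interrupt flag) stops the loop.
     */
    static String callWithFailover(List<String> rmIds, RmCall call, List<String> attempted)
            throws IOException {
        IOException last = null;
        for (String rmId : rmIds) {
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedIOException("interrupted before contacting " + rmId);
            }
            attempted.add(rmId);
            try {
                return call.invoke(rmId);
            } catch (InterruptedIOException e) {
                // Restore the interrupt flag and bail out instead of failing over.
                Thread.currentThread().interrupt();
                throw e;
            } catch (IOException e) {
                last = e; // connection-level failure: move on to the next RM
            }
        }
        throw last != null ? last : new IOException("no ResourceManager configured");
    }

    public static void main(String[] args) {
        List<String> attempted = new ArrayList<>();
        try {
            callWithFailover(Arrays.asList("rm1", "rm2"), rmId -> {
                throw new InterruptedIOException("interrupted while connecting to " + rmId);
            }, attempted);
        } catch (IOException e) {
            System.out.println("stopped after " + attempted + ": " + e.getMessage());
        }
        Thread.interrupted(); // clear the flag that the sketch restored
    }
}
```

In this sketch the interrupted call never reaches rm2, which is the opposite of what I observe in hive.log.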



*Reproduction steps:*
 1. Use an HA cluster that runs YARN version 2.8.5 and is configured with Tez.
 2. Make sure RM1 is not the active RM by checking `yarn rmadmin -getAllServiceState`. If it is, manually [transition RM2 to be the active RM|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html#Admin_commands].
 3. Add the following failover-retry properties to yarn-site.xml:
{quote}<property>
 <name>yarn.client.failover-retries</name>
 <value>4</value>
 </property>
 <property>
 <name>yarn.client.failover-retries-on-socket-timeouts</name>
 <value>4</value>
 </property>
 <property>
 <name>yarn.client.failover-max-attempts</name>
 <value>4</value>
 </property>
{quote}
4. Run a simple application that goes through the YARN client (for example, a simple Hive DDL command):
{quote}hive --hiveconf hive.root.logger=TRACE,console -e "create table tez_test (id int, name string);"
{quote}
5. In the application's log (for example, hive.log), you can see that `RetryInvocationHandler` caught the `InterruptedIOException` while the request was talking to rm1, but the thread did not bail out immediately and instead moved on to rm2.



*More information:*
The interrupted exception is triggered via [TezSessionState#close|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L689] and [Future#cancel|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html#cancel-boolean-].
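For reference, here is a minimal JDK-only sketch of how `Future#cancel(true)` delivers an interrupt to a running task. It does not reproduce Hive's or Tez's code: `CancelSketch` and `cancelInterruptsWorker` are hypothetical names, and `Thread.sleep` stands in for the blocking RPC to the RM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; not taken from Hive, Tez, or YARN. */
final class CancelSketch {

    /** Returns true if the worker observed the interrupt issued by Future#cancel(true). */
    static boolean cancelInterruptsWorker() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        try {
            Future<?> future = pool.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stand-in for a blocking RPC
                } catch (InterruptedException e) {
                    sawInterrupt.set(true);
                    Thread.currentThread().interrupt(); // keep the flag for callers
                } finally {
                    finished.countDown();
                }
            });
            started.await();      // make sure the task is actually running
            future.cancel(true);  // true = interrupt the thread running the task
            finished.await(5, TimeUnit.SECONDS);
            return sawInterrupt.get();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker saw interrupt: " + cancelInterruptsWorker());
    }
}
```

The worker here stops as soon as it sees the interrupt; my question is why the YARN async thread does not do the same.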


