Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/09/09 15:57:46 UTC

[jira] [Comment Edited] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown

    [ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736453#comment-14736453 ] 

Jeff Zhang edited comment on TEZ-2724 at 9/9/15 1:57 PM:
---------------------------------------------------------

Steps to reproduce this issue:
* configuration requirements:
** yarn.timeline-service.generic-application-history.enabled=false
** yarn.resourcemanager.recovery.enabled=false
** ipc.client.connect.retry.interval=5000
** ipc.client.connect.max.retries=12 
* Run command: "hadoop jar tez-tests/target/tez-tests-0.8.1-SNAPSHOT.jar mrrsleep -m 5 -r 5 -mt 20000 -rt 10000"
* Kill the AM while the job is running.
* A new app attempt will be started, and the DAG will be recovered and completed. Watch the RM UI until the YARN app has finished, then restart the RM. (Until the app completes, the client keeps trying to reconnect to the AM of the first app attempt.)
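
As a rough illustration of the configuration above, here is a minimal sketch that renders the four settings as Hadoop-style `<property>` entries and computes how long the client would keep retrying under them. The split between yarn-site.xml and core-site.xml, and the helper names `render_site_xml` and `retry_window_seconds`, are assumptions made for this example; they are not part of Tez or Hadoop. The computed window ignores any per-attempt connect timeout, so treat it as an approximation.

```python
from xml.etree import ElementTree as ET

# Settings from the reproduction steps. Which *-site.xml file each one
# lives in is an assumption made for this sketch.
REPRO_SETTINGS = {
    "yarn-site.xml": {
        "yarn.timeline-service.generic-application-history.enabled": "false",
        "yarn.resourcemanager.recovery.enabled": "false",
    },
    "core-site.xml": {
        "ipc.client.connect.retry.interval": "5000",  # milliseconds
        "ipc.client.connect.max.retries": "12",
    },
}


def render_site_xml(properties):
    """Render a name/value mapping as a Hadoop-style <configuration> element."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def retry_window_seconds(properties):
    """Approximate total retry time: retry interval (ms) times max retries."""
    interval_ms = int(properties["ipc.client.connect.retry.interval"])
    max_retries = int(properties["ipc.client.connect.max.retries"])
    return interval_ms * max_retries / 1000


for filename, props in REPRO_SETTINGS.items():
    print(filename)
    print(render_site_xml(props))

print("approximate retry window (s):",
      retry_window_seconds(REPRO_SETTINGS["core-site.xml"]))
```

With these values the client keeps trying the old AM address for about a minute, which should leave enough time to see the app finish in the RM UI and restart the RM.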



was (Author: zjffdu):
Steps to reproduce this issue:
* configuration requirements:
** yarn.timeline-service.generic-application-history.enabled=false
** yarn.resourcemanager.recovery.enabled=false
** ipc.client.connect.retry.interval=5000
** ipc.client.connect.max.retries=12
* Run command: "hadoop jar tez-tests/target/tez-tests-0.8.1-SNAPSHOT.jar mrrsleep -m 5 -r 5 -mt 20000 -rt 10000"
* Kill the AM while the job is running.
* Watch the RM UI until the YARN app has finished, then restart the RM.


> Tez Client keeps on showing old status when application is finished but RM is shutdown
> --------------------------------------------------------------------------------------
>
>                 Key: TEZ-2724
>                 URL: https://issues.apache.org/jira/browse/TEZ-2724
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.4
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt
>
>
> From the logs, it seems the IPC retry interval is set to 20 seconds and the IPC max retries to 45. This means the client will keep retrying the RPC connection for a total of 900 (20*45) seconds. During this period, the application may already have completed and an RM restart may have been triggered, as described in the JIRA description. I also think RM recovery is not enabled, so even after the RM is restarted, the original application info is lost. That means the client can never get the correct application report, which makes it show the old status forever.
> {code}
> 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
> Deleted /user/hadoopqa/Input1
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs  -rm -r -skipTrash /user/hadoopqa/Input2
> 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
> {code}
> Configuration to reproduce this issue
> * disable generic application history (yarn.timeline-service.generic-application-history.enabled)
> * disable rm recovery (yarn.resourcemanager.recovery.enabled)
> * increase the IPC retry interval and max retries (ipc.client.connect.retry.interval & ipc.client.connect.max.retries)
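
The 20-second interval and the 900-second total in the description can be read off the two "Retrying connect" lines in the quoted log. The sketch below shows one way to do that; the regular expression and the helper names `parse_retry_lines` and `estimate_retry_window` are hypothetical and written only for this example. The spacing between log lines can also include connection timeouts, so the result is an estimate rather than the configured value.

```python
import re
from datetime import datetime

# Matches lines such as:
# 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: ... Already tried 26 time(s); maxRetries=45
RETRY_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO ipc\.Client: "
    r"Retrying connect to server: .*?Already tried (?P<tried>\d+) time\(s\); "
    r"maxRetries=(?P<max>\d+)"
)


def parse_retry_lines(log_text):
    """Return (timestamp, tries_so_far, max_retries) for each retry line."""
    entries = []
    for line in log_text.splitlines():
        match = RETRY_LINE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%y/%m/%d %H:%M:%S")
            entries.append((ts, int(match.group("tried")), int(match.group("max"))))
    return entries


def estimate_retry_window(entries):
    """Estimate seconds per retry and the total window across all retries."""
    if len(entries) < 2 or entries[0][1] == entries[-1][1]:
        raise ValueError("need two retry lines with different attempt counts")
    (t0, n0, max_retries), (t1, n1, _) = entries[0], entries[-1]
    interval = (t1 - t0).total_seconds() / (n1 - n0)
    return interval, interval * max_retries


# Lines taken from the log excerpt in the issue description.
SAMPLE_LOG = """
15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
Deleted /user/hadoopqa/Input1
15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
"""

interval_s, window_s = estimate_retry_window(parse_retry_lines(SAMPLE_LOG))
print(f"interval ~ {interval_s:.0f} s, total window ~ {window_s:.0f} s")
```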


