Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/02/10 14:04:00 UTC
[jira] [Comment Edited] (TEZ-4123) TestMRRJobsDAGApi flaky timeout

    [ https://issues.apache.org/jira/browse/TEZ-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033585#comment-17033585 ] 

László Bodor edited comment on TEZ-4123 at 2/10/20 2:03 PM:
------------------------------------------------------------

I cannot see that TEZ-3664 contains any fix for the flakiness of this test class.

UPDATE: it turned out that the YARN disk health check marked the node as unhealthy. From the logs:
{code}
2020-02-10 14:36:15,160 INFO  [RM Event dispatcher] rmnode.RMNodeImpl (RMNodeImpl.java:transition(1209)) - Node abstractdog-440s:42154 reported UNHEALTHY with details: 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /home/abstractdog/apache/tez/tez-tests/target/org.apache.tez.mapreduce.TestMRRJobsDAGApi/org.apache.tez.mapreduce.TestMRRJobsDAGApi-localDir-nm-0_0 : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /home/abstractdog/apache/tez/tez-tests/target/org.apache.tez.mapreduce.TestMRRJobsDAGApi/org.apache.tez.mapreduce.TestMRRJobsDAGApi-logDir-nm-0_0 : used space above threshold of 90.0% ] 
{code}

Raising the utilization threshold seems to have solved the issue (my disk is at 93%):
{code}
conf.setInt("yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage", 99);
{code}
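
For illustration, here is a minimal sketch of why a disk at 93% fails a 90% cutoff but passes a 99% one. It is a simplified model of a per-disk utilization check and not the actual YARN implementation; the class name DiskThresholdSketch and its methods are hypothetical, and it uses only java.io.File.

```java
import java.io.File;

// Simplified model of a per-disk utilization check; NOT the actual YARN code.
// The class and method names are hypothetical and used for illustration only.
class DiskThresholdSketch {

    // Percentage of the disk in use, derived from total and usable bytes.
    static float usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 100.0f; // treat an unreadable or empty volume as full
        }
        float freePercent = 100.0f * (usableBytes / (float) totalBytes);
        return 100.0f - freePercent;
    }

    // A directory counts as healthy while usage stays at or below the cutoff
    // and the disk is not completely full.
    static boolean isHealthy(float usedPercent, float cutoffPercent) {
        return usedPercent <= cutoffPercent && usedPercent < 100.0f;
    }

    public static void main(String[] args) {
        // Directory to inspect; defaults to the current working directory.
        File dir = new File(args.length > 0 ? args[0] : ".");
        float used = usedPercent(dir.getTotalSpace(), dir.getUsableSpace());
        System.out.printf("used: %.1f%%%n", used);
        System.out.println("healthy at 90%: " + isHealthy(used, 90.0f));
        System.out.println("healthy at 99%: " + isHealthy(used, 99.0f));
    }
}
```

Running it against the test's local-dir or log-dir path shows whether the current machine would trip either cutoff before the mini cluster is even started.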

[~jeagles]: what do you think about setting this property to 99% in every case where MiniTezCluster is used in tests? I think the current 90% can introduce random flakiness into tests where the environment itself is healthy (and suitable for running simple YARN-based Tez unit tests), but YARN silently disables the test node. We used to see similar issues in Hive preCommit tests as well.



was (Author: abstractdog):
I cannot see that TEZ-3664 contains any fix regarding this test class flakiness

> TestMRRJobsDAGApi flaky timeout
> -------------------------------
>
>                 Key: TEZ-4123
>                 URL: https://issues.apache.org/jira/browse/TEZ-4123
>             Project: Apache Tez
>          Issue Type: Test
>            Reporter: László Bodor
>            Priority: Major
>         Attachments: TestMRRJobsDAGApi.out, org.apache.tez.mapreduce.TestMRRJobsDAGApi-output.txt
>
>
> Failed in both precommit and on master locally:
> {code}
> mvn clean install -pl ./tez-tests -Dtest=TestMRRJobsDAGApi
> {code}
> surefire process thread dump:  [^TestMRRJobsDAGApi.out] 
> test output:  [^org.apache.tez.mapreduce.TestMRRJobsDAGApi-output.txt] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)