Posted to issues@tez.apache.org by "Zhiyuan Yang (JIRA)" <ji...@apache.org> on 2017/03/23 03:29:41 UTC

[jira] [Comment Edited] (TEZ-3616) TestMergeManager#testLocalDiskMergeMultipleTasks fails intermittently

    [ https://issues.apache.org/jira/browse/TEZ-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937112#comment-15937112 ] 

Zhiyuan Yang edited comment on TEZ-3616 at 3/23/17 3:29 AM:
------------------------------------------------------------

Thanks [~ferhui] for working on this! As you said, the issue is caused by the merge finishing early. TEZ-2859 tried to fix the same problem, but unfortunately the artificial delay wasn't introduced in the right place.
{code}
      tmpDir = new Path(inputContext.getUniqueIdentifier());
      try {
        ....
        writer.close();
        additionalBytesWritten.increment(writer.getCompressedLength());
      } catch (IOException e) {
        localFS.delete(outputPath, true);
        throw e;
      }

      final long outputLen = localFS.getFileStatus(outputPath).getLen();
      closeOnDiskFile(new FileChunk(outputPath, 0, outputLen));
{code}

The interrupt is supposed to happen while the onDiskMerger thread is inside the try-catch block. Adding more data for the merger to process can be a workaround, but a more promising fix is to prolong the time spent inside the try-catch. Maybe we can introduce the desired delay by using a mock TezCounter for additionalBytesWritten.
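
A rough, self-contained sketch of the mock-counter idea follows. It is not the actual patch: the Counter interface is a hypothetical stand-in for TezCounter, BlockingCounter and interruptLandsInsideTry are made-up names, and the lambda body only imitates the try block in the excerpt above. The counter's increment call blocks until the thread is interrupted, so the interrupt is guaranteed to arrive while the merger-like thread is still inside the try block. In the real test, the same effect could presumably be achieved by stubbing the counter's increment method with a mocking library.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class DelayedCounterSketch {

  /** Hypothetical stand-in for TezCounter; only the one method used here. */
  interface Counter {
    void increment(long delta);
  }

  /** Signals that increment() was entered, then blocks until interrupted. */
  static class BlockingCounter implements Counter {
    final CountDownLatch entered = new CountDownLatch(1);
    final AtomicBoolean interruptedInside = new AtomicBoolean(false);

    @Override
    public void increment(long delta) {
      entered.countDown();
      try {
        // Artificial delay: keeps the caller inside its try block.
        Thread.sleep(10_000);
      } catch (InterruptedException e) {
        interruptedInside.set(true);
        // Restore the flag so the caller can still observe the interrupt.
        Thread.currentThread().interrupt();
      }
    }
  }

  /** Returns true if the interrupt arrived while the worker was inside the try block. */
  static boolean interruptLandsInsideTry() {
    BlockingCounter counter = new BlockingCounter();
    Thread merger = new Thread(() -> {
      try {
        // Stand-in for writer.close() followed by the counter update.
        counter.increment(42L);
      } catch (RuntimeException e) {
        // Stand-in for the cleanup done in the real catch block.
      }
    }, "onDiskMerger-sketch");

    merger.start();
    try {
      counter.entered.await(); // the worker is now inside the try block
      merger.interrupt();
      merger.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("test thread interrupted", e);
    }
    return counter.interruptedInside.get();
  }

  public static void main(String[] args) {
    System.out.println("interrupted inside try: " + interruptLandsInsideTry());
  }
}
```

The latch removes the timing guesswork: the test thread interrupts only after the worker has entered increment(), so the outcome does not depend on how much data is merged or how fast the machine is.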


was (Author: aplusplus):
Thanks [~ferhui] for working on this! As you said, the issue is caused by early finished merge. TEZ-3859 tried to fix the same problem, but unfortunately the artificial delay wasn't introduced at the right place. 
{code}
     tmpDir = new Path(inputContext.getUniqueIdentifier());
      try {
        ....
        writer.close();
        additionalBytesWritten.increment(writer.getCompressedLength());
      } catch (IOException e) {
        localFS.delete(outputPath, true);
        throw e;
      }

      final long outputLen = localFS.getFileStatus(outputPath).getLen();
      closeOnDiskFile(new FileChunk(outputPath, 0, outputLen));
{code}

The interrupt is supposed to happen when onDiskMerger thread is inside the try-catch block. Adding more data for merger can be a workaround, but a more promising fix is to prolong the try-catch. Maybe we can introduce the desired delay by using mock TezCounter for additionalBytesWritten.

> TestMergeManager#testLocalDiskMergeMultipleTasks fails intermittently 
> ----------------------------------------------------------------------
>
>                 Key: TEZ-3616
>                 URL: https://issues.apache.org/jira/browse/TEZ-3616
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: Ubuntu 14.04 
>            Reporter: Sonia Garudi
>            Assignee: Fei Hui
>              Labels: ppc64le, x86
>         Attachments: TEZ-3616.001.patch
>
>
> In tez-runtime-library project, the TestMergeManager#testLocalDiskMergeMultipleTasks test fails intermittently with the following error:
> testLocalDiskMergeMultipleTasks(org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager)  Time elapsed: 1.395 sec  <<< FAILURE!
> java.lang.AssertionError: Values should be different. Actual: 1
>         at org.junit.Assert.fail(Assert.java:88)
>         at org.junit.Assert.failEquals(Assert.java:185)
>         at org.junit.Assert.assertNotEquals(Assert.java:161)
>         at org.junit.Assert.assertNotEquals(Assert.java:198)
>         at org.junit.Assert.assertNotEquals(Assert.java:209)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager.testLocalDiskMergeMultipleTasks(TestMergeManager.java:878)
>         at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager.testLocalDiskMergeMultipleTasks(TestMergeManager.java:628)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)