Posted to issues@tez.apache.org by "Zhiyuan Yang (JIRA)" <ji...@apache.org> on 2017/03/23 03:29:41 UTC
[jira] [Comment Edited] (TEZ-3616)
TestMergeManager#testLocalDiskMergeMultipleTasks fails intermittently
[ https://issues.apache.org/jira/browse/TEZ-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937112#comment-15937112 ]
Zhiyuan Yang edited comment on TEZ-3616 at 3/23/17 3:29 AM:
------------------------------------------------------------
Thanks [~ferhui] for working on this! As you said, the issue is caused by the merge finishing early. TEZ-2859 tried to fix the same problem, but unfortunately the artificial delay wasn't introduced in the right place.
{code}
tmpDir = new Path(inputContext.getUniqueIdentifier());
try {
  ....
  writer.close();
  additionalBytesWritten.increment(writer.getCompressedLength());
} catch (IOException e) {
  localFS.delete(outputPath, true);
  throw e;
}
final long outputLen = localFS.getFileStatus(outputPath).getLen();
closeOnDiskFile(new FileChunk(outputPath, 0, outputLen));
{code}
The interrupt is supposed to happen while the onDiskMerger thread is inside the try-catch block. Adding more data for the merger can be a workaround, but a more promising fix is to make the thread spend longer inside the try-catch block. Maybe we can introduce the desired delay by using a mock TezCounter for additionalBytesWritten.
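To illustrate the mock-counter idea, here is a rough, self-contained sketch. It does not use Tez or a mocking library: Counter, BlockingCounter, and interruptLandsInsideTryBlock are hypothetical names standing in for TezCounter, its test double, and the test logic, and the "merger" thread only imitates the try block quoted above. The counter signals that the thread has reached increment() and then stalls, so the test can interrupt it at a known point instead of racing against the merge.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class DelayedCounterSketch {

  /** Hypothetical stand-in for the one counter method the merger calls. */
  interface Counter {
    void increment(long delta);
  }

  /** Test double: announces that increment() was reached, then stalls. */
  static final class BlockingCounter implements Counter {
    final CountDownLatch entered = new CountDownLatch(1);
    volatile boolean interrupted = false;
    private final long stallMillis;

    BlockingCounter(long stallMillis) {
      this.stallMillis = stallMillis;
    }

    @Override
    public void increment(long delta) {
      entered.countDown(); // the caller is now inside its try block
      try {
        Thread.sleep(stallMillis); // artificial delay that keeps it there
      } catch (InterruptedException e) {
        interrupted = true; // the interrupt arrived during the stall
        Thread.currentThread().interrupt(); // restore the interrupt flag
      }
    }
  }

  /** Returns true if the interrupt landed while the thread was in increment(). */
  static boolean interruptLandsInsideTryBlock() {
    BlockingCounter counter = new BlockingCounter(10_000);
    Thread merger = new Thread(() -> {
      try {
        // The merge work and writer.close() would run here in the real code.
        counter.increment(42L);
      } finally {
        // The real code deletes the partial output on IOException; omitted here.
      }
    }, "onDiskMerger-sketch");
    merger.start();
    try {
      // Wait until the thread is provably inside the try block.
      if (!counter.entered.await(5, TimeUnit.SECONDS)) {
        return false;
      }
      merger.interrupt();
      merger.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return counter.interrupted && !merger.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("interrupted inside try block: " + interruptLandsInsideTryBlock());
  }
}
```

In the actual test, the same behaviour would presumably be stubbed onto a mocked TezCounter and the latch used to time the interrupt. That wiring depends on how the test constructs its MergeManager and is not shown here.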
was (Author: aplusplus):
Thanks [~ferhui] for working on this! As you said, the issue is caused by early finished merge. TEZ-3859 tried to fix the same problem, but unfortunately the artificial delay wasn't introduced at the right place.
{code}
tmpDir = new Path(inputContext.getUniqueIdentifier());
try {
....
writer.close();
additionalBytesWritten.increment(writer.getCompressedLength());
} catch (IOException e) {
localFS.delete(outputPath, true);
throw e;
}
final long outputLen = localFS.getFileStatus(outputPath).getLen();
closeOnDiskFile(new FileChunk(outputPath, 0, outputLen));
{code}
The interrupt is supposed to happen when onDiskMerger thread is inside the try-catch block. Adding more data for merger can be a workaround, but a more promising fix is to prolong the try-catch. Maybe we can introduce the desired delay by using mock TezCounter for additionalBytesWritten.
> TestMergeManager#testLocalDiskMergeMultipleTasks fails intermittently
> ----------------------------------------------------------------------
>
> Key: TEZ-3616
> URL: https://issues.apache.org/jira/browse/TEZ-3616
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.0
> Environment: Ubuntu 14.04
> Reporter: Sonia Garudi
> Assignee: Fei Hui
> Labels: ppc64le, x86
> Attachments: TEZ-3616.001.patch
>
>
> In the tez-runtime-library project, the TestMergeManager#testLocalDiskMergeMultipleTasks test fails intermittently with the following error:
> testLocalDiskMergeMultipleTasks(org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager) Time elapsed: 1.395 sec <<< FAILURE!
> java.lang.AssertionError: Values should be different. Actual: 1
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failEquals(Assert.java:185)
> at org.junit.Assert.assertNotEquals(Assert.java:161)
> at org.junit.Assert.assertNotEquals(Assert.java:198)
> at org.junit.Assert.assertNotEquals(Assert.java:209)
> at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager.testLocalDiskMergeMultipleTasks(TestMergeManager.java:878)
> at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.TestMergeManager.testLocalDiskMergeMultipleTasks(TestMergeManager.java:628)
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)