Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/06/24 05:41:43 UTC

[jira] [Comment Edited] (TEZ-2378) In case Fetcher (unordered) fails to do local fetch, log in debug mode to reduce log size

    [ https://issues.apache.org/jira/browse/TEZ-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598819#comment-14598819 ] 

Rajesh Balamohan edited comment on TEZ-2378 at 6/24/15 3:41 AM:
----------------------------------------------------------------

- For broadcast, the fetcher tries to download the data and cache it locally when shared-fetch is enabled. Subsequent tasks scheduled on the same node can then read from local disk instead of downloading from the remote machine. For example, when an InputHost has 4 srcAttempts, it is quite possible that a couple of them have already been downloaded to local disk and the rest have not. Those not yet downloaded are scheduled via doHttpFetch. Just before this step, the fetcher tries to optimize by reading whatever is available on local disk, and it logs a disk exception for every attempt that is not available. It falls back to HTTP fetch when data is not available locally. In large jobs these expected misses inflate the logs and get in the way of debugging.
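
A rough Java sketch of the flow described above, to make the proposed behaviour concrete. This is not the actual Tez Fetcher code: the class LocalFetchSketch, its method names, the in-memory set standing in for the local disk cache, and the miss counter are all hypothetical. It only assumes that a local miss is an expected event that should be logged at debug level and followed by an HTTP fetch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration only; not org.apache.tez...shuffle.Fetcher. */
final class LocalFetchSketch {
    private static final Logger LOG =
        Logger.getLogger(LocalFetchSketch.class.getName());

    // Hypothetical counter: number of attempts not found on local disk.
    private final AtomicLong localFetchMisses = new AtomicLong();

    // Stand-in for the local disk cache: attempts already downloaded on this node.
    private final Set<String> locallyCached;

    LocalFetchSketch(Set<String> locallyCached) {
        this.locallyCached = locallyCached;
    }

    /**
     * Tries local disk first for each source attempt and returns the
     * attempts that still have to be fetched over HTTP.
     */
    List<String> fetchLocalThenRemaining(List<String> srcAttempts) {
        List<String> pendingHttp = new ArrayList<>();
        for (String attempt : srcAttempts) {
            if (locallyCached.contains(attempt)) {
                continue; // would be read from local disk here (elided)
            }
            localFetchMisses.incrementAndGet();
            // An expected miss: log at debug (FINE) level, without a stack trace,
            // instead of WARN.
            if (LOG.isLoggable(Level.FINE)) {
                LOG.fine("Not available locally, falling back to HTTP fetch: " + attempt);
            }
            pendingHttp.add(attempt);
        }
        return pendingHttp;
    }

    long getLocalFetchMisses() {
        return localFetchMisses.get();
    }
}
```

With 4 srcAttempts of which 2 are already cached, the method returns the other 2 for HTTP fetch and the counter reads 2, while nothing is written to the log at the default level.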

[~sseth], [~gopalv], [~hitesh] - Please review when you find time. 




> In case Fetcher (unordered) fails to do local fetch, log in debug mode to reduce log size
> -----------------------------------------------------------------------------------------
>
>                 Key: TEZ-2378
>                 URL: https://issues.apache.org/jira/browse/TEZ-2378
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: TEZ-2378.1.patch
>
>
> The following can be logged at DEBUG level instead of WARN. Counters could be added later to track the number of times local fetch failed.
> {noformat}
> 2015-04-28 05:41:45,487 WARN [Fetcher [Map_5] #15] shuffle.Fetcher: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=81], attemptNumber=0, pathComponent=attempt_1429683757595_0485_1_03_000081_0_10003, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] from cn047-10.l42scl.hortonworks.com(local fetch)
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find output/attempt_1429683757595_0485_1_03_000081_0_10003/file.out.index in any of the configured local directories
>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:449)
>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:164)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.getShuffleInputFileName(Fetcher.java:612)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.getTezIndexRecord(Fetcher.java:592)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doLocalDiskFetch(Fetcher.java:537)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.doSharedFetch(Fetcher.java:353)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:192)
>         at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:72)
>         at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}


