Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2015/07/29 23:10:05 UTC

[jira] [Commented] (TEZ-2552) CRC errors can cause job to run for very long time in large jobs

    [ https://issues.apache.org/jira/browse/TEZ-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646765#comment-14646765 ] 

Bikas Saha commented on TEZ-2552:
---------------------------------

[~rajesh.balamohan] Should we backport this fix to 0.5? Seems localized enough.

> CRC errors can cause job to run for very long time in large jobs
> ----------------------------------------------------------------
>
>                 Key: TEZ-2552
>                 URL: https://issues.apache.org/jira/browse/TEZ-2552
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>             Fix For: 0.8.0
>
>         Attachments: Screen Shot 2015-06-11 at 3.07.47 PM.png, TEZ-2552.1.patch, TEZ-2552.2.patch, TEZ-2552.3.patch, TEZ-2552.wip.patch, with_patch.png, without_patch.png
>
>
> Ran a fairly large job at 10 TB scale which had 1009 reducers.
> One of the machines had a bad disk that the NM did not delist. The machine hosting the NM had disk issues (sdf & sde hold the shuffle data), producing the following exceptions.
> {noformat}
> Info fld=0x8960894
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 96 08 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 144050320
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x895a2b9
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 95 a2 b8 00 00 08 00
> end_request: critical medium error, dev sdf, sector 144024248
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x895a2b9
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 95 a2 b8 00 00 08 00
> end_request: critical medium error, dev sdf, sector 144024248
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x8849edb
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 84 9e d8 00 00 08 00
> end_request: critical medium error, dev sdf, sector 142909144
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x8849edb
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 08 84 9e d8 00 00 08 00
> end_request: critical medium error, dev sdf, sector 142909144
> {noformat}
> In-memory fetches start throwing CRC errors as follows.
> {noformat}
> 2015-06-11 01:01:03,728 INFO [ShuffleAndMergeRunner [Map_11]] orderedgrouped.ShuffleScheduler: PendingHosts=[]
> 2015-06-11 01:01:03,730 INFO [Fetcher [Map_11] #0] http.HttpConnection: for url=http://cn056-10.l42scl.hortonworks.com:13562/mapOutput?job=job_1433813751839_0124&reduce=3&map=attempt_1433813751839_0124_1_04_000446_0_10027&keepAlive=true sent hash and receievd reply 0 ms
> 2015-06-11 01:01:03,730 INFO [Fetcher [Map_11] #0] orderedgrouped.FetcherOrderedGrouped: fetcher#439 about to shuffle output of map InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=446], attemptNumber=0, pathComponent=attempt_1433813751839_0124_1_04_000446_0_10027, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] decomp: 45475 len: 23974 to MEMORY
> 2015-06-11 01:01:07,206 INFO [Fetcher [Map_11] #0] impl.IFileInputStream:  CurrentOffset=2510, offset=2510, off=2502, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
> 2015-06-11 01:01:07,207 INFO [Fetcher [Map_11] #0] impl.IFileInputStream:  CurrentOffset=2510, offset=2510, off=0, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
> 2015-06-11 01:01:07,207 WARN [Fetcher [Map_11] #0] orderedgrouped.FetcherOrderedGrouped: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=446], attemptNumber=0, pathComponent=attempt_1433813751839_0124_1_04_000446_0_10027, fetchTypeInfo=FINAL_MERGE_ENABLED, spillEventId=-1] from cn056-10.l42scl.hortonworks.com:13562
> org.apache.hadoop.fs.ChecksumException: Checksum Error:  CurrentOffset=2510, offset=2510, off=2502, dataLength=23966, origLen=21456, len=21456, length=23970, checksumSize=4
> 	at org.apache.tez.runtime.library.common.sort.impl.IFileInputStream.doRead(IFileInputStream.java:255)
> 	at org.apache.tez.runtime.library.common.sort.impl.IFileInputStream.read(IFileInputStream.java:185)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:127)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
> 	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
> 	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
> 	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.readToMemory(IFile.java:619)
> 	at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToMemory(ShuffleUtils.java:113)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:471)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:267)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:164)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:177)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:52)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> TaskAttemptImpl did not fail the source attempt because of the following code
> {noformat}
> float failureFraction = ((float) attempt.uniquefailedOutputReports.size())
>           / outputFailedEvent.getConsumerTaskNumber();
> {noformat}
> In this case, the reducers ran in waves of 180 slots. So even if all 180 running tasks report the error, the failure fraction would only be 180/1009 ≈ 0.18, which is less than the 0.25 MAX_ALLOWED_OUTPUT_FAILURES_FRACTION threshold, and the job runs forever (the job was killed after 2 hours; it normally completes in a couple of minutes).
> On the fetcher side, the reducer state remains healthy, so it continues to wait.
> Env: Tez master & Hive master
> Ref: Query_88 @ 10 TB scale.
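The failure-fraction check quoted above can be illustrated with a minimal sketch. The constant name MAX_ALLOWED_OUTPUT_FAILURES_FRACTION and the numbers (180 running reducers out of 1009 consumers) come from the report; the helper method `wouldFail` is illustrative only and is not the actual TaskAttemptImpl code.

```java
public class Main {
    // Threshold named in the report; 0.25 per the discussion above.
    static final float MAX_ALLOWED_OUTPUT_FAILURES_FRACTION = 0.25f;

    // Hypothetical helper mirroring the quoted computation: the attempt is
    // failed only when the fraction of consumers reporting bad output
    // exceeds the threshold.
    static boolean wouldFail(int uniqueFailedOutputReports, int consumerTaskNumber) {
        float failureFraction = ((float) uniqueFailedOutputReports) / consumerTaskNumber;
        return failureFraction > MAX_ALLOWED_OUTPUT_FAILURES_FRACTION;
    }

    public static void main(String[] args) {
        // Only 180 reducers run at a time, so at most 180 of the 1009
        // consumers can ever report the bad output: 180/1009 stays under 0.25
        // and the source attempt is never failed.
        System.out.println(wouldFail(180, 1009));
        // A hypothetical larger wave would cross the threshold.
        System.out.println(wouldFail(300, 1009));
    }
}
```

This shows why the job hangs: with the wave size capped below a quarter of the consumer count, the fraction can never reach the threshold no matter how many times the fetch fails.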



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)