You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2023/02/28 08:38:00 UTC

[jira] [Resolved] (TEZ-4460) Read timed out in shuffle handler - incorrect usage of EMPTY_LAST_CONTENT and channel write

     [ https://issues.apache.org/jira/browse/TEZ-4460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor resolved TEZ-4460.
-------------------------------
    Resolution: Fixed

> Read timed out in shuffle handler - incorrect usage of EMPTY_LAST_CONTENT and channel write
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4460
>                 URL: https://issues.apache.org/jira/browse/TEZ-4460
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.2
>            Reporter: Shailesh Gupta
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.3
>
>         Attachments: nm_logs_6_45-6_53.zip, syslog_attempt_1669954729098_0007_1_01_000003_0.gz, syslog_attempt_1669954729098_0007_1_01_000005_0.gz
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> We are seeing issue in tez shuffle handle in tez-0.10.2. Would need help in debugging it.
> We are running TPC-DS 1TB scale queries (q7) on EC2 with 2 worker nodes of M5.8xl. We are migrating tez 0.9.2 to 0.10.2. Things are working fine on tez 0.9.2 but in tez 0.10.2 , we are seeing these logs while using tez shuffle handler.
> {{2022-11-30 13:19:26627 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] |orderedgrouped.FetcherOrderedGrouped|: }}
> {{345 2022-11-30 13:19:26628 [INFO] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] |orderedgrouped.FetcherOrderedGrouped|: Read timed out}}
> {{java.net.SocketInputStream.socketRead0(Native Method) }}
> {{java.net.SocketInputStream.socketRead(SocketInputStream.java:116) }}
> {{java.net.SocketInputStream.read(SocketInputStream.java:171)}}
> {{java.net.SocketInputStream.read(SocketInputStream.java:141)}}
> {{java.io.BufferedInputStream.read1(BufferedInputStream.java:284)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:345)}}
> {{java.io.FilterInputStream.read(FilterInputStream.java:133)}}
> {{sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)}}
> {{java.io.BufferedInputStream.fill(BufferedInputStream.java:246)}}
> {{java.io.BufferedInputStream.read(BufferedInputStream.java:265)}}
> {{java.io.DataInputStream.readByte(DataInputStream.java:265)}}
> {{org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:310)}}
> {{org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:331)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:447)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:286)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:186)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:198)}}
> {{org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:61)}}
> {{org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)}}
> {{com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)}}
> {{com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)}}
> {{java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)}}
> {{java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)}}
> {{java.lang.Thread.run(Thread.java:750)]}}
> {{2022-11-30 13:19:26628 [WARN] [Fetcher_O \{Map_4 -> Reducer_5} #5|#5] |orderedgrouped.FetcherOrderedGrouped|: Shuffle output from XXXX:13563 failed retry it.}}
>  
> In the node manager logs, i can see and other logs
>  
> {{2022-12-01 13:52:35,198 DEBUG auxservices.ShuffleHandler: Fetcher request verfied. enc_str=13563/mapOutput?job=job_1669871261925_0011&dag=1&reduce=8-11&attempt_1669871261925_0011_1_02_000016_1_10005....}}
>  
> There could be some issue in closing the connections while upgrading netty version.
> The case is reproducible every time, let me know if you need more logs or want me to check something.
>  
>  
> cc: [~abstractdog] 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)