Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/09/10 13:45:28 UTC

[jira] [Updated] (TEZ-1543) Shuffle Errors on heavy load

     [ https://issues.apache.org/jira/browse/TEZ-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-1543:
----------------------------------
    Attachment: TEZ-1543.1.patch
                with_patch.svg
                syn_app_with_issue.svg

- Under heavy load, connection.connect() throws a SocketTimeoutException. However, HttpConnection.connect() returned false instead of propagating the failure as an IOException, because the cleanup flag was not checked correctly.
- Uploading the runtime SVGs from before and after the fix.
- The "syn_app_with_issue.svg" job had many reducer retries because of this, and the job eventually failed.
- The "with_patch.svg" job has no task retries and completes successfully.
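
The failure mode described above can be illustrated with a small sketch. This is not the code from TEZ-1543.1.patch; ConnectSketch, Connector, connectBuggy, connectFixed and shutdown are hypothetical names chosen for illustration, and the exact shape of the original check is an assumption. The idea shown is only that a failed connect should return false when the fetcher is being cleaned up, and should otherwise surface the IOException to the caller.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch; names and structure are assumptions, not Tez code.
final class ConnectSketch {

    /** Stand-in for the underlying connection attempt. */
    interface Connector {
        void connect() throws IOException;
    }

    private final Connector connector;
    // True only once shutdown() has been called (assumed meaning of the "cleanup" flag).
    private volatile boolean cleanup;

    ConnectSketch(Connector connector) {
        this.connector = connector;
    }

    void shutdown() {
        cleanup = true;
    }

    /** Assumed buggy shape: every failure is reported as "false", cleanup or not. */
    boolean connectBuggy() {
        try {
            connector.connect();
            return true;
        } catch (IOException e) {
            return false; // the timeout is silently swallowed
        }
    }

    /** Assumed fixed shape: swallow the failure only while cleaning up. */
    boolean connectFixed() throws IOException {
        try {
            connector.connect();
            return true;
        } catch (IOException e) {
            if (cleanup) {
                return false; // shutting down, the failure is expected
            }
            throw e; // a real failure must reach the caller
        }
    }

    public static void main(String[] args) {
        // Simulate the heavy-load case: every connect attempt times out.
        Connector timingOut = () -> {
            throw new SocketTimeoutException("connect timed out");
        };
        ConnectSketch sketch = new ConnectSketch(timingOut);

        System.out.println("buggy connect returned: " + sketch.connectBuggy());
        try {
            sketch.connectFixed();
        } catch (IOException e) {
            System.out.println("fixed connect threw: " + e.getClass().getSimpleName());
        }
    }
}
```

A caller that treats a false return as "stop quietly" never sees the timeout in the buggy variant. Whether that is what leads to the NullPointerException in the quoted trace (for example, by reading a header from a stream that was never opened) is an assumption here, not something the update states.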


> Shuffle Errors on heavy load
> ----------------------------
>
>                 Key: TEZ-1543
>                 URL: https://issues.apache.org/jira/browse/TEZ-1543
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1543.1.patch, syn_app_with_issue.svg, with_patch.svg
>
>
> org.apache.tez.runtime.library.common.shuffle.impl.Shuffle: ShuffleRunner failed with error
> org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$ShuffleError: error in shuffle in fetcher [initialmap] #13
>         at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:336)
>         at org.apache.tez.runtime.library.common.shuffle.impl.Shuffle$RunShuffleCallable.call(Shuffle.java:318)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>         at org.apache.hadoop.io.WritableUtils.readStringSafely(WritableUtils.java:475)
>         at org.apache.tez.runtime.library.common.shuffle.impl.ShuffleHeader.readFields(ShuffleHeader.java:82)
>         at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyMapOutput(Fetcher.java:350)
>         at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.copyFromHost(Fetcher.java:294)
>         at org.apache.tez.runtime.library.common.shuffle.impl.Fetcher.run(Fetcher.java:160)


