Posted to issues@tez.apache.org by "TezQA (JIRA)" <ji...@apache.org> on 2015/05/02 03:49:05 UTC

[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)

    [ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524449#comment-14524449 ] 

TezQA commented on TEZ-2237:
----------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12729883/TEZ-2237.2.master.txt
  against master revision 9f09027.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 4 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/607//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/607//console

This message is automatically generated.

> Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system disk + 4*1 or 2 TiB HDD for HDFS & local  (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>            Assignee: Siddharth Seth
>            Priority: Critical
>         Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, TEZ-2237.1.master.txt, TEZ-2237.2.branch6.txt, TEZ-2237.2.master.txt, TEZ-2237.test.2_branch0.6.txt, all_stacks.lst, alloc_mem.png, alloc_vcores.png, application_1427324000018_1444.yarn-logs.red.txt.gz, application_1427324000018_1908.red.txt.bz2, application_1427964335235_2070.txt.red.txt.bz2, appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz, appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz, gc_count_MRAppMaster.png, mem_free.png, noopexample_2237.txt, oneOutOfTwoOutputsStarted.txt, ordered-grouped-kv-input-traces.diff, output-starts.txt, start_containers.png, stop_containers.png, syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz, syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), after about an hour of processing, several BufferTooSmallExceptions are raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains "active" indefinitely, tying up memory and CPU resources as far as YARN is concerned, while little if any actual processing takes place.
> Two separate issues seem to be at play:
>   1. BufferTooSmallExceptions are raised even though the keys and values are never bigger than 24 and 1024 bytes respectively, small as the actually allocated buffers seem to be (around a couple of megabytes were allotted, whereas 100 MiB were requested).
>   2. Once BufferTooSmallExceptions are raised, the DAG fails to stop: stop requests appear to be sent 7 hours after the BTSEs are raised, but 9 hours after these stop requests the DAG was still lingering, with all containers present and tying up memory and CPU allocations.
> The BTSEs prevent the Cascade from completing, which in turn prevents validating the results against traditional MR1-based results. Because the DAG never concludes, the cluster queue becomes unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)