Posted to dev@hive.apache.org by "Michael Smith (Jira)" <ji...@apache.org> on 2022/06/21 17:36:00 UTC

[jira] [Created] (HIVE-26346) Default Tez memory limits occasionally result in killing container

Michael Smith created HIVE-26346:
------------------------------------

             Summary: Default Tez memory limits occasionally result in killing container
                 Key: HIVE-26346
                 URL: https://issues.apache.org/jira/browse/HIVE-26346
             Project: Hive
          Issue Type: Improvement
          Components: Tez
    Affects Versions: 3.1.3
            Reporter: Michael Smith


When inserting data into Hive, the insert occasionally fails with messages like
{quote}
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1605060173780_0039_2_00, diagnostics=[Task failed, taskId=task_1605060173780_0039_2_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Container container_1605060173780_0039_01_000002 finished with diagnostics set to [Container failed, exitCode=-104. [2020-11-11 02:35:11.768]Container [pid=16810,containerID=container_1605060173780_0039_01_000002] is running 7729152B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
{quote}

Specifically, the TezChild container is using a small amount of physical memory beyond its limit (about 7.4 MB in the log above), so YARN kills the container.
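For context, a hedged reading of where those two limits come from, assuming stock YARN defaults (the property names below are standard YARN/Hive ones; nothing here is taken from the job beyond the numbers quoted above):
{code}
-- Rough reconstruction of the limits in the error above, assuming stock YARN defaults:
--   1 GB physical limit  = the container size YARN granted to the Tez task
--                          (yarn.scheduler.minimum-allocation-mb defaults to 1024 MB)
--   2.1 GB virtual limit = 1 GB * yarn.nodemanager.vmem-pmem-ratio (YARN default: 2.1)
-- In Beeline / the Hive CLI, SET with a key and no value prints the current setting:
SET hive.tez.container.size;
{code}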

Identifying how to resolve this is somewhat fraught:
- There's no clear troubleshooting advice for this error in our docs. Googling led to several forums with a mix of good and awful advice; https://community.cloudera.com/t5/Community-Articles/Demystify-Apache-Tez-Memory-Tuning-Step-by-Step/ta-p/245279 is probably the best resource.
- The issue itself comes down to Tez allocating 80% of the container's memory limit to the Java heap (Xmx), which, depending on other memory usage (thread stacks, JIT, other JVM overhead), can leave too little headroom for non-heap memory (a rough sketch of the arithmetic follows this list). By comparison, when running in a cgroup, Java defaults Xmx to 25% of the memory limit.
- Identifying the right parameters to tune, and verifying they had been set correctly, was a bit challenging. We ended up playing with {{tez.container.max.java.heap.fraction}}, {{hive.tez.container.size}}, and {{yarn.scheduler.minimum-allocation-mb}}, and then verified they took effect by monitoring the TezChild process arguments (with {{htop}}) for changes in Xmx. We definitely had some missteps figuring out when a property is prefixed with {{hive.tez.container}} versus {{tez.container}}.
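As a rough sketch rather than an authoritative recipe, the heap arithmetic and the pre-change checks from a Beeline/Hive CLI session look like this (the resulting Xmx still has to be confirmed on the TezChild process itself, e.g. with {{htop}} or {{ps}}):
{code}
-- Heap arithmetic, assuming a 1024 MB container and the 0.8 default heap fraction:
--   Xmx ~= 1024 MB * 0.8 = ~819 MB, leaving only ~205 MB for stacks, JIT, metaspace, etc.
-- Print the current values before changing anything (SET <key>; prints, SET <key>=<value>; sets):
SET hive.tez.container.size;               -- Hive-side container size in MB (-1 typically falls back to the MapReduce map memory setting)
SET tez.container.max.java.heap.fraction;  -- fraction of the container handed to Xmx (Tez default: 0.8)
SET yarn.scheduler.minimum-allocation-mb;  -- smallest container YARN will allocate (cluster-side setting)
{code}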

In the end, any one of the following seemed to work for us:
* {{SET yarn.scheduler.minimum-allocation-mb=2048}}
* {{SET tez.container.max.java.heap.fraction=0.75}}
* {{SET hive.tez.container.size=2048}}
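
For completeness, a minimal session-level sketch of applying one of these before re-running the statement; the table and query below are placeholders rather than the actual job:
{code}
-- Minimal sketch: any one of the three settings above was enough in our case.
SET hive.tez.container.size=2048;   -- request 2 GB containers for this session's Tez tasks
-- Placeholder statement standing in for the insert that previously hit the memory limit:
INSERT INTO my_table SELECT * FROM staging_data;
{code}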




--
This message was sent by Atlassian Jira
(v8.20.7#820007)