You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Ankur Raj (Jira)" <ji...@apache.org> on 2023/08/01 06:15:00 UTC

[jira] [Updated] (TEZ-4507) Task failure due to memory issues in 0.10+

     [ https://issues.apache.org/jira/browse/TEZ-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur Raj updated TEZ-4507:
---------------------------
    Description: 
This issue is seen on clusters which have tez-0.10.2 branch. on running insert overwrite queries on huge amount of data; user in this case was using insert overwrite on 9 TB data. 

*Reduce tasks fail to read shuffle data and Hive query ultimately fails with corrupted memory errors.*

Logs:

 
{code:java}
Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414 Status: Failed Vertex failed, vertexName=Reducer 2, vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 failed, info=[Container container_1686949228586_0003_01_000050 finished with diagnostics set to [Container failed, exitCode=134. [2023-06-16 21:54:04.463]Exception from container-launch. Container id: container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16 21:54:04.468]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:07 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:58 Starting to run new task attempt: attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0
 
{code}
Container logs :

 
{code:java}
[2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 33001 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000099 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr
Last 4096 bytes of stderr :
2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_0
malloc(): memory corruption{code}
 

 
{code:java}
[2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 42026 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000265 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr
Last 4096 bytes of stderr :
2023-06-22 12:51:06 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_2
free(): invalid next size (normal){code}
 

 

 
{code:java}
[2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 35840 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000105 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr
Last 4096 bytes of stderr :
2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000857_0
java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
 
{code}
This seems to point to memory corruption.

A private patch was created by reverting changes in TEZ-4135 and related commits, which fixed this issue. 

This is not reproduced on using around 100 GB of data. We are trying to reproduce this in our test environment, and have been unsuccesful so far. But issue always occurs in customer's env. User was unable to share more logs. I will add more logs once i am able to reproduce it in my environment. Opening this ticket now to get initial ideas from the community.

  was:
This issue is seen on AWS EMR-6.9.0 and above clusters which have tez-0.10.2 branch. on running insert overwrite queries on huge amount of data; user in this case was using insert overwrite on 9 TB data. 

*Reduce tasks fail to read shuffle data and Hive query ultimately fails with corrupted memory errors.*

Logs:

 
{code:java}
Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414 Status: Failed Vertex failed, vertexName=Reducer 2, vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 failed, info=[Container container_1686949228586_0003_01_000050 finished with diagnostics set to [Container failed, exitCode=134. [2023-06-16 21:54:04.463]Exception from container-launch. Container id: container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16 21:54:04.468]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:07 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:58 Starting to run new task attempt: attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0
 
{code}
Container logs :

 
{code:java}
[2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 33001 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000099 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr
Last 4096 bytes of stderr :
2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_0
malloc(): memory corruption{code}
 

 
{code:java}
[2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 42026 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000265 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr
Last 4096 bytes of stderr :
2023-06-22 12:51:06 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_2
free(): invalid next size (normal){code}
 

 

 
{code:java}
[2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 35840 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000105 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr
Last 4096 bytes of stderr :
2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000857_0
java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
 
{code}
This seems to point to memory corruption.

A private patch was created by reverting changes in TEZ-4135 and related commits, which fixed this issue. 

This is not reproduced on using around 100 GB of data. We are trying to reproduce this in our test environment, and have been unsuccesful so far. But issue always occurs in customer's env. User was unable to share more logs. I will add more logs once i am able to reproduce it in my environment. Opening this ticket now to get initial ideas from the community.


> Task failure due to memory issues in 0.10+
> ------------------------------------------
>
>                 Key: TEZ-4507
>                 URL: https://issues.apache.org/jira/browse/TEZ-4507
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.10.1, 0.10.2
>            Reporter: Ankur Raj
>            Priority: Major
>
> This issue is seen on clusters which have tez-0.10.2 branch. on running insert overwrite queries on huge amount of data; user in this case was using insert overwrite on 9 TB data. 
> *Reduce tasks fail to read shuffle data and Hive query ultimately fails with corrupted memory errors.*
> Logs:
>  
> {code:java}
> Map 1: 1392(+495)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1392(+490)/1922 Reducer 2: 0(+0,-67)/78414 Map 1: 1402(+0)/1922 Reducer 2: 0(+0,-67)/78414 Status: Failed Vertex failed, vertexName=Reducer 2, vertexId=vertex_1686949228586_0003_1_01, diagnostics=[Task failed, taskId=task_1686949228586_0003_1_01_002253, diagnostics=[TaskAttempt 0 failed, info=[Container container_1686949228586_0003_01_000050 finished with diagnostics set to [Container failed, exitCode=134. [2023-06-16 21:54:04.463]Exception from container-launch. Container id: container_1686949228586_0003_01_000050 Exit code: 134 [2023-06-16 21:54:04.468]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:07 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:57 Completed running task attempt: attempt_1686949228586_0003_1_00_001424_0 2023-06-16 21:53:58 Starting to run new task attempt: attempt_1686949228586_0003_1_01_002253_0 [2023-06-16 21:54:04.469]Container exited with a non-zero exit code 134. Error file: prelaunch.err. Last 4096 bytes of prelaunch.err : /bin/bash: line 1: 28148 Aborted /usr/lib/jvm/java-1.8.0/bin/java -Xmx14745m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1686949228586_0003/container_1686949228586_0003_01_000050/tmp org.apache.tez.runtime.task.TezChild ip-10-1-14-249.ec2.internal 35871 container_1686949228586_0003_01_000050 application_1686949228586_0003 1 > /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stdout 2> /var/log/hadoop-yarn/containers/application_1686949228586_0003/container_1686949228586_0003_01_000050/stderr Last 4096 bytes of stderr : 2023-06-16 21:03:28 Starting to run new task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:21 Completed running task attempt: attempt_1686949228586_0003_1_00_000062_0 2023-06-16 21:29:22 Starting to run new task attempt: attempt_1686949228586_0003_1_00_001083_0 2023-06-16 21:43:06 Completed running task attempt: attempt_1686949228586_0003_1_00_001083_0
>  
> {code}
> Container logs :
>  
> {code:java}
> [2023-06-22 12:50:56.419]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 33001 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000099/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000099 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000099/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_0
> malloc(): memory corruption{code}
>  
>  
> {code:java}
> [2023-06-22 12:51:07.026]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 42026 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000265/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000265 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000265/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:51:06 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000732_2
> free(): invalid next size (normal){code}
>  
>  
>  
> {code:java}
> [2023-06-22 12:50:56.401]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /bin/bash: line 1: 35840 Aborted                 /usr/lib/jvm/java-1.8.0/bin/java -Xmx16384m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Dtez.root.logger=INFO,CLA -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105 -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1687437498460_0002/container_1687437498460_0002_01_000105/tmp org.apache.tez.runtime.task.TezChild ip-10-1-15-66.ec2.internal 36193 container_1687437498460_0002_01_000105 application_1687437498460_0002 1 > /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stdout 2> /var/log/hadoop-yarn/containers/application_1687437498460_0002/container_1687437498460_0002_01_000105/stderr
> Last 4096 bytes of stderr :
> 2023-06-22 12:50:55 Starting to run new task attempt: attempt_1687437498460_0002_1_01_000857_0
> java: malloc.c:4032: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
>  
> {code}
> This seems to point to memory corruption.
> A private patch was created by reverting changes in TEZ-4135 and related commits, which fixed this issue. 
> This is not reproduced on using around 100 GB of data. We are trying to reproduce this in our test environment, and have been unsuccesful so far. But issue always occurs in customer's env. User was unable to share more logs. I will add more logs once i am able to reproduce it in my environment. Opening this ticket now to get initial ideas from the community.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)