Posted to issues@spark.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2019/11/06 02:30:00 UTC

[jira] [Updated] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames

     [ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Udit Mehrotra updated SPARK-29767:
----------------------------------
    Description: 
Running a union operation on two DataFrames, through both the Scala Spark shell and PySpark, results in executor containers doing a *core dump* and exiting with exit code 134.

The trace from the *Driver*:
{noformat}
Container exited with a non-zero exit code 134
.
19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container from a bad node: container_1572981097605_0021_01_000077 on host: ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1572981097605_0021_01_000077
Exit code: 134
Exception message: /bin/bash: line 1: 12611 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted                 LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderr
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
	at org.apache.hadoop.util.Shell.run(Shell.java:869)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 134{noformat}
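Exit code 134 is consistent with the POSIX shell convention of reporting a process killed by signal N as 128 + N: 134 - 128 = 6, i.e. SIGABRT, which matches the `Aborted` printed by `/bin/bash`. A plausible reading is that the JVM's fatal-error handler caught the SIGSEGV shown in the container's stdout below and then aborted the process. The small helper here (`describe_exit_code` is a hypothetical name, not part of Spark or YARN) decodes such codes; it is a sketch assuming Python 3 on a POSIX system.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code (hypothetical helper).

    POSIX shells report a child killed by signal N as exit status 128 + N.
    """
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            # Not a known signal number on this platform.
            return f"exited with status {code}"
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # 134 is the code from this report; 137 and 143 are common YARN codes.
    for code in (134, 137, 143):
        print(code, describe_exit_code(code))
```

Applied to this report, 134 decodes to SIGABRT rather than SIGKILL (137), which suggests the `-XX:OnOutOfMemoryError=kill -9 %p` hook on the executor command line did not fire; that is an inference from the exit code alone, not a confirmed diagnosis.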
From the *stdout* logs of the exiting container we see:
{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f825e3b0e92, pid=12611, tid=0x00007f822b5fb700
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0xa9ae92]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/hs_err_pid12611.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#{noformat}
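The header above already narrows down the crash: `SIGSEGV (0xb)` is signal 11, and the problematic frame `V  [libjvm.so+0xa9ae92]` is VM-internal code (in HotSpot's frame legend, `V` marks VM code) at offset `0xa9ae92` inside `libjvm.so`. Subtracting that offset from `pc` gives the address at which the library was mapped, which should be page-aligned if the reading is right. The sketch below uses plain regular expressions; `parse_crash_header` is a hypothetical helper, and the patterns assume the JDK 8 header format shown in this report.

```python
import re

# Excerpt of the crash header from the container's stdout.
HEADER = """\
#  SIGSEGV (0xb) at pc=0x00007f825e3b0e92, pid=12611, tid=0x00007f822b5fb700
# Problematic frame:
# V  [libjvm.so+0xa9ae92]
"""

SIGNAL_RE = re.compile(
    r"#\s+(SIG\w+) \((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+), tid=(0x[0-9a-f]+)"
)
# One-letter frame type, then "[library+0xoffset]".
FRAME_RE = re.compile(r"#\s+([A-Za-z])\s+\[([^\]+]+)\+(0x[0-9a-f]+)\]")


def parse_crash_header(text: str) -> dict:
    """Extract signal, pc and problematic frame from an hs_err-style header."""
    sig = SIGNAL_RE.search(text)
    frame = FRAME_RE.search(text)
    if sig is None or frame is None:
        raise ValueError("header does not match the expected JDK 8 format")
    pc = int(sig.group(3), 16)
    offset = int(frame.group(3), 16)
    return {
        "signal": sig.group(1),
        "signo": int(sig.group(2), 16),
        "pc": pc,
        "pid": int(sig.group(4)),
        "frame_type": frame.group(1),
        "library": frame.group(2),
        "offset": offset,
        # Where the library was mapped: pc minus the offset within it.
        "load_base": pc - offset,
    }


info = parse_crash_header(HEADER)
print(info["signal"], info["signo"], info["library"], hex(info["offset"]))
print(hex(info["load_base"]), info["load_base"] % 4096 == 0)
```

For this header the computed base is `0x7f825d916000`, which is page-aligned. If debug symbols for exactly this JDK build (8.0_222-b10) are available, the offset could then be mapped to a function, for example with `addr2line -f -e <path-to-libjvm.so> 0xa9ae92`; the full `hs_err_pid12611.log` named above should also contain the crashing thread's stack and is worth attaching to the issue.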
Also, I am unable to enable *core dumps* even though *ulimit -c* is set to *unlimited*. Could you advise on how to approach this issue, and also on how to obtain the *core dump*?
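The JVM's "Core dumps have been disabled" message reflects the core-file limit (`RLIMIT_CORE`) of the crashing process itself, and resource limits are inherited from whichever process launches a program. Per the trace above, executor JVMs are started by the YARN NodeManager through `/bin/bash`, so a `ulimit -c unlimited` typed in a login shell probably does not reach them; the limit would likely need to be in effect for the NodeManager or its container launch. Even with a nonzero limit, `/proc/sys/kernel/core_pattern` decides where the kernel writes the core, and it may pipe it to a handler rather than the container's working directory. A minimal sketch, assuming a Linux host; `core_dump_settings` and `limit_seen_by_child` are hypothetical helpers:

```python
import resource
import subprocess
import sys
from pathlib import Path


def core_dump_settings() -> dict:
    """Core-dump settings that apply to the *current* process (Linux)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    pattern_file = Path("/proc/sys/kernel/core_pattern")
    pattern = pattern_file.read_text().strip() if pattern_file.exists() else None
    return {"soft": soft, "hard": hard, "core_pattern": pattern}


def limit_seen_by_child(soft_limit: int) -> int:
    """Lower this process's soft RLIMIT_CORE, start a child, and return the
    soft limit the child reports, showing that limits are inherited."""
    old = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (soft_limit, old[1]))
    try:
        out = subprocess.run(
            [sys.executable, "-c",
             "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"],
            capture_output=True, text=True, check=True,
        )
    finally:
        # Restore the original limits for this process.
        resource.setrlimit(resource.RLIMIT_CORE, old)
    return int(out.stdout.strip())


print(core_dump_settings())
print(limit_seen_by_child(0))  # the child sees the lowered limit
```

As an indirect check of what the executors actually get, `core_dump_settings()` could be called from inside a PySpark task, since Python workers are started by the executor JVM and would inherit its limits.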

I will add steps to reproduce the issue.

 

 


> Core dump happening on executors while doing simple union of Data Frames
> ------------------------------------------------------------------------
>
>                 Key: SPARK-29767
>                 URL: https://issues.apache.org/jira/browse/SPARK-29767
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 2.4.4
>         Environment: AWS EMR 5.27.0, Spark 2.4.4
>            Reporter: Udit Mehrotra
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
