You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/20 05:50:58 UTC
[jira] [Resolved] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

     [ https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-18458.
---------------------------------
       Resolution: Fixed
         Assignee: Kazuaki Ishizaki
    Fix Version/s: 2.1.0

> core dumped running Spark SQL on large data volume (100TB)
> ----------------------------------------------------------
>
>                 Key: SPARK-18458
>                 URL: https://issues.apache.org/jira/browse/SPARK-18458
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: JESSE CHEN
>            Assignee: Kazuaki Ishizaki
>            Priority: Critical
>              Labels: core, dump
>             Fix For: 2.1.0
>
>
> Running a query on 100TB parquet database using the Spark master dated 11/04 dump cores on Spark executors.
> The query is TPCDS query 82 (though this query is not the only one can produce this core dump, just the easiest one to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e68_1478924651089_0018_01_000074 on host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from container-launch.
> Container id: container_e68_1478924651089_0018_01_000074
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 Aborted                 (core dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server -Xmx24576m -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 --hostname mer05x.svl.ibm.com --cores 2 --app-id application_1478924651089_0018 --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/__app__.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.databricks_spark-csv_2.10-1.3.0.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/org.apache.commons_commons-csv-1.1.jar --user-class-path file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/com.univocity_univocity-parsers-1.5.1.jar > /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stdout 2> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_000074/stderr
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
>         at org.apache.hadoop.util.Shell.run(Shell.java:456)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 134
> {noformat}
> According to the source code, exit code 134 is 128+6, and 6 is        SIGABRT       6       Core    Abort signal from abort(3). The external signal killed executors. 
> On the YARN side, the log is more clear:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007fffe29e6bac, pid=3694385, tid=140735430203136
> #
> # JRE version: OpenJDK Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
> # Java VM: OpenJDK 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # J 10342% C2 org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V (386 bytes) @ 0x00007fffe29e6bac [0x00007fffe29e43c0+0x27ec]
> #
> # Core dump written. Default location: /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/core or core.3694385
> #
> # An error report file with more information is saved as:
> # /data2/hadoop/yarn/local/usercache/spark/appcache/application_1479156026828_0006/container_e69_1479156026828_0006_01_000825/hs_err_pid3694385.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {noformat}
> And the  hs_err_pid3694385.log shows the stack:
> {noformat}
> Stack: [0x00007fff85432000,0x00007fff85533000],  sp=0x00007fff85530ce0,  free space=1019k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 3896 C1 org.apache.spark.unsafe.Platform.putLong(Ljava/lang/Object;JJ)V (10 bytes) @ 0x00007fffe1d3cdec [0x00007fffe1d3cde0+0xc]
> j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArrayAtByte(Lorg/apache/spark/unsafe/array/LongArray;I[JIIIZZ)V+138
> j  org.apache.spark.util.collection.unsafe.sort.RadixSort.sortKeyPrefixArray(Lorg/apache/spark/unsafe/array/LongArray;IIIIZZ)I+209
> j  org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+56
> j  org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator()Lorg/apache/spark/util/collection/unsafe/sort/UnsafeSorterIterator;+62
> j  org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort()Lscala/collection/Iterator;+4
> j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+24
> j  org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z+11
> j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext()Z+4
> j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;Lscala/collection/Iterator;Lscala/collection/Iterator;)Z+147
> j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V+552
> j  org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext()V+17
> J 3849 C1 org.apache.spark.sql.execution.BufferedRowIterator.hasNext()Z (30 bytes) @ 0x00007fffe1d5679c [0x00007fffe1d56520+0x27c]
> j  org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext()Z+4
> j  scala.collection.Iterator$$anon$11.hasNext()Z+4
> j  org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V+3
> j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+222
> j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
> j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+152
> j  org.apache.spark.executor.Executor$TaskRunner.run()V+423
> j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> V  [libjvm.so+0x63d6ba]
> V  [libjvm.so+0x63ab74]
> V  [libjvm.so+0x63b189]
> V  [libjvm.so+0x67e6a1]
> V  [libjvm.so+0x9b3f5a]
> V  [libjvm.so+0x869722]
> C  [libpthread.so.0+0x7dc5]  start_thread+0xc5
> {noformat}
> This is not easily reproducible on smaller data volumes, e.g., 1TB or 10TB, but easily reproducible on 100TB+...so look into data types that may not be big enough to handle hundreds of billion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org