Posted to issues@spark.apache.org by "Udit Mehrotra (Jira)" <ji...@apache.org> on 2019/11/07 03:42:03 UTC
[jira] [Comment Edited] (SPARK-29767) Core dump happening on executors while doing simple union of Data Frames
[ https://issues.apache.org/jira/browse/SPARK-29767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968895#comment-16968895 ]
Udit Mehrotra edited comment on SPARK-29767 at 11/7/19 3:41 AM:
----------------------------------------------------------------
[~hyukjin.kwon] I was finally able to get the core dump of the crashing executors. Attached *hs_err_pid13885.log*, the error report written alongside the core dump.
In it I notice the following trace:
{noformat}
RAX=
[error occurred during error reporting (printing register info), id 0xb]Stack: [0x00007fbe8850f000,0x00007fbe88610000], sp=0x00007fbe8860dad0, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xa9ae92]
J 4331 sun.misc.Unsafe.getLong(Ljava/lang/Object;J)J (0 bytes) @ 0x00007fbea94ffabe [0x00007fbea94ffa00+0xbe]
j org.apache.spark.unsafe.Platform.getLong(Ljava/lang/Object;J)J+5
j org.apache.spark.unsafe.bitset.BitSetMethods.isSet(Ljava/lang/Object;JI)Z+66
j org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(I)Z+14
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.fieldToString_0_2$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/expressions/codegen/UTF8StringBuilder;)V+160
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V+76
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;+25
j org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j scala.collection.Iterator$$anon$11.next()Ljava/lang/Object;+13
j scala.collection.Iterator$$anon$10.next()Ljava/lang/Object;+22
j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Lscala/collection/Iterator;)Lscala/collection/Iterator;+78
j org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(Ljava/lang/Object;)Ljava/lang/Object;+5
j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Lorg/apache/spark/TaskContext;ILscala/collection/Iterator;)Lscala/collection/Iterator;+8
j org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+13
j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+27
j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j org.apache.spark.rdd.MapPartitionsRDD.compute(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+24
j org.apache.spark.rdd.RDD.computeOrReadCheckpoint(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+26
j org.apache.spark.rdd.RDD.iterator(Lorg/apache/spark/Partition;Lorg/apache/spark/TaskContext;)Lscala/collection/Iterator;+33
j org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+187
j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object;+210
j org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply()Ljava/lang/Object;+37
j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+3
j org.apache.spark.executor.Executor$TaskRunner.run()V+383
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
V [libjvm.so+0x680c5e]
V [libjvm.so+0x67e024]
V [libjvm.so+0x67e639]
V [libjvm.so+0x6c3d41]
V [libjvm.so+0xa77c22]
V [libjvm.so+0x8c3b12]
C [libpthread.so.0+0x7de5] start_thread+0xc5{noformat}
Also attached is the core dump file *coredump.zip*.
> Core dump happening on executors while doing simple union of Data Frames
> ------------------------------------------------------------------------
>
> Key: SPARK-29767
> URL: https://issues.apache.org/jira/browse/SPARK-29767
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Core
> Affects Versions: 2.4.4
> Environment: AWS EMR 5.27.0, Spark 2.4.4
> Reporter: Udit Mehrotra
> Priority: Major
> Attachments: coredump.zip, hs_err_pid13885.log, part-00000-0189b5c2-7f7b-4d0e-bdb8-506380253597-c000.snappy.parquet
>
>
> Running a union operation on two DataFrames through both the Scala Spark shell and PySpark results in the executor containers doing a *core dump* and exiting with exit code 134.
> The trace from the *Driver*:
> {noformat}
> Container exited with a non-zero exit code 134
> .
> 19/11/06 02:21:35 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5, ip-172-30-6-79.ec2.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container from a bad node: container_1572981097605_0021_01_000077 on host: ip-172-30-6-79.ec2.internal. Exit status: 134. Diagnostics: Exception from container-launch.
> Container id: container_1572981097605_0021_01_000077
> Exit code: 134
> Exception message: /bin/bash: line 1: 12611 Aborted LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderrStack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 12611 Aborted LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native" /usr/lib/jvm/java-openjdk/bin/java -server -Xmx2743m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC' 
> '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Djava.io.tmpdir=/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/tmp '-Dspark.history.ui.port=18080' '-Dspark.driver.port=42267' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@ip-172-30-6-103.ec2.internal:42267 --executor-id 11 --hostname ip-172-30-6-79.ec2.internal --cores 2 --app-id application_1572981097605_0021 --user-class-path file:/mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/__app__.jar > /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stdout 2> /var/log/hadoop-yarn/containers/application_1572981097605_0021/container_1572981097605_0021_01_000077/stderr at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
> at org.apache.hadoop.util.Shell.run(Shell.java:869)
> at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
> at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Container exited with a non-zero exit code 134{noformat}
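Exit code 134 decodes, under the usual bash convention of 128 + signal number, to SIGABRT: the JVM calls abort() after its SIGSEGV handler writes the hs_err report. A quick way to decode such container exit statuses (a sketch assuming the 128 + N convention, which is what bash uses when a child is killed by a signal):

```python
import signal

# YARN reports the shell's exit status; bash encodes "killed by signal N"
# as 128 + N, so exit code 134 corresponds to signal 6.
exit_code = 134
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGABRT: the JVM aborts after writing hs_err_pid*.log
```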
> From the *stdout* logs of the exiting container we see:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f825e3b0e92, pid=12611, tid=0x00007f822b5fb700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-b10)
> # Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
> # Problematic frame:
> # V [libjvm.so+0xa9ae92]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /mnt1/yarn/usercache/hadoop/appcache/application_1572981097605_0021/container_1572981097605_0021_01_000077/hs_err_pid12611.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> #{noformat}
> Also, I am unable to enable a *core dump* even though *ulimit -c* is set to *unlimited*. Can you help with how to approach this issue, and also with how to obtain the *core dump*?
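On the *ulimit* question: setting *ulimit -c unlimited* in a login shell does not affect already-running daemons such as the YARN NodeManager or the containers it launches, because resource limits are inherited per process tree. One way to see what limit a process actually inherited is the stdlib `resource` module; a minimal sketch (the remark about NodeManager inheritance is an assumption about a typical EMR setup, not verified here):

```python
import resource

# Query the core-file size limit inherited by the current process.
# A soft limit of 0 means "ulimit -c 0" was in effect when this process
# tree started; changing it in an interactive shell afterwards does not
# propagate to daemons that are already running.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("soft:", soft, "hard:", hard)

# Raise the soft limit to the hard limit for this process and its children.
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print("after:", resource.getrlimit(resource.RLIMIT_CORE))
```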
> Steps to reproduce the issue:
> * Upload the attached parquet data file to S3 *s3://<bucket>/tables/spark_29767_parquet_table/inserted_at=201910/*
> * Create a partitioned hive table
> {code:java}
> CREATE EXTERNAL TABLE `spark_29767_parquet_table`(
>   `hour` bigint,
>   `title` string,
>   `__deleted` string,
>   `status` string,
>   `transformationid` string,
>   `roomid` string,
>   `day` bigint,
>   `notes` string,
>   `nunitsfromaudit` bigint,
>   `ts_ms` bigint,
>   `liability` string,
>   `_class` string,
>   `month` bigint,
>   `updatedate` struct<`date`:bigint>,
>   `_id` struct<oid:string>,
>   `year` bigint,
>   `item` struct<name:string,brandname:string,perunitpricefromaudit:struct<currency:string,amount:string>,actualPerUnitPrice:struct<currency:string,amount:string>,category:string,itemType:string,roomAmenityId:bigint>,
>   `createddate` struct<`date`:bigint>,
>   `actualunits` bigint,
>   `description` string)
> PARTITIONED BY (
>   `inserted_at` string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3://<bucket>/tables/spark_29767_parquet_table'
> {code}
> * Sync partition
> {code:java}
> ALTER TABLE spark_29767_parquet_table ADD PARTITION (inserted_at='201910') location 's3://<bucket>/tables/spark_29767_parquet_table/inserted_at=201910/'
> {code}
> * In pyspark run the following:
> {code:java}
> # Read the base DataFrame
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession, HiveContext
> from pyspark.sql.functions import lit
> sparkSession = (SparkSession
>     .builder
>     .appName('example-pyspark-read-and-write-from-hive')
>     .enableHiveSupport()
>     .getOrCreate())
> base_df = sparkSession.table("spark_29767_parquet_table")
> base_df = base_df.select("_id", "_class", "roomid", "item", "inserted_at")
> # Create a new DataFrame with one row for the union
> from pyspark.sql import *
> import pyspark.sql.types
> from pyspark.sql.types import *
> schema = StructType([
>     StructField("_id",StructType([StructField("oid",StringType(),True)]),True),
>     StructField("_class",StringType(),True),
>     StructField("roomid",StringType(),True),
>     StructField("item",StructType([
>         StructField("name",StringType(),True),
>         StructField("brandname",StringType(),True),
>         StructField("perunitpricefromaudit",
>             StructType([
>                 StructField("currency",StringType(),True),
>                 StructField("amount",StringType(),True)]),True),
>         StructField("actualperunitprice",StructType([
>             StructField("currency",StringType(),True),
>             StructField("amount",StringType(),True)]),True),
>         StructField("category",StringType(),True),
>         StructField("itemtype",StringType(),True),
>         StructField("roomamenityid",LongType(),True)]),True),
>     StructField("inserted_at",StringType(),True)])
> data = [
>     Row(Row("5daff5ca43b8a36756c23b0f"),
>         "com.oyo.transformations.tasks.model.implementations.AuditItemTaskImpl",
>         None,
>         Row("Geyser Installation(with accessories)",None,Row("INR", "425.0"),None,"INFRASTRUCTURE","PMC",None),
>         "201910"
>     )
> ]
> inc_df = sparkSession.createDataFrame(
>     sparkSession.sparkContext.parallelize(data),
>     schema
> )
> inc_df.union(base_df).show()
> {code}
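Worth keeping in mind when reproducing: Spark's `union` resolves columns by position, not by name, so the hand-built `schema` must match the table's parquet schema field for field; `DataFrame.unionByName` matches top-level columns by name instead. A plain-Python sketch of the two pairing semantics (an analogy only, no Spark involved; the field names are taken from the repro above):

```python
# Two "schemas" with the same field names in a different order.
left_fields = ["name", "brandname", "category"]
right_fields = ["brandname", "name", "category"]
right_row = (None, "Geyser", "INFRASTRUCTURE")  # laid out in right_fields order

# Positional pairing (what union does): values line up by index,
# so right_row's "brandname" value lands under left's "name" column.
positional = dict(zip(left_fields, right_row))
print(positional)  # {'name': None, 'brandname': 'Geyser', 'category': 'INFRASTRUCTURE'}

# By-name pairing (what unionByName does for top-level columns):
# each value stays with the field it was defined under.
by_name = dict(zip(right_fields, right_row))
print(by_name)     # {'brandname': None, 'name': 'Geyser', 'category': 'INFRASTRUCTURE'}
```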
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org