Posted to issues@spark.apache.org by "Waleed Fateem (JIRA)" <ji...@apache.org> on 2019/04/12 15:52:00 UTC

[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

    [ https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816384#comment-16816384 ] 

Waleed Fateem commented on SPARK-23801:
---------------------------------------

I just wanted to add a comment here in case it's useful. It's peculiar that this happened consistently after a Spark upgrade, but according to the HotSpot error log the crash looks more like a JDK issue than a Spark issue. Specifically, it appears to be related to garbage collection (the crash is in the code that copies objects to survivor space).

In our case, we were able to work around the issue by changing the garbage collection policy, for example:

{code}
--conf spark.executor.extraJavaOptions='-XX:+UseG1GC'
--conf spark.driver.extraJavaOptions='-XX:+UseG1GC'
{code}
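
The same settings can also be applied when building the session programmatically. This is only a minimal sketch (the application name is a placeholder, and the driver option only takes effect if it is in place before the driver JVM starts, e.g. via spark-defaults.conf or spark-submit):

{code:scala}
import org.apache.spark.sql.SparkSession

// The crash is inside PSPromotionManager, i.e. the default Parallel Scavenge
// collector; these settings move the executors (and, when set before launch,
// the driver) over to G1, which is what worked around the segfault for us.
val spark = SparkSession.builder()
  .appName("gc-workaround-example") // placeholder name
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .config("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
  .getOrCreate()
{code}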

Coincidentally, the same Java version (1.8.0_161) was being used in our case as well:


{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1467427fdc, pid=1315, tid=0x00007f1464f2d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
{code}

I would try upgrading the JDK to see whether the issue still occurs.
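
If you do try a newer JDK, it's worth double-checking that the executors actually pick it up (the crash log above shows 1.8.0_161 on the executor side). A small sanity check, assuming the session is built as in the earlier sketch:

{code:scala}
import org.apache.spark.sql.SparkSession

// Reuse (or build) the session, then collect the distinct java.version strings
// reported by the executor JVMs to confirm that the upgraded JDK is really in
// use cluster-wide.
val spark = SparkSession.builder().getOrCreate()

val jvmVersions = spark.sparkContext
  .parallelize(1 to 1000, 50) // spread the tasks across the executors
  .map(_ => System.getProperty("java.version"))
  .distinct()
  .collect()

println(jvmVersions.mkString(", "))
{code}

The driver side can be checked the same way with a plain System.getProperty("java.version") call in the driver process.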

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --------------------------------------------------
>
>                 Key: SPARK-23801
>                 URL: https://issues.apache.org/jira/browse/SPARK-23801
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>         Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>            Reporter: Nathan Kleyn
>            Priority: Major
>         Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of executor memory). I've attached the full core dump, but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f1467427fdc, pid=1315, tid=0x00007f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> #
> # Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {code}
> {code:java}
> ---------------  T H R E A D  ---------------
> Current thread (0x00007f146005b000):  GCTaskThread [stack: 0x00007f1464e2d000,0x00007f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x00007ef9c035f8c8, RCX=0x00007f1464f2c9f0, RDX=0x0000000000000000
> RSP=0x00007f1464f2c1a0, RBP=0x00007f1464f2c210, RSI=0x0000000000000068, RDI=0x00007ef7bc30bda8
> R8 =0x00007f1464f2c3d0, R9 =0x0000000000001741, R10=0x00007f1467a52819, R11=0x00007f14671240e0
> R12=0x00007f130912c998, R13=0x17e907feccbc6d20, R14=0x0000000000000002, R15=0x000000000000000d
> RIP=0x00007f1467427fdc, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000000
>   TRAPNO=0x000000000000000d
> Top of Stack: (sp=0x00007f1464f2c1a0)
> 0x00007f1464f2c1a0:   00007f146005b000 0000000000000001
> 0x00007f1464f2c1b0:   0000000000000004 00007f14600bb640
> 0x00007f1464f2c1c0:   00007f1464f2c210 00007f14673aeed6
> 0x00007f1464f2c1d0:   00007f1464f2c2c0 00007f1464f2c250
> 0x00007f1464f2c1e0:   00007f11bde31b70 00007ef9c035f8c8
> 0x00007f1464f2c1f0:   00007ef8a80a7060 0000000000001741
> 0x00007f1464f2c200:   0000000000000002 00000000ffffffff
> 0x00007f1464f2c210:   00007f1464f2c230 00007f146742b005
> 0x00007f1464f2c220:   00007ef8a80a7050 0000000000001741
> 0x00007f1464f2c230:   00007f1464f2c2d0 00007f14673ae9fb
> 0x00007f1464f2c240:   00007f1467a5d880 00007f14673ad9a0
> 0x00007f1464f2c250:   00007f1464f2c9f0 00007f1464f2c3d0
> 0x00007f1464f2c260:   00007f1464f2c3a0 00007f146005b620
> 0x00007f1464f2c270:   00007ef8b843d7c8 ffff000200000006
> 0x00007f1464f2c280:   00007f1464f2c340 00007f14600bb640
> 0x00007f1464f2c290:   17417f1453fb9cec 00007f1453fbffff
> 0x00007f1464f2c2a0:   00007f1453fb819e 00007f1464f2c3a0
> 0x00007f1464f2c2b0:   0000000000000001 0000000000000000
> 0x00007f1464f2c2c0:   00007f1464f2c3d0 00007f1464f2c9d0
> 0x00007f1464f2c2d0:   00007f1464f2c340 00007f1467025f22
> 0x00007f1464f2c2e0:   00007f145427cb5c 00007f1464f2c3a0
> 0x00007f1464f2c2f0:   00007f1464f2c370 00007f146005b000
> 0x00007f1464f2c300:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c310:   00007f1464f2c9f0 00007f1464f2c3a0
> 0x00007f1464f2c320:   00007f1464f2c3d0 00007f146005b000
> 0x00007f1464f2c330:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c340:   00007f1464f2c9c0 00007f1467508191
> 0x00007f1464f2c350:   00007ef9c16f7890 00007f1464f2c370
> 0x00007f1464f2c360:   00007f1464f2c9d0 0000000000000000
> 0x00007f1464f2c370:   00007ef9c035f8c0 00007f145427cb5c
> 0x00007f1464f2c380:   00007f145427ba90 00007ef900000000
> 0x00007f1464f2c390:   0000000000000078 00007ef9c035f8c0 
> Instructions: (pc=0x00007f1467427fdc)
> 0x00007f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x00007f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x00007f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 0x00007f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 
> Register to memory mapping:
> RAX=0x17e907feccbc6d20 is an unknown value
> RBX=0x00007ef9c035f8c8 is pointing into the stack for thread: 0x00007ef850009800
> RCX=0x00007f1464f2c9f0 is an unknown value
> RDX=0x0000000000000000 is an unknown value
> RSP=0x00007f1464f2c1a0 is an unknown value
> RBP=0x00007f1464f2c210 is an unknown value
> RSI=0x0000000000000068 is an unknown value
> RDI=0x00007ef7bc30bda8 is pointing into metadata
> R8 =0x00007f1464f2c3d0 is an unknown value
> R9 =0x0000000000001741 is an unknown value
> R10=0x00007f1467a52819: <offset 0xfc0819> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x00007f1466a92000
> R11=0x00007f14671240e0: <offset 0x6920e0> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x00007f1466a92000
> R12=0x00007f130912c998 is an oop
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50 
>  - klass: 'org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage50'
> R13=0x17e907feccbc6d20 is an unknown value
> R14=0x0000000000000002 is an unknown value
> R15=0x000000000000000d is an unknown value
> Stack: [0x00007f1464e2d000,0x00007f1464f2e000],  sp=0x00007f1464f2c1a0,  free space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> V  [libjvm.so+0x999005]  PSRootsClosure<false>::do_oop(oopDesc**)+0x35
> V  [libjvm.so+0x91c9fb]  OopMapSet::all_do(frame const*, RegisterMap const*, OopClosure*, void (*)(oopDesc**, oopDesc**), OopClosure*)+0x2fb
> V  [libjvm.so+0x593f22]  frame::oops_do_internal(OopClosure*, CLDClosure*, CodeBlobClosure*, RegisterMap*, bool)+0xa2
> V  [libjvm.so+0xa76191]  JavaThread::oops_do(OopClosure*, CLDClosure*, CodeBlobClosure*)+0x161
> V  [libjvm.so+0x99926f]  ThreadRootsTask::do_it(GCTaskManager*, unsigned int)+0x6f
> V  [libjvm.so+0x5dbfef]  GCTaskThread::run()+0x12f
> V  [libjvm.so+0x92da28]  java_start(Thread*)+0x108
> JavaThread 0x00007ef850009800 (nid = 1558) was being processed
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 2336  sun.misc.Unsafe.putLong(Ljava/lang/Object;JJ)V (0 bytes) @ 0x00007f14518c70cc [0x00007f14518c7080+0x4c]
> J 20102 C2 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50.processNext()V (1030 bytes) @ 0x00007f145427cb5c [0x00007f145427c020+0xb3c]
> J 9304 C2 scala.collection.Iterator$$anon$11.hasNext()Z (10 bytes) @ 0x00007f145280da10 [0x00007f145280d460+0x5b0]
> J 15346 C2 org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V (117 bytes) @ 0x00007f145227172c [0x00007f1452271680+0xac]
> J 16755 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus; (293 bytes) @ 0x00007f14534a1dbc [0x00007f145349f820+0x259c]
> J 16754 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object; (6 bytes) @ 0x00007f14536cf5cc [0x00007f14536cf540+0x8c]
> J 15858 C1 org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object; (399 bytes) @ 0x00007f1452eccd44 [0x00007f1452eca8a0+0x24a4]
> J 16786 C1 org.apache.spark.executor.Executor$TaskRunner.run()V (2984 bytes) @ 0x00007f1453a4c97c [0x00007f1453a495e0+0x339c]
> J 18919 C1 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f1453fb91cc [0x00007f1453fb81c0+0x100c]
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {code}
> Unfortunately, this job is so large that it's practically impossible for us to narrow it down to a reproducible test case. What I can say, though, is that:
>  * We are running on Mesos using coarse grained scheduling.
>  * We can make it fail every time, consistently.
>  * It only happened after we upgraded to v2.3.0.
>  * All inputs and options to the job are _exactly_ the same before and after the upgrade.
> Please let me know if we can provide any other information!


