Posted to issues@spark.apache.org by "Kazuaki Ishizaki (JIRA)" <ji...@apache.org> on 2018/04/02 18:44:00 UTC

[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

    [ https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422955#comment-16422955 ] 

Kazuaki Ishizaki commented on SPARK-23801:
------------------------------------------

If you use the Web UI, you can easily see SQL plans as a graph.
It may help you correlate stages with SQL statements.

https://spark.apache.org/docs/latest/monitoring.html
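
For example, here is a minimal spark-shell sketch (the tiny aggregation below is only a placeholder for the real failing query) showing how the codegen stage IDs printed by {{explain}} line up with the GeneratedIteratorForCodegenStageN class names that appear in crash logs like the attached one. The monitoring page above also describes how to enable the event log so completed applications stay visible in the History Server.

{code:scala}
// Minimal sketch (run in spark-shell); the tiny aggregation below is only a
// stand-in for the real failing query. `spark` is the SparkSession that
// spark-shell already provides.
val df = spark.range(1000).selectExpr("id % 7 AS k").groupBy("k").count()

// Physical plan: in Spark 2.3 each whole-stage-codegen block is prefixed with
// "*(N)", and N matches the GeneratedIteratorForCodegenStageN class name seen
// in a crash dump (stage 50 in the attached log).
df.explain(true)

// Optionally dump the generated Java source of every codegen stage.
import org.apache.spark.sql.execution.debug._
df.debugCodegen()
{code}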

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --------------------------------------------------
>
>                 Key: SPARK-23801
>                 URL: https://issues.apache.org/jira/browse/SPARK-23801
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.3.0
>         Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>            Reporter: Nathan Kleyn
>            Priority: Major
>         Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent segfaults in a large Spark job (18 * r3.4xlarge 16-core boxes with 105G of executor memory). I've attached the full core dump, but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f1467427fdc, pid=1315, tid=0x00007f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> #
> # Core dump written. Default location: /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---------------  T H R E A D  ---------------
> Current thread (0x00007f146005b000):  GCTaskThread [stack: 0x00007f1464e2d000,0x00007f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x00007ef9c035f8c8, RCX=0x00007f1464f2c9f0, RDX=0x0000000000000000
> RSP=0x00007f1464f2c1a0, RBP=0x00007f1464f2c210, RSI=0x0000000000000068, RDI=0x00007ef7bc30bda8
> R8 =0x00007f1464f2c3d0, R9 =0x0000000000001741, R10=0x00007f1467a52819, R11=0x00007f14671240e0
> R12=0x00007f130912c998, R13=0x17e907feccbc6d20, R14=0x0000000000000002, R15=0x000000000000000d
> RIP=0x00007f1467427fdc, EFLAGS=0x0000000000010202, CSGSFS=0x002b000000000033, ERR=0x0000000000000000
>   TRAPNO=0x000000000000000d
> Top of Stack: (sp=0x00007f1464f2c1a0)
> 0x00007f1464f2c1a0:   00007f146005b000 0000000000000001
> 0x00007f1464f2c1b0:   0000000000000004 00007f14600bb640
> 0x00007f1464f2c1c0:   00007f1464f2c210 00007f14673aeed6
> 0x00007f1464f2c1d0:   00007f1464f2c2c0 00007f1464f2c250
> 0x00007f1464f2c1e0:   00007f11bde31b70 00007ef9c035f8c8
> 0x00007f1464f2c1f0:   00007ef8a80a7060 0000000000001741
> 0x00007f1464f2c200:   0000000000000002 00000000ffffffff
> 0x00007f1464f2c210:   00007f1464f2c230 00007f146742b005
> 0x00007f1464f2c220:   00007ef8a80a7050 0000000000001741
> 0x00007f1464f2c230:   00007f1464f2c2d0 00007f14673ae9fb
> 0x00007f1464f2c240:   00007f1467a5d880 00007f14673ad9a0
> 0x00007f1464f2c250:   00007f1464f2c9f0 00007f1464f2c3d0
> 0x00007f1464f2c260:   00007f1464f2c3a0 00007f146005b620
> 0x00007f1464f2c270:   00007ef8b843d7c8 ffff000200000006
> 0x00007f1464f2c280:   00007f1464f2c340 00007f14600bb640
> 0x00007f1464f2c290:   17417f1453fb9cec 00007f1453fbffff
> 0x00007f1464f2c2a0:   00007f1453fb819e 00007f1464f2c3a0
> 0x00007f1464f2c2b0:   0000000000000001 0000000000000000
> 0x00007f1464f2c2c0:   00007f1464f2c3d0 00007f1464f2c9d0
> 0x00007f1464f2c2d0:   00007f1464f2c340 00007f1467025f22
> 0x00007f1464f2c2e0:   00007f145427cb5c 00007f1464f2c3a0
> 0x00007f1464f2c2f0:   00007f1464f2c370 00007f146005b000
> 0x00007f1464f2c300:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c310:   00007f1464f2c9f0 00007f1464f2c3a0
> 0x00007f1464f2c320:   00007f1464f2c3d0 00007f146005b000
> 0x00007f1464f2c330:   00007f1464f2c9f0 00007ef850009800
> 0x00007f1464f2c340:   00007f1464f2c9c0 00007f1467508191
> 0x00007f1464f2c350:   00007ef9c16f7890 00007f1464f2c370
> 0x00007f1464f2c360:   00007f1464f2c9d0 0000000000000000
> 0x00007f1464f2c370:   00007ef9c035f8c0 00007f145427cb5c
> 0x00007f1464f2c380:   00007f145427ba90 00007ef900000000
> 0x00007f1464f2c390:   0000000000000078 00007ef9c035f8c0 
> Instructions: (pc=0x00007f1467427fdc)
> 0x00007f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x00007f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x00007f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 0x00007f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 
> Register to memory mapping:
> RAX=0x17e907feccbc6d20 is an unknown value
> RBX=0x00007ef9c035f8c8 is pointing into the stack for thread: 0x00007ef850009800
> RCX=0x00007f1464f2c9f0 is an unknown value
> RDX=0x0000000000000000 is an unknown value
> RSP=0x00007f1464f2c1a0 is an unknown value
> RBP=0x00007f1464f2c210 is an unknown value
> RSI=0x0000000000000068 is an unknown value
> RDI=0x00007ef7bc30bda8 is pointing into metadata
> R8 =0x00007f1464f2c3d0 is an unknown value
> R9 =0x0000000000001741 is an unknown value
> R10=0x00007f1467a52819: <offset 0xfc0819> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x00007f1466a92000
> R11=0x00007f14671240e0: <offset 0x6920e0> in /usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x00007f1466a92000
> R12=0x00007f130912c998 is an oop
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50 
>  - klass: 'org/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage50'
> R13=0x17e907feccbc6d20 is an unknown value
> R14=0x0000000000000002 is an unknown value
> R15=0x000000000000000d is an unknown value
> Stack: [0x00007f1464e2d000,0x00007f1464f2e000],  sp=0x00007f1464f2c1a0,  free space=1020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x995fdc]  oopDesc* PSPromotionManager::copy_to_survivor_space<false>(oopDesc*)+0x7c
> V  [libjvm.so+0x999005]  PSRootsClosure<false>::do_oop(oopDesc**)+0x35
> V  [libjvm.so+0x91c9fb]  OopMapSet::all_do(frame const*, RegisterMap const*, OopClosure*, void (*)(oopDesc**, oopDesc**), OopClosure*)+0x2fb
> V  [libjvm.so+0x593f22]  frame::oops_do_internal(OopClosure*, CLDClosure*, CodeBlobClosure*, RegisterMap*, bool)+0xa2
> V  [libjvm.so+0xa76191]  JavaThread::oops_do(OopClosure*, CLDClosure*, CodeBlobClosure*)+0x161
> V  [libjvm.so+0x99926f]  ThreadRootsTask::do_it(GCTaskManager*, unsigned int)+0x6f
> V  [libjvm.so+0x5dbfef]  GCTaskThread::run()+0x12f
> V  [libjvm.so+0x92da28]  java_start(Thread*)+0x108
> JavaThread 0x00007ef850009800 (nid = 1558) was being processed
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 2336  sun.misc.Unsafe.putLong(Ljava/lang/Object;JJ)V (0 bytes) @ 0x00007f14518c70cc [0x00007f14518c7080+0x4c]
> J 20102 C2 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage50.processNext()V (1030 bytes) @ 0x00007f145427cb5c [0x00007f145427c020+0xb3c]
> J 9304 C2 scala.collection.Iterator$$anon$11.hasNext()Z (10 bytes) @ 0x00007f145280da10 [0x00007f145280d460+0x5b0]
> J 15346 C2 org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(Lscala/collection/Iterator;)V (117 bytes) @ 0x00007f145227172c [0x00007f1452271680+0xac]
> J 16755 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus; (293 bytes) @ 0x00007f14534a1dbc [0x00007f145349f820+0x259c]
> J 16754 C1 org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object; (6 bytes) @ 0x00007f14536cf5cc [0x00007f14536cf540+0x8c]
> J 15858 C1 org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;)Ljava/lang/Object; (399 bytes) @ 0x00007f1452eccd44 [0x00007f1452eca8a0+0x24a4]
> J 16786 C1 org.apache.spark.executor.Executor$TaskRunner.run()V (2984 bytes) @ 0x00007f1453a4c97c [0x00007f1453a495e0+0x339c]
> J 18919 C1 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f1453fb91cc [0x00007f1453fb81c0+0x100c]
> j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub{code}
> Unfortunately, this job is so large that it is practically impossible for us to narrow it down to a reproducible test case. What I can say, though, is that:
>  * We are running on Mesos using coarse grained scheduling.
>  * We can make it fail every time, consistently.
>  * It only happened after we upgraded to v2.3.0.
>  * All inputs and options to the job are _exactly_ the same before and after the upgrade.
> Please let me know if we can provide any other information!


