Posted to issues@spark.apache.org by "Saisai Shao (JIRA)" <ji...@apache.org> on 2016/08/08 01:48:20 UTC

[jira] [Comment Edited] (SPARK-16914) NodeManager crash when Spark is registering executor information into leveldb

    [ https://issues.apache.org/jira/browse/SPARK-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411159#comment-15411159 ] 

Saisai Shao edited comment on SPARK-16914 at 8/8/16 1:48 AM:
-------------------------------------------------------------

So from your description, is this exception mainly due to a problem with disk1, such that leveldb fails to write data to it?

Maybe this JIRA SPARK-14963 could address your problem: it uses the NM's recovery dir to store the aux-service data. And I guess the NM will handle this disk failure if you configure multiple disks for the NM local dirs.
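For reference, NM recovery and multiple local dirs are configured in yarn-site.xml roughly as below (the yarn.nodemanager.recovery.* and local-dirs property names come from Hadoop's YARN configuration; the paths are placeholders, not a recommendation):

```xml
<!-- yarn-site.xml (excerpt); paths below are examples only -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Where the NM keeps its recovery state; with SPARK-14963 the
       shuffle service's leveldb would live under this dir too. -->
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/hadoop/yarn/nm-recovery</value>
</property>
<property>
  <!-- Spreading local dirs over several disks lets the NM tolerate
       a single failed disk. -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/yarn/local,/disk2/yarn/local</value>
</property>
```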


was (Author: jerryshao):
So from your description, is this exception mainly due to a problem with disk1, such that leveldb fails to write data to it?

Maybe this JIRA SPARK-16917 could address your problem: it uses the NM's recovery dir to store the aux-service data. And I guess the NM will handle this disk failure if you configure multiple disks for the NM local dirs.

> NodeManager crash when Spark is registering executor information into leveldb
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-16914
>                 URL: https://issues.apache.org/jira/browse/SPARK-16914
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.6.2
>            Reporter: cen yuhai
>
> {noformat}
> Stack: [0x00007fb5b53de000,0x00007fb5b54df000],  sp=0x00007fb5b54dcba8,  free space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C  [libc.so.6+0x896b1]  memcpy+0x11
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> j  org.fusesource.leveldbjni.internal.NativeDB$DBJNI.Put(JLorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)J+0
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeSlice;Lorg/fusesource/leveldbjni/internal/NativeSlice;)V+11
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;Lorg/fusesource/leveldbjni/internal/NativeBuffer;Lorg/fusesource/leveldbjni/internal/NativeBuffer;)V+18
> j  org.fusesource.leveldbjni.internal.NativeDB.put(Lorg/fusesource/leveldbjni/internal/NativeWriteOptions;[B[B)V+36
> j  org.fusesource.leveldbjni.internal.JniDB.put([B[BLorg/iq80/leveldb/WriteOptions;)Lorg/iq80/leveldb/Snapshot;+28
> j  org.fusesource.leveldbjni.internal.JniDB.put([B[B)V+10
> j  org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.registerExecutor(Ljava/lang/String;Ljava/lang/String;Lorg/apache/spark/network/shuffle/protocol/ExecutorShuffleInfo;)V+61
> J 8429 C2 org.apache.spark.network.server.TransportRequestHandler.handle(Lorg/apache/spark/network/protocol/RequestMessage;)V (100 bytes) @ 0x00007fb5f27ff6cc [0x00007fb5f27fdde0+0x18ec]
> J 8371 C2 org.apache.spark.network.server.TransportChannelHandler.channelRead0(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (10 bytes) @ 0x00007fb5f242df20 [0x00007fb5f242de80+0xa0]
> J 6853 C2 io.netty.channel.SimpleChannelInboundHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (74 bytes) @ 0x00007fb5f215587c [0x00007fb5f21557e0+0x9c]
> J 5872 C2 io.netty.handler.timeout.IdleStateHandler.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (42 bytes) @ 0x00007fb5f2183268 [0x00007fb5f2183100+0x168]
> J 5849 C2 io.netty.handler.codec.MessageToMessageDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (158 bytes) @ 0x00007fb5f2191524 [0x00007fb5f218f5a0+0x1f84]
> J 5941 C2 org.apache.spark.network.util.TransportFrameDecoder.channelRead(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V (170 bytes) @ 0x00007fb5f220a230 [0x00007fb5f2209fc0+0x270]
> J 7747 C2 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read()V (363 bytes) @ 0x00007fb5f264465c [0x00007fb5f2644140+0x51c]
> J 8008% C2 io.netty.channel.nio.NioEventLoop.run()V (162 bytes) @ 0x00007fb5f26f6764 [0x00007fb5f26f63c0+0x3a4]
> j  io.netty.util.concurrent.SingleThreadEventExecutor$2.run()V+13
> j  java.lang.Thread.run()V+11
> v  ~StubRoutines::call_stub
> {noformat}
> The relevant code in Spark is in ExternalShuffleBlockResolver:
> {code}
>   /** Registers a new Executor with all the configuration we need to find its shuffle files. */
>   public void registerExecutor(
>       String appId,
>       String execId,
>       ExecutorShuffleInfo executorInfo) {
>     AppExecId fullId = new AppExecId(appId, execId);
>     logger.info("Registered executor {} with {}", fullId, executorInfo);
>     try {
>       if (db != null) {
>         byte[] key = dbAppExecKey(fullId);
>         byte[] value = mapper.writeValueAsString(executorInfo).getBytes(Charsets.UTF_8);
>         db.put(key, value);
>       }
>     } catch (Exception e) {
>       logger.error("Error saving registered executors", e);
>     }
>     executors.put(fullId, executorInfo);
>   }
> {code}
> There is a problem with disk1.
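Note that the try/catch around db.put cannot help here: the crash happens in native leveldbjni code (the memcpy frame in the stack above), which takes down the whole NodeManager JVM instead of raising a Java exception. A minimal defensive sketch (a hypothetical helper, not Spark code) would be to probe whether the db's directory is still writable before attempting the native put:

```java
import java.io.File;
import java.io.IOException;

public class DiskProbe {
    /** Returns true if we can still create and delete a file in dir. */
    static boolean isWritable(File dir) {
        try {
            File probe = File.createTempFile("probe", ".tmp", dir);
            return probe.delete();
        } catch (IOException e) {
            return false; // disk failed, is full, or is read-only
        }
    }

    public static void main(String[] args) {
        // Probe the JVM's temp dir as a stand-in for the leveldb dir.
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        System.out.println(isWritable(tmp));
    }
}
```

This only narrows the window (the disk can still fail between the probe and the put), so moving the leveldb onto the NM recovery dir, as SPARK-14963 does, is the more robust fix.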



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org