Posted to issues@spark.apache.org by "Tal Sliwowicz (JIRA)" <ji...@apache.org> on 2014/10/22 22:48:34 UTC

[jira] [Comment Edited] (SPARK-4006) Spark Driver crashes whenever an Executor is registered twice

    [ https://issues.apache.org/jira/browse/SPARK-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180512#comment-14180512 ] 

Tal Sliwowicz edited comment on SPARK-4006 at 10/22/14 8:48 PM:
----------------------------------------------------------------

Cool! Would be very interesting to know.
For us it's hard to reproduce this on demand (because you need two registers without a remove in between), but it always happens eventually, so I can say for sure that the fix resolved our issue.


was (Author: sliwo):
Cool! Would be very interesting to know.
For us it's hard to force reproduce this (because you need to registers without a remove in between), but it always happens eventually, so I can tell for sure that the fix resolved our issue.

> Spark Driver crashes whenever an Executor is registered twice
> -------------------------------------------------------------
>
>                 Key: SPARK-4006
>                 URL: https://issues.apache.org/jira/browse/SPARK-4006
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 0.9.2, 1.0.2, 1.1.0, 1.2.0
>         Environment: Mesos, Coarse Grained
>            Reporter: Tal Sliwowicz
>            Priority: Critical
>
> This is a huge robustness issue for us (Taboola), in mission-critical, time-sensitive (real-time) Spark jobs.
> We have long-running Spark drivers, and even though we have state-of-the-art hardware, executors disconnect from time to time. In many cases the RemoveExecutor message is not received, and when the new executor registers, the driver crashes. In Mesos coarse-grained mode, executor IDs are fixed.
> The issue is the System.exit(1) call in BlockManagerMasterActor:
> {code}
> private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
>     if (!blockManagerInfo.contains(id)) {
>       blockManagerIdByExecutor.get(id.executorId) match {
>         case Some(manager) =>
>           // A block manager of the same executor already exists.
>           // This should never happen. Let's just quit.
>           logError("Got two different block manager registrations on " + id.executorId)
>           System.exit(1)
>         case None =>
>           blockManagerIdByExecutor(id.executorId) = id
>       }
>       logInfo("Registering block manager %s with %s RAM".format(
>         id.hostPort, Utils.bytesToString(maxMemSize)))
>       blockManagerInfo(id) =
>         new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
>     }
>     listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
>   }
> {code}
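To illustrate the direction a fix could take, here is a hedged, self-contained sketch (not the actual Spark patch, and not real Spark classes): instead of calling System.exit(1) on a duplicate registration, the master could drop the stale entry for that executor ID and accept the new registration. The BlockManagerId, BlockManagerInfo, and BlockManagerMaster names below are simplified stand-ins modeled on the code above.

```scala
// Minimal standalone model of the re-registration scenario; not Spark source.
case class BlockManagerId(executorId: String, hostPort: String)
case class BlockManagerInfo(id: BlockManagerId, registeredAt: Long, maxMemSize: Long)

class BlockManagerMaster {
  private val blockManagerInfo =
    scala.collection.mutable.Map[BlockManagerId, BlockManagerInfo]()
  private val blockManagerIdByExecutor =
    scala.collection.mutable.Map[String, BlockManagerId]()

  def register(id: BlockManagerId, maxMemSize: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // A block manager for this executor already exists: the old executor
          // likely died without a RemoveExecutor being delivered. Drop the
          // stale entry and accept the new registration instead of crashing.
          blockManagerInfo.remove(oldId)
        case None => // first registration for this executor ID
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) =
        BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize)
    }
  }

  def registeredExecutors: Set[String] = blockManagerIdByExecutor.keySet.toSet
}
```

Under this sketch, a second registration for the same executor ID (e.g. a Mesos coarse-grained executor coming back on a new host/port) replaces the stale block manager entry rather than terminating the driver.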



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
