Posted to user@predictionio.apache.org by Bansari Shah <ba...@gmail.com> on 2016/12/14 11:25:51 UTC

Spark Executor lost error

Hi all,

I am trying to use the Spark environment in the predict function of an ML
engine for text analysis. It extends the P2LAlgorithm class. The system runs
on a standalone cluster.

The predict function for a new query is as below:

> override def predict(model: NaiveBayesModel, query: Query): PredictedResult = {
>   val sc_new = SparkContext.getOrCreate()
>   val sqlContext = SQLContext.getOrCreate(sc_new)
>   val phraseDataframe = sqlContext.createDataFrame(Seq(query)).toDF("text")
>   val dpObj = new DataPreparator
>   val tf = dpObj.processPhrase(phraseDataframe)
>
>   tf.show()
>
>   val labeledpoints = tf.map(row => row.getAs[Vector]("rowFeatures"))
>   val predictedResult = model.predict(labeledpoints)
>   predictedResult
> }


It trains properly with pio train, and after deployment it also predicts
results correctly for a single query.

But with pio eval, when I try to check the accuracy of the model, it runs
fine up to tf.show(); when it reaches the statement that forms the labeled
points, it gets stuck, and after a long wait it reports that the Spark
executor was lost and no heartbeat was received. Here is the error log:

WARN org.apache.spark.HeartbeatReceiver
> [sparkDriver-akka.actor.default-dispatcher-14] - Removing executor driver
> with no recent heartbeats: 686328 ms exceeds timeout 120000 ms
>


> ERROR org.apache.spark.scheduler.TaskSchedulerImpl
> [sparkDriver-akka.actor.default-dispatcher-14] - Lost executor driver on
> localhost: Executor heartbeat timed out after 686328 ms
>


> WARN org.apache.spark.scheduler.TaskSetManager
> [sparkDriver-akka.actor.default-dispatcher-14] - Lost task 3.0 in stage
> 103.0 (TID 237, localhost): ExecutorLostFailure (executor driver lost)
>


> ERROR org.apache.spark.scheduler.TaskSetManager
> [sparkDriver-akka.actor.default-dispatcher-14] - Task 3 in stage 103.0
> failed 1 times; aborting job
> ......
> org.apache.spark.SparkException: Job cancelled because SparkContext was
> shut down


Please suggest how I can solve this issue.
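For reference, the "120000 ms" threshold in the log above is Spark's network
timeout, after which an executor with no recent heartbeat is declared lost. As
a hedged sketch (the keys are standard Spark configuration settings, but the
values here are illustrative assumptions, not a confirmed fix), the timeouts
can be relaxed while investigating why the stage hangs:

```scala
import org.apache.spark.SparkConf

// Sketch only: relax the timeouts so a slow stage is not killed while debugging.
// Values are illustrative; spark.executor.heartbeatInterval must stay well
// below spark.network.timeout.
val conf = new SparkConf()
  .set("spark.network.timeout", "600s")
  .set("spark.executor.heartbeatInterval", "30s")
```

This only buys time for diagnosis; if the labeled-points stage truly hangs,
the job will still fail once the larger timeout is reached.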

Thank you.
Regards,
Bansari Shah

Re: Spark Executor lost error

Posted by Bansari Shah <ba...@gmail.com>.
Hi,
Thank you. Here is the detailed log of that executor error:

> [Stage 103:==>(3 + 1) / 4][Stage 105:==>(3 + 1) / 4][Stage 107:==>(3 + 1)
> / 4][WARN] [HeartbeatReceiver] Removing executor driver with no recent
> heartbeats: 143771 ms exceeds timeout 120000 ms
> [WARN] [tcp] RMI TCP Accept-0: accept loop for ServerSocket[addr=
> 0.0.0.0/0.0.0.0,localport=54740] throws
> [ERROR] [TaskSchedulerImpl] Lost executor driver on localhost: Executor
> heartbeat timed out after 143771 ms
> [WARN] [tcp] RMI TCP Accept-0: accept loop for ServerSocket[addr=
> 0.0.0.0/0.0.0.0,localport=54740] throws
> [WARN] [TaskSetManager] Lost task 3.0 in stage 103.0 (TID 237, localhost):
> ExecutorLostFailure (executor driver lost)
> [ERROR] [TaskSetManager] Task 3 in stage 103.0 failed 1 times; aborting job
> [WARN] [TaskSetManager] Lost task 3.0 in stage 105.0 (TID 245, localhost):
> ExecutorLostFailure (executor driver lost)
> [ERROR] [TaskSetManager] Task 3 in stage 105.0 failed 1 times; aborting job
> [WARN] [TaskSetManager] Lost task 3.0 in stage 107.0 (TID 253, localhost):
> ExecutorLostFailure (executor driver lost)
> [ERROR] [TaskSetManager] Task 3 in stage 107.0 failed 1 times; aborting job
> [WARN] [TaskSetManager] Lost task 3.0 in stage 110.0 (TID 261, localhost):
> ExecutorLostFailure (executor driver lost)
> [ERROR] [TaskSetManager] Task 3 in stage 110.0 failed 1 times; aborting job
> [Stage 103:=>(3 + -3) / 4][Stage 105:==>(3 + 1) / 4][Stage 107:==>(3 + 1)
> / 4][WARN] [SparkContext] Killing executors is only supported in
> coarse-grained mode
> Exception in thread "main" scala.collection.parallel.CompositeThrowable:
> Multiple exceptions thrown during a parallel computation:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3
> in stage 110.0 failed 1 times, most recent failure: Lost task 3.0 in stage
> 110.0 (TID 261, localhost): ExecutorLostFailure (executor driver lost)
> Driver stacktrace:
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> scala.Option.foreach(Option.scala:236)
>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> .
> .
> .
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3
> in stage 103.0 failed 1 times, most recent failure: Lost task 3.0 in stage
> 103.0 (TID 237, localhost): ExecutorLostFailure (executor driver lost)
> Driver stacktrace:
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> scala.Option.foreach(Option.scala:236)
>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> .
> .
> .
>         at
> scala.collection.parallel.package$$anon$1.alongWith(package.scala:88)
>         at
> scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
>         at
> scala.collection.parallel.ParIterableLike$Map.mergeThrowables(ParIterableLike.scala:1054)
>         at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
>         at
> scala.collection.parallel.ParIterableLike$Map.tryMerge(ParIterableLike.scala:1054)
>         at
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
>         at
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
>         at
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
>         at
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>         at
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
>         at
> scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:444)
>         at
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:514)
>         at
> scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:492)
>         at
> scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:64)
>         at
> scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:961)
>         at
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
>         at
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>         at
> scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
>         at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
>         at
> scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:956)
>         at
> scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
>         at
> scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
>         at
> scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [Stage 116:>  (0 + 0) / 4][Stage 117:>  (0 + 0) / 4][Stage 118:>  (0 + 0)
> / 4][ERROR] [BlockManager] Failed to report rdd_123_0 to master; giving up.
> [ERROR] [Executor] Exception in task 3.0 in stage 103.0 (TID 237)
> [ERROR] [Executor] Exception in task 3.0 in stage 110.0 (TID 261)
> [ERROR] [Executor] Exception in task 3.0 in stage 107.0 (TID 253)
> [ERROR] [Executor] Exception in task 3.0 in stage 105.0 (TID 245)
> [ERROR] [AkkaRpcEnv] Ignore error: Task
> org.apache.spark.executor.Executor$TaskRunner@187b1861 rejected from
> java.util.concurrent.ThreadPoolExecutor@15da1dcd[Shutting down, pool size
> = 3, active threads = 3, queued tasks = 0, completed tasks = 262]
> [ERROR] [AkkaRpcEnv] Ignore error: Task
> org.apache.spark.executor.Executor$TaskRunner@50f59ea3 rejected from
> java.util.concurrent.ThreadPoolExecutor@15da1dcd[Terminated, pool size =
> 0, active threads = 0, queued tasks = 0, completed tasks = 265]
> [ERROR] [AkkaRpcEnv] Ignore error: Task
> org.apache.spark.executor.Executor$TaskRunner@379764eb rejected from
> java.util.concurrent.ThreadPoolExecutor@15da1dcd[Terminated, pool size =
> 0, active threads = 0, queued tasks = 0, completed tasks = 265]
> [ERROR] [AkkaRpcEnv] Ignore error: Task
> org.apache.spark.executor.Executor$TaskRunner@2014e4fb rejected from
> java.util.concurrent.ThreadPoolExecutor@15da1dcd[Terminated, pool size =
> 0, active threads = 0, queued tasks = 0, completed tasks = 265]


Please suggest how to solve this, or what approaches could resolve this
error.
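One thing worth ruling out (an assumption, not a confirmed diagnosis): the
CompositeThrowable above shows several stages failing during a parallel
computation, and each tf.map(...) re-triggers the preparation pipeline.
Persisting the transformed frame, reusing the tf/processPhrase names from the
predict function in the first message, keeps tf.show() and the prediction map
from recomputing it:

```scala
// Sketch only, reusing names defined in the predict function above.
// persist() caches the prepared frame so tf.show() and the map that follows
// do not each recompute the whole preparation pipeline.
val tf = dpObj.processPhrase(phraseDataframe).persist()
tf.show()
val labeledpoints = tf.map(row => row.getAs[Vector]("rowFeatures"))
```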

Thank you.
Regards,
Bansari Shah

On Fri, Jan 20, 2017 at 2:54 AM, Felipe Oliveira <fe...@gmail.com> wrote:

> I would investigate GC in the JVMs of your executors; it is very common to
> see that error during long GC pauses.
>
> On Thu, Jan 19, 2017 at 9:44 AM, Donald Szeto <do...@apache.org> wrote:
>
>> Do you have more detail logs from Spark executors?
>>
>> Regards,
>> Donald

Re: Spark Executor lost error

Posted by Felipe Oliveira <fe...@gmail.com>.
I would investigate GC in the JVMs of your executors; it is very common to see
that error during long GC pauses.
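A hedged sketch of one way to surface those pauses — the configuration key is
a standard Spark setting, but the flags assume a HotSpot JVM (Java 7/8,
matching Spark 1.x) and the log path is only an example:

```scala
import org.apache.spark.SparkConf

// Sketch only: turn on GC logging for the executors so long pauses show up.
// Driver-side flags must be passed on the launch command instead, since the
// driver JVM is already running by the time this SparkConf is read.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/executor-gc.log")
```

If a pause near the failure time approaches the 120000 ms heartbeat timeout,
GC tuning or more executor memory is the place to look.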

On Thu, Jan 19, 2017 at 9:44 AM, Donald Szeto <do...@apache.org> wrote:

> Do you have more detail logs from Spark executors?
>
> Regards,
> Donald


-- 
Thank you,
Felipe

http://geeks.aretotally.in
http://twitter.com/_felipera

Re: Spark Executor lost error

Posted by Donald Szeto <do...@apache.org>.
Do you have more detailed logs from the Spark executors?

Regards,
Donald
