Posted to issues@spark.apache.org by "Nick Hryhoriev (JIRA)" <ji...@apache.org> on 2017/05/12 06:33:04 UTC

[jira] [Comment Edited] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

    [ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004733#comment-16004733 ] 

Nick Hryhoriev edited comment on SPARK-5594 at 5/12/17 6:32 AM:
----------------------------------------------------------------

Hi,
I have the same issue, but on Spark 2.1. And I can't remove the spark.cleaner.ttl configuration, because per SPARK-7689 it has already been removed.
What is strange is that the issue appeared after three weeks of running, and it reproduces even after restarting the job.
Env: YARN on EMR 5.3, Spark 2.1, checkpointing enabled.
Stack trace:
{quote}
2017-05-10 13:50:55 ERROR TaskSetManager:70 - Task 1 in stage 2.0 failed 4 times; aborting job
2017-05-10 13:50:55 ERROR JobScheduler:91 - Error running job streaming job 1494423050000 ms.2
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 46, ip-10-191-116-244.eu-west-1.compute.internal, executor 1): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at com.playtech.bit.rtv.Converter$.toEventDetailsRow(HitsUpdater.scala:183)
	at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:138)
	at com.playtech.bit.rtv.HitsUpdater$$anonfun$saveEventDetails$4$$anonfun$11.apply(HitsUpdater.scala:137)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at com.datastax.spark.connector.util.CountingIterator.next(CountingIterator.scala:16)
	at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
	at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:198)
	at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:175)
	at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
	at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
	at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
	at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
	at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:175)
	at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:162)
	at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:149)
	at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
	at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_141_piece0 of broadcast_141
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:178)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:150)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:150)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:218)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
{quote}

P.S. A cluster downscale was performed during this time, but I don't know whether that could add any trouble, because this is a streaming app and the node where the executor ran was not part of the downscale.

P.P.S. I can confirm that downscaling YARN has no effect on this: the problem appeared a few more times even without it. The only thing that helps is removing the checkpoint data from HDFS. For now I have turned checkpointing off on the stream, and the issue does not seem to appear any more.
But I still do not understand why it happens or what is wrong with the checkpoints. I really need checkpointing, because part of my business logic related to updates relies on the checkpoint functionality.
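
For what it's worth, the Spark Streaming programming guide notes that broadcast variables cannot be recovered from checkpoint data, which would explain why a restart from the HDFS checkpoint fails with "Failed to get broadcast". Below is a minimal sketch of the lazily-instantiated-singleton workaround described in the guide; the EventDetailsBroadcast name and the Map payload are made up for illustration, not taken from the failing job:
{noformat}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

// Lazily instantiated broadcast singleton. Broadcasts are not part of the
// checkpoint, so after a restart-from-checkpoint the value is re-broadcast
// on first use instead of referencing a stale id like broadcast_141 above.
object EventDetailsBroadcast {

  @volatile private var instance: Broadcast[Map[String, String]] = null

  def getInstance(sc: SparkContext): Broadcast[Map[String, String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Hypothetical payload; load whatever the job actually broadcasts.
          instance = sc.broadcast(Map("eventA" -> "detailsA"))
        }
      }
    }
    instance
  }
}
{noformat}
Inside foreachRDD the value would then be fetched with EventDetailsBroadcast.getInstance(rdd.sparkContext) rather than captured in the transformation closure, so a restarted driver re-creates the broadcast instead of looking up a broadcast piece that no longer exists.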



> SparkException: Failed to get broadcast (TorrentBroadcast)
> ----------------------------------------------------------
>
>                 Key: SPARK-5594
>                 URL: https://issues.apache.org/jira/browse/SPARK-5594
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: John Sandiford
>            Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below when running on a cluster (it works locally), and I have no idea what is causing it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, <removed>): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
>         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
>         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
>         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
>         at org.apache.spark.scheduler.Task.run(Task.scala:56)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
>         at scala.collection.immutable.List.foreach(List.scala:318)
>         at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
>         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
>         ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>         at scala.Option.foreach(Option.scala:236)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}


