Posted to user@spark.apache.org by Pei-Lun Lee <pl...@appier.com> on 2014/06/26 12:41:43 UTC

LiveListenerBus throws exception and weird web UI bug

Hi,

We have a long-running Spark application on a Spark 1.0 standalone
server, and after it runs for several hours the following exception shows up:


14/06/25 23:13:08 ERROR LiveListenerBus: Listener JobProgressListener threw
an exception
java.util.NoSuchElementException: key not found: 6375
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
        at
org.apache.spark.ui.jobs.JobProgressListener.onStageCompleted(JobProgressListener.scala:78)
        at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
        at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$2.apply(SparkListenerBus.scala:48)
        at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
        at
org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:79)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at
org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:79)
        at
org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:48)
        at
org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
        at scala.Option.foreach(Option.scala:236)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
        at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
        at
org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
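[Editor's sketch] The "key not found" line at the top of the trace comes from a strict map lookup: Scala's mutable.HashMap.apply throws java.util.NoSuchElementException when the key is absent, unlike Java's HashMap.get, which returns null. A minimal Java sketch of that strict-lookup behavior (strictGet is a hypothetical helper, not Spark code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

public class StrictMap {
    // Hypothetical helper mimicking Scala's mutable.HashMap.apply:
    // throw instead of returning null when the key is missing.
    static String strictGet(Map<Integer, String> m, int key) {
        String value = m.get(key);
        if (value == null) {
            throw new NoSuchElementException("key not found: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Integer, String> stageIdToInfo = new HashMap<>();
        stageIdToInfo.put(1, "submitted");
        System.out.println(strictGet(stageIdToInfo, 1));
        try {
            strictGet(stageIdToInfo, 6375);
        } catch (NoSuchElementException e) {
            // java.util.NoSuchElementException: key not found: 6375
            System.out.println(e);
        }
    }
}
```

This is why a lookup for a stage id the listener never saw surfaces as an uncaught exception rather than a null.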


And then the web UI (driver:4040) starts showing weird results like the
following (see attached screenshots):
1. negative active task counts
2. completed stages still in the active section, or showing incomplete tasks
3. unpersisted RDDs still on the storage page, with fraction cached < 100%

Eventually the application crashes, but this is usually the first exception
that shows up.
Any idea how to fix it?

--
Pei-Lun Lee


[image: inline image 2][image: inline image 4][image: inline image 3][image: inline image 1]

Re: LiveListenerBus throws exception and weird web UI bug

Posted by "余根茂(木艮)" <ge...@alibaba-inc.com>.
Hi all,

Here is my fix: https://github.com/apache/spark/pull/1356. It is not elegant, but it works well. Any suggestions?

 

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LiveListenerBus-throws-exception-and-weird-web-UI-bug-tp8330p10324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

 


Re: LiveListenerBus throws exception and weird web UI bug

Posted by Andrew Or <an...@databricks.com>.
Hi all,

This error happens because we receive a "completed" event for a particular
stage that we don't know about, i.e. a stage we haven't received a
"submitted" event for. The root cause of this, as Baoxu explained, is
usually because the event queue is full and the listener begins to drop
events. In this case we are dropping the "submitted" event. This particular
exception should be fixed in the latest master, as we now check whether
the key exists before indexing directly into it. Unfortunately, this is not
in Spark 1.0.1, but will be fixed in Spark 1.1. There is currently no
bullet-proof workaround for this issue, but you might try to reduce the
number of concurrently running tasks (partitions) to avoid emitting too
many events. The root cause of the listener queue taking too much time to
process events is recorded in SPARK-2316, which we also intend to fix by
Spark 1.1.
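[Editor's sketch] A rough sketch of the guard Andrew describes, with hypothetical names (the real fix lives in JobProgressListener.onStageCompleted): instead of indexing the map directly, check for the key and treat a miss as a dropped "submitted" event.

```java
import java.util.HashMap;
import java.util.Map;

public class StageTracker {
    private final Map<Integer, String> activeStages = new HashMap<>();

    public void onStageSubmitted(int stageId, String info) {
        activeStages.put(stageId, info);
    }

    // Returns true if the stage was known; false if its "submitted"
    // event was apparently dropped. We log instead of throwing, so one
    // missing event no longer kills the listener.
    public boolean onStageCompleted(int stageId) {
        String info = activeStages.remove(stageId);
        if (info == null) {
            System.err.println("Stage " + stageId
                + " completed without a matching submitted event; skipping");
            return false;
        }
        return true;
    }
}
```

Note that this only suppresses the exception; the UI counters that depended on the dropped event can still end up inconsistent, which matches the side effects reported in this thread.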

Andrew


2014-07-21 10:23 GMT-07:00 mrm <ma...@skimlinks.com>:

> I have the same error! Did you manage to fix it?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/LiveListenerBus-throws-exception-and-weird-web-UI-bug-tp8330p10324.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

Re: LiveListenerBus throws exception and weird web UI bug

Posted by mrm <ma...@skimlinks.com>.
I have the same error! Did you manage to fix it?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LiveListenerBus-throws-exception-and-weird-web-UI-bug-tp8330p10324.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: LiveListenerBus throws exception and weird web UI bug

Posted by Pei-Lun Lee <pl...@appier.com>.
Hi Baoxu, thanks for sharing.


2014-06-26 22:51 GMT+08:00 Baoxu Shi(Dash) <bs...@nd.edu>:

> [Baoxu's reply, quoting the original message and stack trace, trimmed; it appears in full below.]

Re: LiveListenerBus throws exception and weird web UI bug

Posted by "Baoxu Shi(Dash)" <bs...@nd.edu>.
Hi Pei-Lun,

I have the same problem here. The issue is SPARK-2228; someone has also posted a pull request for it, but it only eliminates the exception, not the side effects.

I think the problem may be due to the hard-coded private val EVENT_QUEUE_CAPACITY = 10000

in core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala. There is a chance that when the event queue is full, the system starts dropping events, causing "key not found" errors because those events were never submitted.

Don't know if that helps.
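[Editor's sketch] The dropping behavior described here can be sketched with a bounded queue (the names are illustrative, not the actual LiveListenerBus implementation): once the queue reaches capacity, a non-blocking offer() fails and the event is silently lost, so a stage's "submitted" event can vanish while its "completed" event survives.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedEventBus {
    // Tiny capacity for illustration; the hard-coded Spark value is 10000.
    static final int EVENT_QUEUE_CAPACITY = 4;

    private final BlockingQueue<String> queue =
        new ArrayBlockingQueue<>(EVENT_QUEUE_CAPACITY);
    private int dropped = 0;

    // offer() is non-blocking and returns false when the queue is full,
    // which is the point at which events start getting dropped.
    public boolean post(String event) {
        boolean added = queue.offer(event);
        if (!added) {
            dropped++;
        }
        return added;
    }

    public int droppedCount() {
        return dropped;
    }
}
```

Posting six events into a capacity-4 queue drops two of them; if either was a "submitted" event, the matching "completed" event later triggers the missing-key lookup.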

On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee <pl...@appier.com> wrote:

> [original message and stack trace quoted in full; trimmed, see the top of this thread.]