Posted to user@spark.apache.org by David McWhorter <mc...@ccri.com> on 2014/12/19 20:35:10 UTC

DAGScheduler StackOverflowError

Hi all,

I'm developing a Spark application in which I need to iteratively update an 
RDD over a large number of iterations (1000+).  From reading online, 
I found that I should call .checkpoint() to keep the lineage graph from 
growing too large.  Even when doing this, I keep getting 
StackOverflowErrors in DAGScheduler, such as the one below.  I've 
attached a sample application that illustrates what I'm trying to do.
Can anyone point out how I can keep the DAG from growing so large that 
Spark is unable to process it?
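
(For anyone hitting this later: one detail that is easy to miss is that 
.checkpoint() only marks an RDD; the lineage is actually truncated only 
after the RDD is materialized by an action.  A common pattern is to set 
sc.setCheckpointDir(...), then every N iterations call rdd.checkpoint() 
followed immediately by an action such as rdd.count().  The exact cadence 
is workload-dependent.  The toy Python sketch below is not Spark code -- it 
just models the failure mode: each "iteration" adds one parent link, and a 
recursive walk over the chain, standing in for DAGScheduler's recursive 
submitStage/getMissingParentStages traversal, overflows unless the chain 
is periodically truncated.)

```python
class Node:
    """Toy stand-in for an RDD: each transformation adds one parent link."""
    def __init__(self, parent=None):
        self.parent = parent

def depth(node):
    """Recursive walk over the lineage, mimicking the scheduler's
    recursive stage traversal seen in the stack trace."""
    return 0 if node.parent is None else 1 + depth(node.parent)

def run(iterations, checkpoint_every=None):
    rdd = Node()
    for i in range(1, iterations + 1):
        rdd = Node(parent=rdd)        # "transformation": lineage grows by one
        if checkpoint_every and i % checkpoint_every == 0:
            rdd.parent = None         # "checkpoint + action": truncate lineage
    return rdd

# Unbounded lineage: the recursive walk blows the stack, just as the
# DAGScheduler does in the attached trace.
deep = run(100_000)
try:
    depth(deep)
    overflowed = False
except RecursionError:
    overflowed = True
assert overflowed

# Periodic truncation keeps the walk shallow regardless of iteration count.
assert depth(run(100_000, checkpoint_every=50)) <= 50
```

In real Spark code the analogue of the truncation line is checkpointing 
plus a forcing action (and often persist() before the checkpoint so the 
RDD is not recomputed); without the action, the marked checkpoint never 
happens and the DAG keeps growing.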

Thank you,
David


java.lang.StackOverflowError
         at scala.collection.generic.GenMapFactory$MapCanBuildFrom.scala$collection$generic$GenMapFactory$MapCanBuildFrom$$$outer(GenMapFactory.scala:57)
         at scala.collection.generic.GenMapFactory$MapCanBuildFrom.apply(GenMapFactory.scala:58)
         at scala.collection.generic.GenMapFactory$MapCanBuildFrom.apply(GenMapFactory.scala:57)
         at scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:154)
         at scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
         at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
         at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
         at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
         at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
         at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
         at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
         at scala.collection.TraversableOnce$$anonfun$toMap$1.apply(TraversableOnce.scala:280)
         at scala.collection.TraversableOnce$$anonfun$toMap$1.apply(TraversableOnce.scala:279)
         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
         at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
         at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
         at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
         at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
         at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279)
         at scala.collection.AbstractTraversable.toMap(Traversable.scala:105)
         at org.apache.spark.storage.BlockManager$.blockIdsToBlockManagers(BlockManager.scala:1264)
         at org.apache.spark.scheduler.DAGScheduler.getCacheLocs(DAGScheduler.scala:199)
         at org.apache.spark.scheduler.DAGScheduler.visit$3(DAGScheduler.scala:372)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getMissingParentStages(DAGScheduler.scala:389)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:774)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
         at scala.collection.immutable.List.foreach(List.scala:318)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
         at scala.collection.immutable.List.foreach(List.scala:318)
         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:780)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:781)
         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:780)
         .... (last 4 frames repeating thousands of times)


-- 

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhorter@ccri.com | 434.299.0090x204