Posted to issues@spark.apache.org by "Madhusudanan Kandasamy (JIRA)" <ji...@apache.org> on 2015/06/29 22:32:04 UTC
[jira] [Commented] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

    [ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606334#comment-14606334 ] 

Madhusudanan Kandasamy commented on SPARK-8707:
-----------------------------------------------

One possible solution is to add a new version of getRDDStorageInfo that accepts an RDD id as an argument and calls RDDInfo.fromRdd only for that RDD. RDD.toDebugString can then pass RDD.id as the argument to this method.
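
A rough sketch of the idea, using toy stand-ins rather than the real Spark classes. ToyRdd, ToyRddInfo, ToyContext, allStorageInfo and storageInfoFor are all hypothetical names made up for illustration; the actual change would live in SparkContext and RDDInfo and would likely look different.

```scala
import java.io.IOException

// Toy stand-in for RDDInfo (hypothetical, not a Spark class).
final case class ToyRddInfo(id: Int, numPartitions: Int)

// Toy stand-in for an RDD. Like RDD.partitions, numPartitions may throw
// if the underlying input is invalid (for example, a missing file).
final class ToyRdd(val id: Int, computePartitions: () => Int) {
  def numPartitions: Int = computePartitions()
}

// Toy stand-in for the part of SparkContext that tracks cached RDDs.
final class ToyContext(persistentRdds: Map[Int, ToyRdd]) {

  // Mirrors the current behaviour: info is built for every cached RDD,
  // so a single invalid RDD makes the whole call fail.
  def allStorageInfo: Vector[ToyRddInfo] =
    persistentRdds.values.toVector.map(r => ToyRddInfo(r.id, r.numPartitions))

  // Proposed variant: build info only for the RDD the caller asked about.
  // Other cached RDDs are never touched, so they cannot break this call.
  def storageInfoFor(rddId: Int): Option[ToyRddInfo] =
    persistentRdds.get(rddId).map(r => ToyRddInfo(r.id, r.numPartitions))
}
```

With one invalid and one valid cached entry, allStorageInfo throws, while storageInfoFor called with the valid id returns its info without evaluating the invalid one. In this sketch an id that is not cached yields None.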

> RDD#toDebugString fails if any cached RDD has invalid partitions
> ----------------------------------------------------------------
>
>                 Key: SPARK-8707
>                 URL: https://issues.apache.org/jira/browse/SPARK-8707
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0, 1.4.1
>            Reporter: Aaron Davidson
>              Labels: starter
>
> Repro:
> {code}
> sc.textFile("/ThisFileDoesNotExist").cache()
> sc.parallelize(0 until 100).toDebugString
> {code}
> Output:
> {code}
> java.io.IOException: Not a file: /ThisFileDoesNotExist
> 	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
> 	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> 	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> 	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> 	at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
> 	at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
> 	at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
> 	at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
> 	at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
> 	at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637)
> {code}
> This is because toDebugString gets all the partitions from all RDDs, which fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be resilient to other RDDs being invalid (and getRDDStorageInfo should probably also be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)