Posted to issues@spark.apache.org by "Madhusudanan Kandasamy (JIRA)" <ji...@apache.org> on 2015/06/29 22:32:04 UTC
[jira] [Commented] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions
[ https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606334#comment-14606334 ]
Madhusudanan Kandasamy commented on SPARK-8707:
-----------------------------------------------
One possible solution is to add a new version of getRDDStorageInfo that accepts an RDD id as an argument and calls RDDInfo.fromRdd only for the RDD concerned. RDD.toDebugString can then pass RDD.id as the argument to this method.
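To make the idea concrete, here is a minimal sketch that does not use Spark at all. FakeRdd, FakeRddInfo and FakeContext are hypothetical stand-ins for RDD, RDDInfo and SparkContext, and the one-argument getRDDStorageInfo is only the overload suggested above, not an existing Spark API.

```scala
// Simplified stand-ins for Spark's RDDInfo, RDD and SparkContext (hypothetical, not the real classes).
final case class FakeRddInfo(id: Int, numPartitions: Int)

final class FakeRdd(val id: Int, computePartitions: () => Int) {
  // Like RDD.partitions, this may throw if the underlying input is invalid.
  def numPartitions: Int = computePartitions()
}

final class FakeContext(persistentRdds: Map[Int, FakeRdd]) {
  // Current behaviour: builds info for every cached RDD, so one invalid RDD fails the whole call.
  def getRDDStorageInfo(): Seq[FakeRddInfo] =
    persistentRdds.values.toSeq.map(r => FakeRddInfo(r.id, r.numPartitions))

  // Suggested overload (assumption): only touches the RDD with the given id.
  def getRDDStorageInfo(rddId: Int): Option[FakeRddInfo] =
    persistentRdds.get(rddId).map(r => FakeRddInfo(r.id, r.numPartitions))
}

object Sketch {
  // An RDD whose partitions cannot be computed, mirroring the repro in the issue.
  val bad = new FakeRdd(0, () => throw new java.io.IOException("Not a file: /ThisFileDoesNotExist"))
  // A healthy RDD with an arbitrary partition count.
  val good = new FakeRdd(1, () => 4)
  val ctx = new FakeContext(Map(bad.id -> bad, good.id -> good))
}
```

In this model, Sketch.ctx.getRDDStorageInfo() throws because of the invalid RDD, while Sketch.ctx.getRDDStorageInfo(Sketch.good.id) succeeds because the invalid RDD is never touched.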
> RDD#toDebugString fails if any cached RDD has invalid partitions
> ----------------------------------------------------------------
>
> Key: SPARK-8707
> URL: https://issues.apache.org/jira/browse/SPARK-8707
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Aaron Davidson
> Labels: starter
>
> Repro:
> {code}
> sc.textFile("/ThisFileDoesNotExist").cache()
> sc.parallelize(0 until 100).toDebugString
> {code}
> Output:
> {code}
> java.io.IOException: Not a file: /ThisFileDoesNotExist
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
> at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
> at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
> at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
> at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
> at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
> at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637)
> {code}
> This happens because toDebugString, via SparkContext#getRDDStorageInfo, computes the partitions of every cached RDD, and that computation fails for the invalid one. toDebugString should definitely be resilient to other RDDs being invalid, and getRDDStorageInfo probably should be as well.
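As a rough illustration of the resilience asked for in the description, the sketch below skips entries whose computation throws instead of failing the whole call. It is plain Scala with no Spark dependency; collectValid and partitionCount are hypothetical names, and skipping failures silently is only one possible policy (logging them may be preferable).

```scala
import scala.util.Try

object ResilientInfo {
  // Hypothetical helper: compute one result per item and drop the items whose computation throws.
  def collectValid[A, B](items: Seq[A])(compute: A => B): Seq[B] =
    items.flatMap(a => Try(compute(a)).toOption)

  // Hypothetical stand-in for computing partitions: the invalid path throws, any other path yields 4.
  def partitionCount(path: String): Int =
    if (path == "/ThisFileDoesNotExist") throw new java.io.IOException(s"Not a file: $path")
    else 4

  // Only the valid path contributes an entry.
  val counts: Seq[Int] = collectValid(Seq("/ThisFileDoesNotExist", "/valid"))(partitionCount)
}
```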