Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/03/10 16:45:41 UTC
[jira] [Resolved] (SPARK-13758) Error message is misleading when RDD refer to null spark context
[ https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-13758.
-------------------------------
Resolution: Fixed
Fix Version/s: 2.0.0
Issue resolved by pull request 11595
[https://github.com/apache/spark/pull/11595]
> Error message is misleading when RDD refer to null spark context
> ----------------------------------------------------------------
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Spark Core, Streaming
> Reporter: Mao, Wei
> Priority: Trivial
> Fix For: 2.0.0
>
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs correctly the first time, but throws the following exception when restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
> at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
> at org.apache.spark.rdd.RDD.union(RDD.scala:565)
> at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
> at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
> at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, it appears that I invoked transformations and actions inside other transformations, but I did not. The real cause is that I used an external RDD in a DStream operation. External RDD data is not stored in the checkpoint, so during recovery the RDD's internal SparkContext reference (_sc) comes back as null and hits the above exception. The error message is misleading: it says nothing about the real issue.
> Here is the code to reproduce it.
> {code:java}
> import org.apache.spark.SparkConf
> import org.apache.spark.rdd.RDD
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: String): StreamingContext = {
>     println("Creating new context")
>     val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
>     val ssc = new StreamingContext(sparkConf, Seconds(2))
>     ssc.checkpoint(checkpointDirectory)
>     // External RDD captured by the foreachRDD closure below; its data is
>     // not stored in the checkpoint, so it is broken after recovery.
>     var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
>     val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
>     words.foreachRDD((rdd: RDD[String]) => {
>       val res = rdd.map(word => (word, word.length)).collect()
>       println("words: " + res.mkString(", "))
>       cached = cached.union(rdd)
>       cached.checkpoint()
>       println("cached words: " + cached.collect.mkString(", "))
>     })
>     ssc
>   }
>
>   def main(args: Array[String]) {
>     val ip = "localhost"
>     val port = 9999
>     val dir = "/home/maowei/tmp"
>     val ssc = StreamingContext.getOrCreate(dir, () => createContext(ip, port, dir))
>     ssc.start()
>     ssc.awaitTermination()
>   }
> }
> {code}
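The failure mechanism above — a transient field that is skipped during serialization and restored as null — can be illustrated without Spark at all. The sketch below is a minimal stand-in, not Spark's actual RDD implementation: `FakeRdd`, `CheckpointDemo`, and their members are hypothetical names, with the transient field playing the role of RDD's SparkContext reference.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an RDD. The context field is @transient,
// as RDD's SparkContext reference is, so it is dropped when the object
// is serialized (analogous to being written into a checkpoint).
class FakeRdd extends Serializable {
  @transient var context: AnyRef = new Object
  val data: String = "apple, banana"

  // Mimics RDD's sc accessor, which raises the misleading message
  // when the context reference is null.
  def sc: AnyRef = {
    if (context == null)
      throw new IllegalStateException(
        "RDD transformations and actions can only be invoked by the driver ...")
    context
  }
}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val original = new FakeRdd
    original.sc // fine before the round trip

    // Serialize and deserialize, mimicking checkpoint write + recovery.
    val bos = new ByteArrayOutputStream()
    new ObjectOutputStream(bos).writeObject(original)
    val recovered = new ObjectInputStream(
      new ByteArrayInputStream(bos.toByteArray)).readObject().asInstanceOf[FakeRdd]

    println("data survives: " + recovered.data)
    println("context is null: " + (recovered.context == null))
  }
}
```

After the round trip, `data` survives but `context` is null, so calling `recovered.sc` throws the same kind of misleading error: the message blames nested transformations, while the real problem is an object restored without its driver-side context.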
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org