Posted to user@spark.apache.org by Marius Soutier <mp...@gmail.com> on 2015/03/11 18:35:18 UTC

Spark Streaming recover from Checkpoint with Spark SQL

Hi,

I’ve written a Spark Streaming job that inserts into a Parquet table, using stream.foreachRDD(_.insertInto("table", overwrite = true)). Now I’ve added checkpointing; everything works fine when starting from scratch. When starting from a checkpoint, however, the job fails and produces the following exception in the foreachRDD:

ERROR org.apache.spark.streaming.scheduler.JobScheduler: Error running job streaming job 1426093830000 ms.2
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
	at org.apache.spark.rdd.RDD.sc(RDD.scala:90)
	at org.apache.spark.rdd.RDD.<init>(RDD.scala:143)
	at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
	at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:114)
	at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
	at MyStreamingJob$$anonfun$createComputation$3.apply(MyStreamingJob:167)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:535)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
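
For reference, the relevant part of the job looks roughly like this (simplified; ssc, stream, and the table name are placeholders, and stream is a DStream of a case class):

    import org.apache.spark.sql.SQLContext

    // Created once on the driver at startup; this instance ends up in the
    // checkpoint and is no longer valid after recovery.
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

    stream.foreachRDD { rdd =>
      // The implicit conversion calls sqlContext.createSchemaRDD here, which
      // matches the createSchemaRDD frame in the stack trace above.
      rdd.insertInto("table", overwrite = true)
    }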




Cheers
- Marius




Re: Spark Streaming recover from Checkpoint with Spark SQL

Posted by Marius Soutier <mp...@gmail.com>.
Thanks, the new guide did help: instantiating the SQLContext inside foreachRDD did the trick for me, and the SQLContext singleton works as well.
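
For anyone finding this later, here's roughly what the singleton variant looks like (a sketch based on the guide; stream and the table name are placeholders, and stream is a DStream of a case class):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Lazily instantiated singleton, so the SQLContext is re-created on the
    // driver after checkpoint recovery instead of being restored in a broken state.
    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      def getInstance(sc: SparkContext): SQLContext = synchronized {
        if (instance == null) instance = new SQLContext(sc)
        instance
      }
    }

    stream.foreachRDD { rdd =>
      val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
      import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD
      rdd.insertInto("table", overwrite = true)
    }

Either way, the point seems to be that no SQLContext is captured in the checkpointed closures; it is obtained from the RDD's SparkContext at runtime instead.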

Now the only problem left is that spark.driver.port is not retained after starting from a checkpoint, so my Actor receivers are running on a random port...
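
For now I'm trying to pin the port in the SparkConf that creates the context (just a sketch; the port number is arbitrary, and I haven't verified that the setting survives recovery from a checkpoint):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MyStreamingJob")
      .set("spark.driver.port", "7078")  // any fixed, free port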


> On 12.03.2015, at 02:35, Tathagata Das <td...@databricks.com> wrote:
> 
> Can you show us the code that you are using?
> 
> This might help: this is the updated streaming programming guide for 1.3, soon to be published. Here is a quick preview:
> http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
> 
> TD
> 
> On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier <mps.dev@gmail.com> wrote:
> Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")).
> 
> > On 11.03.2015, at 18:35, Marius Soutier <mps.dev@gmail.com> wrote:
> > [...]


Re: Spark Streaming recover from Checkpoint with Spark SQL

Posted by Tathagata Das <td...@databricks.com>.
Can you show us the code that you are using?

This might help: this is the updated streaming programming guide for 1.3, soon to be published. Here is a quick preview:
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations

TD

On Wed, Mar 11, 2015 at 12:20 PM, Marius Soutier <mp...@gmail.com> wrote:

> Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")).
>
> > On 11.03.2015, at 18:35, Marius Soutier <mp...@gmail.com> wrote:
> > [...]

Re: Spark Streaming recover from Checkpoint with Spark SQL

Posted by Marius Soutier <mp...@gmail.com>.
Forgot to mention, it works when using .foreachRDD(_.saveAsTextFile("")).

> On 11.03.2015, at 18:35, Marius Soutier <mp...@gmail.com> wrote:
> [...]


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org