Posted to dev@spark.apache.org by Liang-Chi Hsieh <vi...@gmail.com> on 2017/03/03 02:58:26 UTC

Re: How to cache SparkPlan.execute for reusing?

Internally, in each partition of the resulting RDD[InternalRow], the iterator
returns the same UnsafeRow object as it walks over the rows; the object is
just mutated in place. A plain RDD.cache therefore doesn't work there: you
end up caching many references to one mutable row, so the cached output shows
identical rows. I'm not sure why you see empty output, though.
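
For illustration, roughly the pitfall looks like this (a sketch only; the
SparkPlan is assumed to come from wherever you build your physical plans):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.SparkPlan

    // Sketch of the pitfall: the iterator produced by execute() keeps
    // mutating one shared UnsafeRow per partition, so cache() ends up
    // storing many references to that single mutable object.
    def cacheRaw(plan: SparkPlan): RDD[InternalRow] =
      plan.execute().cache()   // wrong: cached entries alias one row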

Dataset.cache() is the intended way to cache SQL query results. Even if you
do cache the RDD[InternalRow] via RDD.cache using the trick of copying each
row first (with a significant performance penalty), a new query (plan) will
not automatically reuse the cached RDD, because each execution creates new
RDDs.
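
The copy trick mentioned above looks roughly like this (a sketch, not code
taken from Spark itself):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.SparkPlan

    // Copy each reused UnsafeRow into a fresh, independent object before
    // caching; this fixes the aliasing at the cost of one allocation per row.
    def cacheCopied(plan: SparkPlan): RDD[InternalRow] =
      plan.execute().map(_.copy()).cache()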


summerDG wrote
> We are optimizing Spark SQL for adaptive execution, so a SparkPlan may be
> reused when choosing strategies. But we find that once the result of
> SparkPlan.execute, an RDD[InternalRow], is cached using RDD.cache, the
> query output is empty.
> 1. How can we cache the result of SparkPlan.execute?
> 2. Why doesn't RDD.cache work for RDD[InternalRow]?





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: How to cache SparkPlan.execute for reusing?

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Not sure what you mean by "its parents have to reuse it by creating new
RDDs".

Since SparkPlan.execute returns a new RDD every time it is called, you
shouldn't expect the cached RDD to be reused automatically, even if you reuse
the same SparkPlan across several queries.
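
To make that concrete, a sketch (again with an assumed SparkPlan in hand):

    import org.apache.spark.sql.execution.SparkPlan

    // Two calls to execute() on the same SparkPlan build two unrelated
    // RDD lineages; caching the first does nothing for the second.
    def demoNoReuse(plan: SparkPlan): Unit = {
      val first = plan.execute().map(_.copy()).cache()
      first.count()               // materializes the cached copy
      val second = plan.execute() // a fresh RDD; `first` is never consulted
      second.count()              // recomputes the plan from scratch
    }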

Btw, is there any existing way to reuse a SparkPlan?



summerDG wrote
> Thank you very much. The reason the output is empty is that the query
> involves a join; I forgot to mention that in the question. So even if I
> succeed in caching the RDD, the SparkPlans downstream in the query will
> not reuse it.
> If a SparkPlan in the query has several "parent" nodes, do its "parents"
> have to reuse it by creating new RDDs?





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
