Posted to dev@spark.apache.org by Chang Chen <ba...@gmail.com> on 2020/02/01 06:48:30 UTC

[DISCUSS] Caching SparkPlan

I'd like to start a discussion on caching SparkPlan.

From my benchmarks, if SQL execution time is under one second, the
following overheads can no longer be ignored, especially when the data is
cached in memory (a rough way to observe each phase is sketched after the
list):

   1. Parsing, analyzing, and optimizing the SQL
   2. Generating the physical plan (SparkPlan)
   3. Code generation
   4. Generating the RDD graph
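
For reference, here is a minimal sketch (my own illustration, not the
benchmark code itself) of how the per-phase overhead can be observed on a
warm SparkSession. QueryExecution exposes each phase as a lazy val, so
forcing them one by one gives a rough per-phase timing:

    import org.apache.spark.sql.SparkSession

    object PhaseTiming {
      def time[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label%-16s ${(System.nanoTime() - start) / 1e6}%.1f ms")
        result
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").getOrCreate()
        import spark.implicits._
        Seq(1, 1, 2).toDF("intField").createOrReplaceTempView("t")

        // spark.sql parses and analyzes eagerly
        val df = time("parse+analyze") {
          spark.sql("SELECT intField, count(*), sum(intField) FROM t " +
            "WHERE intField = 1 GROUP BY intField")
        }
        val qe = df.queryExecution
        time("optimize")      { qe.optimizedPlan }
        time("physical plan") { qe.sparkPlan }
        time("codegen+RDD")   { qe.executedPlan.execute() }
        time("run job")       { df.collect() }
        spark.stop()
      }
    }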

SnappyData (https://github.com/SnappyDataInc/snappydata) already implements
a plan cache for queries with the same pattern. For example, given the SQL
SELECT intField, count(*), sum(intField) FROM t WHERE intField = 1 GROUP BY
intField, any query that differs only in its literal values can reuse the
same SparkPlan. SnappyData's implementation caches the final RDD graph;
since Spark caches shuffle data internally, I believe it cannot run the
same query pattern concurrently.
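
To make "same query pattern" concrete, the cache key could be derived by
normalizing literals in the analyzed plan, along these lines (a minimal
sketch pasteable into spark-shell, not SnappyData's actual code;
planCacheKey is a name I made up):

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Replace every Literal with a type-preserving placeholder, so queries
    // that differ only in literal values map to the same key.
    def planCacheKey(plan: LogicalPlan): LogicalPlan =
      plan.transformAllExpressions {
        case lit: Literal => Literal.default(lit.dataType)
      }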

My idea is to cache the SparkPlan (and hence the generated code) while
creating the RDD graph on every execution. RDD creation is the only
overhead we cannot avoid, but if we read data from memory it is
acceptable. I did some experiments by extracting ParamLiteral from
SnappyData, and the results look fine.
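
Concretely, the proposal would look something like the sketch below
(hypothetical names, reusing planCacheKey from the sketch above; binding
the actual literal values via ParamLiteral is elided):

    import java.util.concurrent.ConcurrentHashMap
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Cache the SparkPlan (and with it the generated code) keyed on the
    // literal-normalized plan; execute() is called on every invocation, so
    // a fresh RDD graph is built each time and no shuffle state is shared.
    val planCache = new ConcurrentHashMap[LogicalPlan, SparkPlan]()

    def executeCached(analyzed: LogicalPlan,
                      toSparkPlan: LogicalPlan => SparkPlan): RDD[InternalRow] = {
      val key = planCacheKey(analyzed)
      val plan = planCache.computeIfAbsent(key, _ => toSparkPlan(analyzed))
      plan.execute() // new RDD graph every call
    }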

I want to gather some comments before filing a JIRA. I would appreciate
feedback and discussion from the community.

Thanks
Chang