Posted to dev@spark.apache.org by Daoyuan Wang <me...@daoyuan.wang> on 2019/02/26 01:05:54 UTC

Re: Re: [DISCUSS] SPIP: Relational Cache

Thanks for your advice. I'm fine with staying third-party at the current stage and waiting for some customer feedback before finally deciding whether it should go upstream. Relational cache is more than materialized views: the rewrite also works with Spark's in-memory cache, and we will also implement a lazy caching strategy.


Best Regards,
Daoyuan

-------------- Original message --------------
From: "Xiao Li" <li...@databricks.com>;
Sent: Tuesday, February 26, 2019, 3:45 AM
To: "Reynold Xin" <rx...@databricks.com>;
Cc: "Daoyuan Wang" <me...@daoyuan.wang>; "dev" <de...@spark.apache.org>;
Subject: Re: [DISCUSS] SPIP: Relational Cache
-----------------------------------

Implementing materialized views is complex. How about doing this as a third-party package at the current stage, until the solution is completely ready for the end users of Apache Spark?


We can plug in an optimization rule to implement such query rewriting, right? Since the last release, users can also specify options in the CACHE TABLE SQL command. You can reuse this interface if you do not want to formally call it materialized views.

Thanks,


Xiao

On Sun, Feb 24, 2019 at 4:35 PM Reynold Xin <rx...@databricks.com> wrote:

How is this different from materialized views?


On Sun, Feb 24, 2019 at 3:44 PM Daoyuan Wang <me...@daoyuan.wang> wrote:

Hi everyone,


We'd like to discuss our proposal for a Spark relational cache in this thread. Spark has a native command for RDD caching, but the CACHE command in Spark SQL is limited: the cache cannot be used across sessions, and we have to rewrite queries ourselves to make use of an existing cache.
To resolve this, we have done some initial work to do the following:


 1. Allow users to persist the cache on HDFS in Parquet format.
 2. Rewrite user queries in Catalyst to use any existing cache (on HDFS, or defined as in-memory in the current session) where possible.
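
To make item 2 concrete, here is a minimal toy sketch of the idea in plain Python. It is not the SPIP's implementation and does not use Catalyst or any Spark API: the `Plan` class, the `rewrite` function, the `CachedScan` operator name, and the HDFS path are all hypothetical names chosen for illustration. It only handles exact subtree matches; a real rewrite would presumably also need to handle partial matches.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical, not a Catalyst class)."""
    op: str                            # e.g. "Scan", "Filter", "Aggregate"
    arg: str = ""                      # table name, predicate, etc.
    children: Tuple["Plan", ...] = ()  # child plans, empty for leaves


def rewrite(plan: Plan, caches: Dict[Plan, str]) -> Plan:
    """Replace the largest subtrees that exactly match a cached plan."""
    if plan in caches:
        # The whole subtree is already cached: read the cache instead.
        return Plan("CachedScan", caches[plan])
    # Otherwise keep this node and try to rewrite each child.
    return Plan(plan.op, plan.arg,
                tuple(rewrite(child, caches) for child in plan.children))


# Cached earlier: SELECT * FROM sales WHERE year = 2018
cached = Plan("Filter", "year = 2018", (Plan("Scan", "sales"),))
caches = {cached: "hdfs:///cache/sales_2018.parquet"}  # hypothetical location

# New query: an aggregate on top of the same filtered scan.
query = Plan("Aggregate", "sum(amount) by region", (cached,))
rewritten = rewrite(query, caches)
print(rewritten)
```

In this sketch the aggregate stays in place and only its input is swapped for a scan of the cached data, which is the effect the user would otherwise have to achieve by rewriting the query by hand.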


I have created a JIRA ticket (https://issues.apache.org/jira/browse/SPARK-26764) for this and attached an official SPIP document.


Thanks for taking a look at the proposal.


Best Regards,
Daoyuan
