Posted to dev@calcite.apache.org by JD Zheng <jd...@gmail.com> on 2017/06/20 02:36:21 UTC

Use spark as Calcite execution engine

Hi, 

We are using the Calcite Druid adapter to query data stored in Druid, but many operations are not pushed down to Druid, and the built-in enumerable execution engine becomes the bottleneck for the queries that are not pushed down. Since we also have use cases that join data in Druid with outside data sources, using the Spark execution engine seems to be the way to go.

Has anyone used the Spark adapter to make Spark the Calcite execution engine? How mature is the Spark adapter? Is there any documentation on how to use it? How does it compare to using Hive?

Thanks,
-JD

Re: Use spark as Calcite execution engine

Posted by Julian Hyde <jh...@apache.org>.
JD,

What you want to do is very reasonable. Druid could benefit from a distributed execution engine that is able to do shuffles, parallel sorts, and parallel UDFs. Spark, Hive, Drill, Presto are examples of such engines.

Phoenix has a similar requirement, and “Drillix” (a combination of Phoenix with Drill) was proposed[1].

As has been discussed on this list previously, the Spark adapter is not production quality. Contributions welcome.
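For reference, such as it is, the adapter is switched on through a Calcite connection property rather than a separate API; a minimal, untested sketch of the connect string (assumes the calcite-spark module is on the classpath):

```
jdbc:calcite:spark=true
```

With the property set, Calcite is supposed to generate Spark RDD programs for the parts of a plan that cannot be pushed to the source system, instead of enumerable code.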

Julian

[1] http://phoenix.apache.org/presentations/Drillix.pdf

> On Jun 20, 2017, at 4:34 PM, JD Zheng <jd...@gmail.com> wrote:
> 
> Hi, Vladimir,
> 
> 
>> On Jun 20, 2017, at 2:29 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
>> 
>> JD>we still face the join problem though
>> 
>> Can you please clarify what is the typical dataset you are trying to join
>> (in the number of rows/bytes)?
> 
> Our dataset size varies and can exceed 100 GB. Since Druid does not support joins, we cannot push the join down. If we do the join in Calcite, the query is limited by a single host's memory. That's why I am considering the Spark engine to address this concern.
> 
>> Am I right you somehow struggle with "fetch everything from Druid and join
>> via Enumerable" and you are completely fine with "fetch everything from
>> Druid and join via Spark”?
> I am trying to see if we could do something similar to what Hive does: pull Druid segments directly and do the rest in Spark.
> 
>> I'm not sure Spark itself would make things way faster.
>> 
> At least we won't be bound by a single host's memory.
> 
>> Could you share some queries along with dataset sizes and expected/actual
>> execution plans?
>> 
> 
>> Vladimir
> 


Re: Use spark as Calcite execution engine

Posted by JD Zheng <jd...@gmail.com>.
Hi, Vladimir,

 
> On Jun 20, 2017, at 2:29 PM, Vladimir Sitnikov <si...@gmail.com> wrote:
> 
> JD>we still face the join problem though
> 
> Can you please clarify what is the typical dataset you are trying to join
> (in the number of rows/bytes)?

Our dataset size varies and can exceed 100 GB. Since Druid does not support joins, we cannot push the join down. If we do the join in Calcite, the query is limited by a single host's memory. That's why I am considering the Spark engine to address this concern.
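To illustrate the difference (a toy sketch in Python, not Calcite or Spark code): a single-host hash join must hold the entire build side in one process's memory, whereas a shuffle-style join hash-partitions both inputs by key so each partition can be joined independently, bounding per-worker memory by the size of one partition.

```python
from collections import defaultdict

def shuffle_join(left, right, key_left, key_right, num_partitions=4):
    """Toy shuffle join: route rows to partitions by hash of the join
    key, then hash-join each partition independently, the way separate
    workers in a distributed engine would."""
    # "Shuffle": every row lands in the partition chosen by its key,
    # so matching keys from both sides always meet in one partition.
    lparts = [[] for _ in range(num_partitions)]
    rparts = [[] for _ in range(num_partitions)]
    for row in left:
        lparts[hash(row[key_left]) % num_partitions].append(row)
    for row in right:
        rparts[hash(row[key_right]) % num_partitions].append(row)

    # Each partition is joined on its own; peak memory per "worker"
    # is one partition's build table, not the whole build side.
    out = []
    for lp, rp in zip(lparts, rparts):
        build = defaultdict(list)
        for row in lp:
            build[row[key_left]].append(row)
        for row in rp:
            for match in build[row[key_right]]:
                out.append({**match, **row})
    return out
```

In a real engine the partitions live on different hosts; here they are just separate lists, but the memory accounting is the same.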

> Am I right you somehow struggle with "fetch everything from Druid and join
> via Enumerable" and you are completely fine with "fetch everything from
> Druid and join via Spark”?
I am trying to see if we could do something similar to what Hive does: pull Druid segments directly and do the rest in Spark.

> I'm not sure Spark itself would make things way faster.
> 
At least we won't be bound by a single host's memory.

> Could you share some queries along with dataset sizes and expected/actual
> execution plans?
> 

> Vladimir


Re: Use spark as Calcite execution engine

Posted by Vladimir Sitnikov <si...@gmail.com>.
JD>we still face the join problem though

Can you please clarify what is the typical dataset you are trying to join
(in the number of rows/bytes)?
Am I right you somehow struggle with "fetch everything from Druid and join
via Enumerable" and you are completely fine with "fetch everything from
Druid and join via Spark"?

I'm not sure Spark itself would make things way faster.

Could you share some queries along with dataset sizes and expected/actual
execution plans?

Vladimir

Re: Use spark as Calcite execution engine

Posted by JD Zheng <jd...@gmail.com>.
Hi, Josh,

Thank you very much for your response. We are trying to push as many operations down to Druid as possible. While we are tuning performance to match calling the Druid API directly, we still face the join problem. It would be nice if we could make the Spark adapter work; that would make Calcite much more powerful.

I looked at the code base a little; the Spark adapter has not had any serious commits in years. I'm just curious what the roadmap for the Spark adapter is.

-JD

> On Jun 20, 2017, at 1:29 PM, Josh Elser <el...@apache.org> wrote:
> 
> Hi JD,
> 
> I don't have a lot of expertise to comment definitively, but let me try to give you a response with what limited knowledge I do have :)
> 
> I would venture a guess that the calcite-druid adapter could be extended to push more operations down to Druid in order to do less in the Calcite layer itself.
> 
> I'm not aware of any sort of "production ready" example of using Spark as the execution engine, either. Maybe there are some examples people have elsewhere -- asking that question on this mailing list as you have is a good way to get help. Hopefully someone with more experience than I have can comment!
> 
> - Josh
> 
> On 6/19/17 10:36 PM, JD Zheng wrote:
>> Hi,
>> We are using the Calcite Druid adapter to query data stored in Druid, but many operations are not pushed down to Druid, and the built-in enumerable execution engine becomes the bottleneck for the queries that are not pushed down. Since we also have use cases that join data in Druid with outside data sources, using the Spark execution engine seems to be the way to go.
>> Has anyone used the Spark adapter to make Spark the Calcite execution engine? How mature is the Spark adapter? Is there any documentation on how to use it? How does it compare to using Hive?
>> Thanks,
>> -JD


Re: Use spark as Calcite execution engine

Posted by Josh Elser <el...@apache.org>.
Hi JD,

I don't have a lot of expertise to comment definitively, but let me try 
to give you a response with what limited knowledge I do have :)

I would venture a guess that the calcite-druid adapter could be extended 
to push more operations down to Druid in order to do less in the Calcite 
layer itself.
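As a concrete illustration of what pushdown buys: a filter evaluated inside Druid ships only matching rows, while the same filter evaluated in the Calcite layer forces everything over the wire first. A sketch of a Druid native query with a pushed-down filter, written in Python for brevity (the JSON shape follows Druid's documented "scan" query and "selector" filter, but this is an approximation, not code from the Calcite Druid adapter):

```python
# Toy illustration of filter pushdown into a Druid native query.
def scan_query(datasource, interval, dimension=None, value=None):
    """Build a Druid scan-style query dict; when a dimension/value
    pair is given, attach it as a pushed-down selector filter."""
    query = {
        "queryType": "scan",
        "dataSource": datasource,
        "intervals": [interval],
    }
    if dimension is not None:
        # Pushed-down filter: Druid evaluates this per segment and
        # returns only matching rows, instead of the full scan that
        # a client-side (Calcite-layer) filter would require.
        query["filter"] = {
            "type": "selector",
            "dimension": dimension,
            "value": value,
        }
    return query
```

The more of the plan the adapter can express this way (filters, projections, aggregations), the less work is left for the enumerable engine.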

I'm not aware of any sort of "production ready" example of using Spark 
as the execution engine, either. Maybe there are some examples people 
have elsewhere -- asking that question on this mailing list as you have 
is a good way to get help. Hopefully someone with more experience than I 
have can comment!

- Josh

On 6/19/17 10:36 PM, JD Zheng wrote:
> Hi,
> 
> We are using the Calcite Druid adapter to query data stored in Druid, but many operations are not pushed down to Druid, and the built-in enumerable execution engine becomes the bottleneck for the queries that are not pushed down. Since we also have use cases that join data in Druid with outside data sources, using the Spark execution engine seems to be the way to go.
> 
> Has anyone used the Spark adapter to make Spark the Calcite execution engine? How mature is the Spark adapter? Is there any documentation on how to use it? How does it compare to using Hive?
> 
> Thanks,
> -JD
>