Posted to dev@calcite.apache.org by Eli Levine <el...@gmail.com> on 2016/11/01 18:35:40 UTC

Re: Calcite with Phoenix and Spark

Thank you for the pointers, Julian and James! I have a requirement that the
main execution engine be fault-tolerant, and at this point the main
contenders are Pig and Spark. Drillix is great as a source of example usages
of Calcite, so it will definitely be useful.

And yes, the hope is to contribute any Spark and/or Pig adapter code that
gets developed to Calcite.

Eli


On Sat, Oct 22, 2016 at 9:56 PM, Julian Hyde <jh...@apache.org> wrote:

> Well, to correct James slightly, there is SOME support for Spark in
> Calcite, but it’s fair to say that it hasn’t had much love. If you would
> like to get something working then Drillix (Drill + Phoenix + Calcite) is
> the way to go.
>
> That said, Spark is an excellent and hugely popular execution environment,
> so I would very much like to improve the Spark adapter. A few people on
> this list have talked about that over the past couple of months. If you
> would like to join that effort, it would be most welcome, but there’s more
> work to be done before you start getting results.
>
> Julian
>
>
> > On Oct 22, 2016, at 4:41 PM, James Taylor <ja...@apache.org>
> wrote:
> >
> > Hi Eli,
> > With the calcite branch of Phoenix you're part way there. I think a good
> > way to approach this would be to create a new set of operators that
> > correspond to Spark operations and the corresponding rules that know when
> > to use them. These could then be costed with the other Phoenix operators
> > at planning time. Spark would work especially well to store intermediate
> > results in more complex queries.
> >
> > Since Spark doesn't integrate natively with Calcite, I think using Spark
> > directly may not get you where you need to go. In the same way, the
> > Phoenix-Spark integration is higher level, built on top of Phoenix and
> > has no direct integration with Calcite.
> >
> > Another alternative to consider would be using Drillix (Drill + Phoenix)
> > which uses Calcite underneath[1].
> >
> > Thanks,
> > James
> >
> > [1] https://apurtell.s3.amazonaws.com/phoenix/Drillix+Combined+Operational+%26+Analytical+SQL+at+Scale.pdf
> >
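A rough sketch of the operator-plus-rule pairing James describes: register a
Spark calling convention so the planner can cost Spark and Phoenix
alternatives side by side. SparkRel, SparkFilter and SparkFilterRule below
are hypothetical names, not existing Calcite or Phoenix API; Convention,
ConverterRule and LogicalFilter are real Calcite classes.

    import org.apache.calcite.plan.Convention;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.convert.ConverterRule;
    import org.apache.calcite.rel.logical.LogicalFilter;

    /** Calling convention for operators executed on Spark (hypothetical). */
    interface SparkRel extends RelNode {
      Convention SPARK = new Convention.Impl("SPARK", SparkRel.class);
    }

    /** Offers a Spark implementation of LogicalFilter; the planner then
     * chooses between this and the Phoenix alternative based on cost. */
    class SparkFilterRule extends ConverterRule {
      SparkFilterRule() {
        super(LogicalFilter.class, Convention.NONE, SparkRel.SPARK,
            "SparkFilterRule");
      }
      @Override public RelNode convert(RelNode rel) {
        final LogicalFilter filter = (LogicalFilter) rel;
        // SparkFilter (hypothetical) would evaluate the predicate as a
        // Spark transformation over its converted input.
        return new SparkFilter(filter.getCluster(),
            filter.getTraitSet().replace(SparkRel.SPARK),
            convert(filter.getInput(), SparkRel.SPARK),
            filter.getCondition());
      }
    }

Because both alternatives end up registered in the same planner, the cheaper
one wins at planning time, which is the costing James refers to.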
> > On Sat, Oct 22, 2016 at 1:02 PM, Eli Levine <el...@gmail.com> wrote:
> >
> >> Greetings, Calcite devs. First of all, thank you for your work on
> >> Calcite!
> >>
> >> I am working on a federated query engine that will use Spark (or
> >> something similar) as the main execution engine. Among other data
> >> sources the query engine will read from Apache Phoenix tables/views.
> >> The hope is to utilize Calcite as the query planner and optimizer
> >> component of this query engine.
> >>
> >> At a high level, I am trying to build the following using Calcite:
> >> 1. Generate a relational algebra expression tree using RelBuilder based
> >> on user input. I plan to implement custom schema and table classes
> >> based on my metadata.
> >> 2. Provide Calcite with query optimization rules.
> >> 3. Traverse the optimized expression tree to generate a set of Spark
> >> instructions.
> >> 4. Execute query instructions via Spark.
> >>
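To make steps 1 and 3 above concrete, a minimal sketch under assumed names
(the ORDERS table, the TENANT_ID and NAME columns, and the Phoenix-backed
schema are invented for illustration; RelBuilder, Frameworks and RelVisitor
are existing Calcite classes):

    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.RelVisitor;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.sql.fun.SqlStdOperatorTable;
    import org.apache.calcite.tools.FrameworkConfig;
    import org.apache.calcite.tools.Frameworks;
    import org.apache.calcite.tools.RelBuilder;

    class PlanSketch {
      // Step 1: build SELECT NAME FROM ORDERS WHERE TENANT_ID = 'acme'.
      static RelNode build(SchemaPlus phoenixSchema) {
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(phoenixSchema)  // custom schema over Phoenix metadata
            .build();
        RelBuilder b = RelBuilder.create(config);
        return b.scan("ORDERS")
            .filter(b.call(SqlStdOperatorTable.EQUALS,
                b.field("TENANT_ID"), b.literal("acme")))
            .project(b.field("NAME"))
            .build();
      }

      // Step 3: walk the optimized tree, mapping each node to a Spark step.
      static void toSpark(RelNode plan) {
        new RelVisitor() {
          @Override public void visit(RelNode node, int ordinal, RelNode parent) {
            // e.g. Filter -> rdd.filter(...), Project -> rdd.map(...)
            super.visit(node, ordinal, parent);
          }
        }.go(plan);
      }
    }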
> >> A few questions regarding the above:
> >> 1. Are there existing examples of code that does #3 above? I looked at
> >> the Spark submodule and it seems pretty bare-bones. What would be great
> >> to see is an example of a RelNode tree being traversed to create a plan
> >> for asynchronous execution via something like Spark or Pig.
> >> 2. An important query optimization that is planned initially is to be
> >> able to push down simple filters to Phoenix (the plan is to use
> >> Phoenix-Spark <http://phoenix.apache.org/phoenix_spark.html>
> >> integration for reading data). Any examples of such push-downs to
> >> specific data sources in a federated query scenario would be much
> >> appreciated.
> >>
> >> Thank you! Looking forward to working with the Calcite community.
> >>
> >> -------------
> >> Eli Levine
> >> Software Engineering Architect -- Salesforce.com
> >>
>
>
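On question 2, a filter push-down is usually written as a planner rule that
matches a Filter on top of a scan and rewrites the pair into a scan that
evaluates the predicate at the source. A sketch, with PhoenixFilteredScan as
a hypothetical operator (RelOptRule, Filter and TableScan are existing
Calcite classes):

    import org.apache.calcite.plan.RelOptRule;
    import org.apache.calcite.plan.RelOptRuleCall;
    import org.apache.calcite.rel.core.Filter;
    import org.apache.calcite.rel.core.TableScan;

    class PhoenixFilterPushDownRule extends RelOptRule {
      PhoenixFilterPushDownRule() {
        super(operand(Filter.class, operand(TableScan.class, none())),
            "PhoenixFilterPushDownRule");
      }
      @Override public void onMatch(RelOptRuleCall call) {
        final Filter filter = call.rel(0);
        final TableScan scan = call.rel(1);
        // A real rule would first check that the predicate is simple enough
        // for Phoenix to evaluate (e.g. comparisons on leading PK columns).
        // PhoenixFilteredScan is hypothetical: a scan that carries the
        // pushed-down condition into the query it sends to Phoenix.
        call.transformTo(new PhoenixFilteredScan(scan, filter.getCondition()));
      }
    }

The calcite branch of Phoenix that James mentions defines its own scan
operators, so in practice the rewrite target would presumably be whatever
operator that branch provides.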

Re: Calcite with Phoenix and Spark

Posted by Eli Levine <el...@gmail.com>.
It's a fairly loose term. For us it generally means being able to recover
from node failures without having to rerun the process from the beginning.
MapReduce and Spark both fall broadly into that category.

Thanks,

Eli


On Tue, Nov 1, 2016 at 11:46 AM, James Taylor <ja...@apache.org>
wrote:

> Eli,
> Can you define what you mean by "fault-tolerant"? Phoenix+HBase are fault
> tolerant through the retries that HBase does.
> Thanks,
> James

Re: Calcite with Phoenix and Spark

Posted by James Taylor <ja...@apache.org>.
Eli,
Can you define what you mean by "fault-tolerant"? Phoenix+HBase are fault
tolerant through the retries that HBase does.
Thanks,
James


Re: Calcite with Phoenix and Spark

Posted by Daniel Dai <da...@gmail.com>.
If you want to go down the Pig adapter path, I will help on the Pig side.

Thanks,
Daniel


Re: Calcite with Phoenix and Spark

Posted by Eli Levine <el...@gmail.com>.
Will follow your suggested model when I start development. Thanks for
offering to potentially include that work in Calcite, Julian.

Eli



Re: Calcite with Phoenix and Spark

Posted by Julian Hyde <jh...@apache.org>.
If it helps make your “hope” a bit more likely to happen, you should consider doing your Spark or Pig adapters in the Calcite code base, that is, as a fork of the Calcite repo on GitHub from which you periodically submit pull requests.  I would welcome that development model. For big, important features like this, I am comfortable including alpha or beta quality code in the Calcite release.

If you do the work as part of the Calcite project, almost certainly other developers will want to help out. You’ll do less work yourself, and end up with a more robust result.

I am Cc:ing Daniel Dai. He and I have talked about a Pig adapter for Calcite in the past. If you decide to go that route, Daniel may be able to help out.

Julian
 