Posted to user@spark.apache.org by newroyker <ne...@gmail.com> on 2020/01/13 14:24:49 UTC

Why Apache Spark doesn't use Calcite?

Was there a qualitative or quantitative benchmark done before a design
decision was made not to use Calcite? 

Are there limitations (for heuristic-based, cost-based, or *-aware optimizers)
in Calcite, or in frameworks built on top of Calcite, in the context of big
data / TPC-H benchmarks?

I was unable to dig up anything concrete from the user group or Jira. I'd
appreciate it if any Catalyst veteran here could give me pointers. I'm trying
to defend Spark/Catalyst.

Re: Why Apache Spark doesn't use Calcite?

Posted by Debajyoti Roy <ne...@gmail.com>.
Thanks Xiao. A more up-to-date publication in a conference like VLDB would
certainly turn the tide for many of us trying to defend Spark's optimizer.

Re: Why Apache Spark doesn't use Calcite?

Posted by Xiao Li <ga...@gmail.com>.
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query
Execution in Catalyst. It can adjust plans based on runtime statistics. Based
on my understanding, this is missing in Calcite.

Catalyst is also very easy to enhance. We use a dynamic programming approach
in our cost-based join reordering, and if needed we can improve the existing
CBO in the future and make it more general. The Spark SQL paper was published
5 years ago; a lot of great contributions have been made in the past 5 years.

Cheers,

Xiao

Cheers,

Xiao
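
As an illustration of the features Xiao describes, here is a minimal Scala
sketch of the relevant Spark 3.0 configuration. The config keys are the
standard Spark SQL ones; the application name, master, and table name are
hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("aqe-cbo-sketch")   // hypothetical app name
      .master("local[*]")
      // Adaptive Query Execution: re-optimizes plans using runtime statistics.
      .config("spark.sql.adaptive.enabled", "true")
      // Cost-based optimizer, including dynamic-programming join reordering.
      .config("spark.sql.cbo.enabled", "true")
      .config("spark.sql.cbo.joinReorder.enabled", "true")
      .getOrCreate()

    // The CBO relies on statistics collected ahead of time
    // ("sales" is a hypothetical table):
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")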

Re: Why Apache Spark doesn't use Calcite?

Posted by Debajyoti Roy <ne...@gmail.com>.
Thanks all, and Matei.

TL;DR of the conclusion for my particular case:
Qualitatively, while Catalyst[1] tries to mitigate the learning curve and
maintenance burden, it lacks the dynamic programming approach used by
Calcite[2] and risks falling into local minima.
Quantitatively, there is no reproducible benchmark that fairly compares
optimizer frameworks apples to apples (excluding execution).

References:
[1] -
https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf
[2] - https://arxiv.org/pdf/1802.10233.pdf

Re: Why Apache Spark doesn't use Calcite?

Posted by Matei Zaharia <ma...@gmail.com>.
I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations about how Catalyst works).

Matei
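
To make Matei's conciseness point concrete, here is a sketch of a Catalyst
rule in the spirit of the examples in the Spark SQL paper. It is illustrative
only: expression constructor signatures have changed across Spark versions,
and Spark already ships a more general ConstantFolding rule.

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Fold the addition of two integer literals into a single literal.
    object FoldIntegerAdd extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Add(Literal(a: Int, dt), Literal(b: Int, _)) => Literal(a + b, dt)
      }
    }

Pattern matching over the tree plus a partial function is essentially the
whole rule; this is the kind of brevity that is hard to get in Java.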


Re: Why Apache Spark doesn't use Calcite?

Posted by Michael Mior <mm...@apache.org>.
It's fairly common for adapters (Calcite's abstraction of a data
source) to push down predicates. However, the API certainly looks a
lot different than Catalyst's.
--
Michael Mior
mmior@apache.org
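
For context, a hedged Scala sketch of the adapter-side pushdown Michael
describes, written against Calcite's FilterableTable interface. The table
name and contents are hypothetical, and a real adapter would translate the
RexNode filters into source-native predicates and remove the ones it handled.

    import java.util.{List => JList}
    import org.apache.calcite.DataContext
    import org.apache.calcite.linq4j.{Enumerable, Linq4j}
    import org.apache.calcite.rel.`type`.{RelDataType, RelDataTypeFactory}
    import org.apache.calcite.rex.RexNode
    import org.apache.calcite.schema.FilterableTable
    import org.apache.calcite.schema.impl.AbstractTable
    import org.apache.calcite.sql.`type`.SqlTypeName

    // A one-column table; Calcite passes WHERE conjuncts to scan() as RexNodes.
    class YearsTable extends AbstractTable with FilterableTable {
      override def getRowType(typeFactory: RelDataTypeFactory): RelDataType =
        typeFactory.builder().add("YEAR", SqlTypeName.INTEGER).build()

      override def scan(root: DataContext,
                        filters: JList[RexNode]): Enumerable[Array[AnyRef]] = {
        // A real adapter would evaluate `filters` at the source and remove
        // the ones it fully handled; this sketch returns all rows unfiltered.
        Linq4j.asEnumerable(java.util.Arrays.asList(
          Array[AnyRef](Integer.valueOf(2019)),
          Array[AnyRef](Integer.valueOf(2020))))
      }
    }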


Re: Why Apache Spark doesn't use Calcite?

Posted by Jason Nerothin <ja...@gmail.com>.
The implementation they chose supports predicate pushdown, Datasets, and
other features that are not available in Calcite:

https://databricks.com/glossary/catalyst-optimizer

-- 
Thanks,
Jason
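
One quick way to see Spark's predicate pushdown from the user side; a
self-contained sketch, with a hypothetical Parquet path:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical dataset; any source that supports pushdown behaves the same.
    val events = spark.read.parquet("/data/events")

    // explain() prints the physical plan; the scan node includes a
    // PushedFilters list showing the predicate was pushed into the source.
    events.filter($"year" === 2020).explain()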

Re: Why Apache Spark doesn't use Calcite?

Posted by Julian Hyde <jh...@apache.org>.
In the earliest days they had Shark (a Spark back-end hacked into Hive)[1]. So, they knew some people would want to use SQL. But I don’t think anyone realized how important SQL would become.

I think they knew what they were getting with Catalyst. They wanted to make it easy to write transformation rules. Because Catalyst is written in Scala, transformation rules can be written very concisely. But Catalyst rules fire destructively; this makes them simpler to write, but it prevents the Volcano-style nondeterminism that allows true cost-based optimization.

Spark was, at that time, a project with huge momentum and lots of talented people who were itching to write a query optimizer. In that situation you can move faster if you build it yourself.

Julian

[1] https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
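
A toy sketch of the distinction Julian draws, deliberately using neither
Spark nor Calcite APIs (all names and the cost model are illustrative):

    sealed trait Plan { def cost: Double }
    case class Scan(table: String, cost: Double) extends Plan
    case class Join(left: Plan, right: Plan) extends Plan {
      // Toy cost model in which operand order matters.
      def cost: Double = 2 * left.cost + right.cost
    }

    // Destructive (Catalyst-style) firing: the rewrite replaces the tree
    // outright, so a locally plausible choice can never be backed out.
    def destructiveCommute(p: Plan): Plan = p match {
      case Join(l, r) => Join(r, l)
      case other      => other
    }

    // Volcano-style search keeps both alternatives and picks by cost,
    // which is what enables true cost-based optimization.
    def volcanoCommute(p: Plan): Plan = p match {
      case j @ Join(l, r) => Seq(j, Join(r, l)).minBy(_.cost)
      case other          => other
    }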


Re: Why Apache Spark doesn't use Calcite?

Posted by Muhammad Gelbana <mg...@apache.org>.
Interesting question.

Someone told me Spark didn't start (~2012) with support for SQL queries
(introduced ~2014) in mind, probably only Python-based jobs, so Catalyst was
enough then. That makes sense to me, but I can't confirm it.


Fwd: Why Apache Spark doesn't use Calcite?

Posted by Michael Mior <mm...@apache.org>.
This discussion on the Spark mailing list may be interesting to follow :)

--
Michael Mior
mmior@apache.org

