You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by Julian Hyde <jh...@apache.org> on 2017/12/15 18:26:53 UTC

Recursive query, graph query, Datalog

I've been thinking about Datalog front end to Calcite. How much effort
would it be? Would there be an audience who would find it useful?

The genesis of the idea is talks by Frank McSherry[1] and Vasia
Kalavri[2] about graph queries and in particular Timely
Dataflow[3][4], and a paper about using Datalog for graph processing
[5].

A few observations:
* Graph queries require repeated (unbounded) joins, and so are beyond SQL:92.
* Datalog allows recursion, and can therefore compute things like
transitive closure, and is therefore powerful enough for graph
queries.
* SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
queries. However, for a variety of reasons, people tend not to use SQL
for querying graph databases.
* Datalog is more than just recursive queries. For instance, it is
popular with academics analyzing the power/complexity of languages.
And it can do deductive queries.
* There are different "strengths" of Datalog. Some variants support
only linearizable recursion; most variants only support queries whose
results are stratified (i.e. can be defined using well-founded
induction, necessary when negations are involved).
* The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
discussed how they can handle graph queries and iterative computation.
However they have mainly focused on improvements to their engine and
physical algebra, not looked at logical algebra or query language.
* If Calcite's algebra could express deductive query / recursive query
/ iteration, then not only would Datalog be possible, but also SQL's
WITH RECURSIVE, and also SPARQL.

So, questions on my mind:
* What new operator(s) would we add to Calcite's algebra to enable
recursive query?
* What optimization rules are possible/necessary for graph queries?
* How much effort would it be to add a Datalog parser to Calcite and
translate to Calcite's algebra?

Julian

[1] http://www.dataengconf.com/scalability-but-at-what-cost

[2] https://qconsf.com/sf2017/speakers/vasia-kalavri

[3] https://github.com/frankmcsherry/timely-dataflow

[4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf

[5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf

Re: Recursive query, graph query, Datalog

Posted by Julian Hyde <jh...@apache.org>.


On 2017-12-15 10:57, Riccardo Tommasini <ri...@polimi.it> wrote: 
> A final remark on efficiency. All the mentioned features are very interesting but not computationally nice. Idk whatâs calcite position on this, but in a big data community i think we should be careful. SPARQL did some crazy stuff and is still paying them (see gutierrez papers)

I don't worry too much about efficiency per se, but I focus on making the algebra clean enough that we can recognize the simple cases. I would make the algebra simple and moderately powerful at first, and make it more expressive/powerful later only if we have use cases that need it.

It's analogous to joins. Calcite's join operator can express theta-joins, left/right/full outer, semi-joins, and joins over sorted/bucketed data sets, but we can easily recognize an inner equi-join when we see one. Most of the transformation rules are written on inner equi-joins first, and generalized to other kinds of joins when we're sure it's safe. I imagine us taking a similar path with iteration, e.g. when is it safe to push a filter into, and through, an iteration?

Lastly, because it's algebra we take a 'RISC' approach. The "Iterate" operator doesn't have to do everything; we can have other operators such as Aggregate, Filter, Project before, after and inside it.

Re: Recursive query, graph query, Datalog

Posted by Riccardo Tommasini <ri...@polimi.it>.

Hello all,

Let me introduce myself again. I’m Riccardo Tommasini, Politecnico di Milano.
I do research on Stream Reasoning, graph stream processing.
In particular, RDF stream processing and i work with temporal datalog.

This is a quite exciting news.

Going technical, a datalog frontend that stays in the relational algebra world would be quite easy as java datalog parser exists and, for the non recursive fragment, is totally implementable within calcite.
Actually, University of Bolzano is already working on such a process using calcite. I cced Guohui Xiao, a postdoc who might be interested.

Regarding graph specific queries and SPARQL. I’m quite found of the subject and i think there might be some issues to deal with the query planning. I’m not super expert of calcite though but i would be happy to help.

A final remark on efficiency. All the mentioned features are very interesting but not computationally nice. Idk what’s calcite position on this, but in a big data community i think we should be careful. SPARQL did some crazy stuff and is still paying them (see gutierrez papers)

On 15 Dec 2017, 19:26 +0100, Julian Hyde <jh...@apache.org>, wrote:
I've been thinking about Datalog front end to Calcite. How much effort
would it be? Would there be an audience who would find it useful?

The genesis of the idea is talks by Frank McSherry[1] and Vasia
Kalavri[2] about graph queries and in particular Timely
Dataflow[3][4], and a paper about using Datalog for graph processing
[5].

A few observations:
* Graph queries require repeated (unbounded) joins, and so are beyond SQL:92.
* Datalog allows recursion, and can therefore compute things like
transitive closure, and is therefore powerful enough for graph
queries.
* SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
queries. However, for a variety of reasons, people tend not to use SQL
for querying graph databases.
* Datalog is more than just recursive queries. For instance, it is
popular with academics analyzing the power/complexity of languages.
And it can do deductive queries.
* There are different "strengths" of Datalog. Some variants support
only linearizable recursion; most variants only support queries whose
results are stratified (i.e. can be defined using well-founded
induction, necessary when negations are involved).
* The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
discussed how they can handle graph queries and iterative computation.
However they have mainly focused on improvements to their engine and
physical algebra, not looked at logical algebra or query language.
* If Calcite's algebra could express deductive query / recursive query
/ iteration, then not only would Datalog be possible, but also SQL's
WITH RECURSIVE, and also SPARQL.

So, questions on my mind:
* What new operator(s) would we add to Calcite's algebra to enable
recursive query?
* What optimization rules are possible/necessary for graph queries?
* How much effort would it be to add a Datalog parser to Calcite and
translate to Calcite's algebra?

Julian

[1] http://www.dataengconf.com/scalability-but-at-what-cost

[2] https://qconsf.com/sf2017/speakers/vasia-kalavri

[3] https://github.com/frankmcsherry/timely-dataflow

[4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf

[5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf

Re: Recursive query, graph query, Datalog

Posted by Julian Hyde <jh...@apache.org>.

Regarding UNION vs UNION ALL. Sure, let's go with UNION. I could
imagine cases where UNION is not sufficient (e.g. we want to generate
a multiset, or annotate each row with the stratum at which it was
generated) but UNION is good enough at first.

Regarding a parser/AST, did you see Riccardo's email in this thread[1]
mentioning the datalog parser at the University of Bolzano.

Julian

[1] https://lists.apache.org/thread.html/e80629a16883ce32b07be0e14002aef6734a68913fe420f6edddfc17@%3Cdev.calcite.apache.org%3E

On Thu, Dec 21, 2017 at 6:42 PM, Walaa Eldin Moustafa
<wa...@gmail.com> wrote:
> Agreed on the high level description of the iterate operator and table
> function. The table function is basically a "combiner" that will combine
> the delta with the result of past iteration somehow.
>
> I would say we need a UINION (versus UNION ALL) operator since we do not
> want to re-add facts that were already inferred in past iteration (in case
> they are re-inferred).
>
> Are you aware of anyone working on the parser/AST? I am giving them a try
> in case someone wants to collaborate.
>
> Thanks,
> Walaa.
>
>
> On Mon, Dec 18, 2017 at 12:02 AM, Julian Hyde <jh...@apache.org> wrote:
>
>> Yes, I agree.
>>
>> My first instinct is to add an Iterate operator whose arguments are (1)
>> the input, (2) a table function that applies to the delta at each
>> iteration. When the table function returns the empty set, iteration stops.
>> The “table function” is not a function per se, but a RelNode tree that
>> references a pseudo-table called “Delta”. Think of it as a relational
>> lambda expression, and the “Delta" is the argument.
>>
>> Intermediate results are combined using UNION ALL. Is this too
>> restrictive? I think maybe not, because you can always add a “finalizer”
>> such as an Aggregate after the Iterate operator.
>>
>> Julian
>>
>>
>> > On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <wa...@gmail.com>
>> wrote:
>> >
>> > Yes, Magic sets is a very important and popular optimization as well. I
>> > guess once we can get a basic notion of recursion as a construct in
>> > Calcite, and get it to work correctly, we can start cracking
>> optimizations.
>> > One thing to note is that the convergence/fixed point depend on the data,
>> > and there is no way to know beforehand what the (complete) plan will look
>> > like (i.e., how many joins). It seems that there must be some sort of
>> > awareness in the host engine of the fact that the query is recursive, and
>> > it should keep iterating till fixed point, or at least tell Calcite if it
>> > converged or not, and if not, Calcite will ask it to keep trying, so
>> every
>> > iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
>> > engine to stop. Do you agree?
>> >
>> >
>> > On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <jh...@apache.org> wrote:
>> >
>> >> (Moving Carl, Shrikanth, Vasanth to bcc.)
>> >>
>> >> Regarding optimizations. One one hand, it is daunting that there so
>> >> many optimizations are required to make graph queries run efficiently,
>> >> but on the other hand, it is good news for the project if those can be
>> >> expressed in relational algebra.
>> >>
>> >> Looking at the previous research, some of the optimizations applied
>> >> are genuinely only possible at run-time, but others should be thought
>> >> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
>> >> can be expressed as a logical operation (very similar to incremental
>> >> view maintenance and streaming, by the way).
>> >>
>> >> (Untangling the capabilities of a particular engine from algebraic
>> >> rewrites is Calcite's gift to the world!)
>> >>
>> >> Another very important logical rewrite is "magic sets"[1], which can
>> >> be modeled as semi-join push-down and done entirely at planning
>> >> time[2] or (if the runtime supports it) as side-ways information
>> >> passing of bloom filters or similar. Magic sets are important for
>> >> graph queries but also very useful for star-schema queries with a
>> >> fixed number of joins.
>> >>
>> >> Julian
>> >>
>> >> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
>> >>
>> >> [2] https://issues.apache.org/jira/browse/CALCITE-468
>> >>
>> >>
>> >> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <eb...@gmail.com>
>> wrote:
>> >>> Great initiative.
>> >>>
>> >>> I will also share some comparative performance studies we did at ORNL
>> on
>> >>> different graph processing engines. Could be useful.
>> >>>
>> >>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
>> >> wa.moustafa@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Hi Julian,
>> >>>>
>> >>>> Thanks for referencing our Datalog query processing paper [5]. I have
>> >> been
>> >>>> thinking about the same idea for a while now too :) I think Calcite is
>> >> very
>> >>>> well positioned as a generic query optimizer to add Datalog/recursive
>> >> query
>> >>>> support. Also, it makes a lot of sense since it opens a completely new
>> >>>> dimension for the kind of logic and queries that Calcite can handle,
>> >>>> including but not limited to graph queries, and that can be
>> immediately
>> >>>> available to engines talking to Calcite.
>> >>>>
>> >>>> To answer your questions, we probably need to add a transitive closure
>> >>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/
>> >
>> >> by
>> >>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an
>> >> alpha
>> >>>> operator on top of them which computes the transitive closure of alpha
>> >>>> relations. The operator fits well with the rest of Cod's relational
>> >> algebra
>> >>>> operators.
>> >>>>
>> >>>> For query optimizations, one of the commonly used Datalog
>> optimizations
>> >> is
>> >>>> Semi-naive evaluation, where instead of re-evaluating the recursive
>> >> program
>> >>>> using all existing facts, only new facts inferred since last iteration
>> >> are
>> >>>> used. Datalog optimizations become much more interesting when
>> >> introducing
>> >>>> aggregation and negation, and it is still an open research question,
>> but
>> >>>> there is already some tangible progress. Otherwise, as you mentioned
>> >>>> transitive closure is repeated joins, so pretty much many of the join
>> >>>> optimizations apply such as predicate pushdown, and aggregation
>> >>>> pushdown/pull up.
>> >>>>
>> >>>> Regarding the effort, we can always start from basic features and
>> expand
>> >>>> from there. I have already started working on the parser, AST and
>> >> logical
>> >>>> plan builder for basic Datalog without recursion. I am happy to
>> create a
>> >>>> JIRA ticket to track this effort there.
>> >>>>
>> >>>> Thanks,
>> >>>> Walaa.
>> >>>>
>> >>>>
>> >>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org>
>> wrote:
>> >>>>
>> >>>>> I've been thinking about Datalog front end to Calcite. How much
>> effort
>> >>>>> would it be? Would there be an audience who would find it useful?
>> >>>>>
>> >>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia
>> >>>>> Kalavri[2] about graph queries and in particular Timely
>> >>>>> Dataflow[3][4], and a paper about using Datalog for graph processing
>> >>>>> [5].
>> >>>>>
>> >>>>> A few observations:
>> >>>>> * Graph queries require repeated (unbounded) joins, and so are beyond
>> >>>>> SQL:92.
>> >>>>> * Datalog allows recursion, and can therefore compute things like
>> >>>>> transitive closure, and is therefore powerful enough for graph
>> >>>>> queries.
>> >>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class
>> of
>> >>>>> queries. However, for a variety of reasons, people tend not to use
>> SQL
>> >>>>> for querying graph databases.
>> >>>>> * Datalog is more than just recursive queries. For instance, it is
>> >>>>> popular with academics analyzing the power/complexity of languages.
>> >>>>> And it can do deductive queries.
>> >>>>> * There are different "strengths" of Datalog. Some variants support
>> >>>>> only linearizable recursion; most variants only support queries whose
>> >>>>> results are stratified (i.e. can be defined using well-founded
>> >>>>> induction, necessary when negations are involved).
>> >>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
>> >>>>> discussed how they can handle graph queries and iterative
>> computation.
>> >>>>> However they have mainly focused on improvements to their engine and
>> >>>>> physical algebra, not looked at logical algebra or query language.
>> >>>>> * If Calcite's algebra could express deductive query / recursive
>> query
>> >>>>> / iteration, then not only would Datalog be possible, but also SQL's
>> >>>>> WITH RECURSIVE, and also SPARQL.
>> >>>>>
>> >>>>> So, questions on my mind:
>> >>>>> * What new operator(s) would we add to Calcite's algebra to enable
>> >>>>> recursive query?
>> >>>>> * What optimization rules are possible/necessary for graph queries?
>> >>>>> * How much effort would it be to add a Datalog parser to Calcite and
>> >>>>> translate to Calcite's algebra?
>> >>>>>
>> >>>>> Julian
>> >>>>>
>> >>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost
>> >>>>>
>> >>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>> >>>>>
>> >>>>> [3] https://github.com/frankmcsherry/timely-dataflow
>> >>>>>
>> >>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>> >>>>>
>> >>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>> >>>>>
>> >>>>
>> >>
>>
>>

Re: Recursive query, graph query, Datalog

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.

Agreed on the high level description of the iterate operator and table
function. The table function is basically a "combiner" that will combine
the delta with the result of past iteration somehow.

I would say we need a UINION (versus UNION ALL) operator since we do not
want to re-add facts that were already inferred in past iteration (in case
they are re-inferred).

Are you aware of anyone working on the parser/AST? I am giving them a try
in case someone wants to collaborate.

Thanks,
Walaa.


On Mon, Dec 18, 2017 at 12:02 AM, Julian Hyde <jh...@apache.org> wrote:

> Yes, I agree.
>
> My first instinct is to add an Iterate operator whose arguments are (1)
> the input, (2) a table function that applies to the delta at each
> iteration. When the table function returns the empty set, iteration stops.
> The “table function” is not a function per se, but a RelNode tree that
> references a pseudo-table called “Delta”. Think of it as a relational
> lambda expression, and the “Delta" is the argument.
>
> Intermediate results are combined using UNION ALL. Is this too
> restrictive? I think maybe not, because you can always add a “finalizer”
> such as an Aggregate after the Iterate operator.
>
> Julian
>
>
> > On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <wa...@gmail.com>
> wrote:
> >
> > Yes, Magic sets is a very important and popular optimization as well. I
> > guess once we can get a basic notion of recursion as a construct in
> > Calcite, and get it to work correctly, we can start cracking
> optimizations.
> > One thing to note is that the convergence/fixed point depend on the data,
> > and there is no way to know beforehand what the (complete) plan will look
> > like (i.e., how many joins). It seems that there must be some sort of
> > awareness in the host engine of the fact that the query is recursive, and
> > it should keep iterating till fixed point, or at least tell Calcite if it
> > converged or not, and if not, Calcite will ask it to keep trying, so
> every
> > iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
> > engine to stop. Do you agree?
> >
> >
> > On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> (Moving Carl, Shrikanth, Vasanth to bcc.)
> >>
> >> Regarding optimizations. One one hand, it is daunting that there so
> >> many optimizations are required to make graph queries run efficiently,
> >> but on the other hand, it is good news for the project if those can be
> >> expressed in relational algebra.
> >>
> >> Looking at the previous research, some of the optimizations applied
> >> are genuinely only possible at run-time, but others should be thought
> >> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
> >> can be expressed as a logical operation (very similar to incremental
> >> view maintenance and streaming, by the way).
> >>
> >> (Untangling the capabilities of a particular engine from algebraic
> >> rewrites is Calcite's gift to the world!)
> >>
> >> Another very important logical rewrite is "magic sets"[1], which can
> >> be modeled as semi-join push-down and done entirely at planning
> >> time[2] or (if the runtime supports it) as side-ways information
> >> passing of bloom filters or similar. Magic sets are important for
> >> graph queries but also very useful for star-schema queries with a
> >> fixed number of joins.
> >>
> >> Julian
> >>
> >> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
> >>
> >> [2] https://issues.apache.org/jira/browse/CALCITE-468
> >>
> >>
> >> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <eb...@gmail.com>
> wrote:
> >>> Great initiative.
> >>>
> >>> I will also share some comparative performance studies we did at ORNL
> on
> >>> different graph processing engines. Could be useful.
> >>>
> >>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
> >> wa.moustafa@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Julian,
> >>>>
> >>>> Thanks for referencing our Datalog query processing paper [5]. I have
> >> been
> >>>> thinking about the same idea for a while now too :) I think Calcite is
> >> very
> >>>> well positioned as a generic query optimizer to add Datalog/recursive
> >> query
> >>>> support. Also, it makes a lot of sense since it opens a completely new
> >>>> dimension for the kind of logic and queries that Calcite can handle,
> >>>> including but not limited to graph queries, and that can be
> immediately
> >>>> available to engines talking to Calcite.
> >>>>
> >>>> To answer your questions, we probably need to add a transitive closure
> >>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/
> >
> >> by
> >>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an
> >> alpha
> >>>> operator on top of them which computes the transitive closure of alpha
> >>>> relations. The operator fits well with the rest of Cod's relational
> >> algebra
> >>>> operators.
> >>>>
> >>>> For query optimizations, one of the commonly used Datalog
> optimizations
> >> is
> >>>> Semi-naive evaluation, where instead of re-evaluating the recursive
> >> program
> >>>> using all existing facts, only new facts inferred since last iteration
> >> are
> >>>> used. Datalog optimizations become much more interesting when
> >> introducing
> >>>> aggregation and negation, and it is still an open research question,
> but
> >>>> there is already some tangible progress. Otherwise, as you mentioned
> >>>> transitive closure is repeated joins, so pretty much many of the join
> >>>> optimizations apply such as predicate pushdown, and aggregation
> >>>> pushdown/pull up.
> >>>>
> >>>> Regarding the effort, we can always start from basic features and
> expand
> >>>> from there. I have already started working on the parser, AST and
> >> logical
> >>>> plan builder for basic Datalog without recursion. I am happy to
> create a
> >>>> JIRA ticket to track this effort there.
> >>>>
> >>>> Thanks,
> >>>> Walaa.
> >>>>
> >>>>
> >>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org>
> wrote:
> >>>>
> >>>>> I've been thinking about Datalog front end to Calcite. How much
> effort
> >>>>> would it be? Would there be an audience who would find it useful?
> >>>>>
> >>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia
> >>>>> Kalavri[2] about graph queries and in particular Timely
> >>>>> Dataflow[3][4], and a paper about using Datalog for graph processing
> >>>>> [5].
> >>>>>
> >>>>> A few observations:
> >>>>> * Graph queries require repeated (unbounded) joins, and so are beyond
> >>>>> SQL:92.
> >>>>> * Datalog allows recursion, and can therefore compute things like
> >>>>> transitive closure, and is therefore powerful enough for graph
> >>>>> queries.
> >>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class
> of
> >>>>> queries. However, for a variety of reasons, people tend not to use
> SQL
> >>>>> for querying graph databases.
> >>>>> * Datalog is more than just recursive queries. For instance, it is
> >>>>> popular with academics analyzing the power/complexity of languages.
> >>>>> And it can do deductive queries.
> >>>>> * There are different "strengths" of Datalog. Some variants support
> >>>>> only linearizable recursion; most variants only support queries whose
> >>>>> results are stratified (i.e. can be defined using well-founded
> >>>>> induction, necessary when negations are involved).
> >>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
> >>>>> discussed how they can handle graph queries and iterative
> computation.
> >>>>> However they have mainly focused on improvements to their engine and
> >>>>> physical algebra, not looked at logical algebra or query language.
> >>>>> * If Calcite's algebra could express deductive query / recursive
> query
> >>>>> / iteration, then not only would Datalog be possible, but also SQL's
> >>>>> WITH RECURSIVE, and also SPARQL.
> >>>>>
> >>>>> So, questions on my mind:
> >>>>> * What new operator(s) would we add to Calcite's algebra to enable
> >>>>> recursive query?
> >>>>> * What optimization rules are possible/necessary for graph queries?
> >>>>> * How much effort would it be to add a Datalog parser to Calcite and
> >>>>> translate to Calcite's algebra?
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost
> >>>>>
> >>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
> >>>>>
> >>>>> [3] https://github.com/frankmcsherry/timely-dataflow
> >>>>>
> >>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
> >>>>>
> >>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
> >>>>>
> >>>>
> >>
>
>

Re: Recursive query, graph query, Datalog

Posted by Julian Hyde <jh...@apache.org>.

Yes, I agree.

My first instinct is to add an Iterate operator whose arguments are (1) the input, (2) a table function that applies to the delta at each iteration. When the table function returns the empty set, iteration stops. The “table function” is not a function per se, but a RelNode tree that references a pseudo-table called “Delta”. Think of it as a relational lambda expression, and the “Delta" is the argument.

Intermediate results are combined using UNION ALL. Is this too restrictive? I think maybe not, because you can always add a “finalizer” such as an Aggregate after the Iterate operator.

Julian


> On Dec 15, 2017, at 3:11 PM, Walaa Eldin Moustafa <wa...@gmail.com> wrote:
> 
> Yes, Magic sets is a very important and popular optimization as well. I
> guess once we can get a basic notion of recursion as a construct in
> Calcite, and get it to work correctly, we can start cracking optimizations.
> One thing to note is that the convergence/fixed point depend on the data,
> and there is no way to know beforehand what the (complete) plan will look
> like (i.e., how many joins). It seems that there must be some sort of
> awareness in the host engine of the fact that the query is recursive, and
> it should keep iterating till fixed point, or at least tell Calcite if it
> converged or not, and if not, Calcite will ask it to keep trying, so every
> iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
> engine to stop. Do you agree?
> 
> 
> On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <jh...@apache.org> wrote:
> 
>> (Moving Carl, Shrikanth, Vasanth to bcc.)
>> 
>> Regarding optimizations. One one hand, it is daunting that there so
>> many optimizations are required to make graph queries run efficiently,
>> but on the other hand, it is good news for the project if those can be
>> expressed in relational algebra.
>> 
>> Looking at the previous research, some of the optimizations applied
>> are genuinely only possible at run-time, but others should be thought
>> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
>> can be expressed as a logical operation (very similar to incremental
>> view maintenance and streaming, by the way).
>> 
>> (Untangling the capabilities of a particular engine from algebraic
>> rewrites is Calcite's gift to the world!)
>> 
>> Another very important logical rewrite is "magic sets"[1], which can
>> be modeled as semi-join push-down and done entirely at planning
>> time[2] or (if the runtime supports it) as side-ways information
>> passing of bloom filters or similar. Magic sets are important for
>> graph queries but also very useful for star-schema queries with a
>> fixed number of joins.
>> 
>> Julian
>> 
>> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
>> 
>> [2] https://issues.apache.org/jira/browse/CALCITE-468
>> 
>> 
>> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <eb...@gmail.com> wrote:
>>> Great initiative.
>>> 
>>> I will also share some comparative performance studies we did at ORNL on
>>> different graph processing engines. Could be useful.
>>> 
>>> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
>> wa.moustafa@gmail.com>
>>> wrote:
>>> 
>>>> Hi Julian,
>>>> 
>>>> Thanks for referencing our Datalog query processing paper [5]. I have
>> been
>>>> thinking about the same idea for a while now too :) I think Calcite is
>> very
>>>> well positioned as a generic query optimizer to add Datalog/recursive
>> query
>>>> support. Also, it makes a lot of sense since it opens a completely new
>>>> dimension for the kind of logic and queries that Calcite can handle,
>>>> including but not limited to graph queries, and that can be immediately
>>>> available to engines talking to Calcite.
>>>> 
>>>> To answer your questions, we probably need to add a transitive closure
>>>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/>
>> by
>>>> Rakesh Agrawal proposes the notion of alpha relations, and defines an
>> alpha
>>>> operator on top of them which computes the transitive closure of alpha
>>>> relations. The operator fits well with the rest of Cod's relational
>> algebra
>>>> operators.
>>>> 
>>>> For query optimizations, one of the commonly used Datalog optimizations
>> is
>>>> Semi-naive evaluation, where instead of re-evaluating the recursive
>> program
>>>> using all existing facts, only new facts inferred since last iteration
>> are
>>>> used. Datalog optimizations become much more interesting when
>> introducing
>>>> aggregation and negation, and it is still an open research question, but
>>>> there is already some tangible progress. Otherwise, as you mentioned
>>>> transitive closure is repeated joins, so pretty much many of the join
>>>> optimizations apply such as predicate pushdown, and aggregation
>>>> pushdown/pull up.
>>>> 
>>>> Regarding the effort, we can always start from basic features and expand
>>>> from there. I have already started working on the parser, AST and
>> logical
>>>> plan builder for basic Datalog without recursion. I am happy to create a
>>>> JIRA ticket to track this effort there.
>>>> 
>>>> Thanks,
>>>> Walaa.
>>>> 
>>>> 
>>>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org> wrote:
>>>> 
>>>>> I've been thinking about Datalog front end to Calcite. How much effort
>>>>> would it be? Would there be an audience who would find it useful?
>>>>> 
>>>>> The genesis of the idea is talks by Frank McSherry[1] and Vasia
>>>>> Kalavri[2] about graph queries and in particular Timely
>>>>> Dataflow[3][4], and a paper about using Datalog for graph processing
>>>>> [5].
>>>>> 
>>>>> A few observations:
>>>>> * Graph queries require repeated (unbounded) joins, and so are beyond
>>>>> SQL:92.
>>>>> * Datalog allows recursion, and can therefore compute things like
>>>>> transitive closure, and is therefore powerful enough for graph
>>>>> queries.
>>>>> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
>>>>> queries. However, for a variety of reasons, people tend not to use SQL
>>>>> for querying graph databases.
>>>>> * Datalog is more than just recursive queries. For instance, it is
>>>>> popular with academics analyzing the power/complexity of languages.
>>>>> And it can do deductive queries.
>>>>> * There are different "strengths" of Datalog. Some variants support
>>>>> only linearizable recursion; most variants only support queries whose
>>>>> results are stratified (i.e. can be defined using well-founded
>>>>> induction, necessary when negations are involved).
>>>>> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
>>>>> discussed how they can handle graph queries and iterative computation.
>>>>> However they have mainly focused on improvements to their engine and
>>>>> physical algebra, not looked at logical algebra or query language.
>>>>> * If Calcite's algebra could express deductive query / recursive query
>>>>> / iteration, then not only would Datalog be possible, but also SQL's
>>>>> WITH RECURSIVE, and also SPARQL.
>>>>> 
>>>>> So, questions on my mind:
>>>>> * What new operator(s) would we add to Calcite's algebra to enable
>>>>> recursive query?
>>>>> * What optimization rules are possible/necessary for graph queries?
>>>>> * How much effort would it be to add a Datalog parser to Calcite and
>>>>> translate to Calcite's algebra?
>>>>> 
>>>>> Julian
>>>>> 
>>>>> [1] http://www.dataengconf.com/scalability-but-at-what-cost
>>>>> 
>>>>> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>>>>> 
>>>>> [3] https://github.com/frankmcsherry/timely-dataflow
>>>>> 
>>>>> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>>>>> 
>>>>> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>>>>> 
>>>> 
>>

Re: Recursive query, graph query, Datalog

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.

Yes, Magic sets is a very important and popular optimization as well. I
guess once we can get a basic notion of recursion as a construct in
Calcite, and get it to work correctly, we can start cracking optimizations.
One thing to note is that the convergence/fixed point depend on the data,
and there is no way to know beforehand what the (complete) plan will look
like (i.e., how many joins). It seems that there must be some sort of
awareness in the host engine of the fact that the query is recursive, and
it should keep iterating till fixed point, or at least tell Calcite if it
converged or not, and if not, Calcite will ask it to keep trying, so every
iteration Calcite sends a traditional (non-recursive) RA plan, or ask the
engine to stop. Do you agree?


On Fri, Dec 15, 2017 at 12:06 PM, Julian Hyde <jh...@apache.org> wrote:

> (Moving Carl, Shrikanth, Vasanth to bcc.)
>
> Regarding optimizations. One one hand, it is daunting that there so
> many optimizations are required to make graph queries run efficiently,
> but on the other hand, it is good news for the project if those can be
> expressed in relational algebra.
>
> Looking at the previous research, some of the optimizations applied
> are genuinely only possible at run-time, but others should be thought
> of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
> can be expressed as a logical operation (very similar to incremental
> view maintenance and streaming, by the way).
>
> (Untangling the capabilities of a particular engine from algebraic
> rewrites is Calcite's gift to the world!)
>
> Another very important logical rewrite is "magic sets"[1], which can
> be modeled as semi-join push-down and done entirely at planning
> time[2] or (if the runtime supports it) as side-ways information
> passing of bloom filters or similar. Magic sets are important for
> graph queries but also very useful for star-schema queries with a
> fixed number of joins.
>
> Julian
>
> [1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf
>
> [2] https://issues.apache.org/jira/browse/CALCITE-468
>
>
> On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <eb...@gmail.com> wrote:
> > Great initiative.
> >
> > I will also share some comparative performance studies we did at ORNL on
> > different graph processing engines. Could be useful.
> >
> > On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <
> wa.moustafa@gmail.com>
> > wrote:
> >
> >> Hi Julian,
> >>
> >> Thanks for referencing our Datalog query processing paper [5]. I have
> been
> >> thinking about the same idea for a while now too :) I think Calcite is
> very
> >> well positioned as a generic query optimizer to add Datalog/recursive
> query
> >> support. Also, it makes a lot of sense since it opens a completely new
> >> dimension for the kind of logic and queries that Calcite can handle,
> >> including but not limited to graph queries, and that can be immediately
> >> available to engines talking to Calcite.
> >>
> >> To answer your questions, we probably need to add a transitive closure
> >> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/>
> by
> >> Rakesh Agrawal proposes the notion of alpha relations, and defines an
> alpha
> >> operator on top of them which computes the transitive closure of alpha
> >> relations. The operator fits well with the rest of Cod's relational
> algebra
> >> operators.
> >>
> >> For query optimizations, one of the commonly used Datalog optimizations
> is
> >> Semi-naive evaluation, where instead of re-evaluating the recursive
> program
> >> using all existing facts, only new facts inferred since last iteration
> are
> >> used. Datalog optimizations become much more interesting when
> introducing
> >> aggregation and negation, and it is still an open research question, but
> >> there is already some tangible progress. Otherwise, as you mentioned
> >> transitive closure is repeated joins, so pretty much many of the join
> >> optimizations apply such as predicate pushdown, and aggregation
> >> pushdown/pull up.
> >>
> >> Regarding the effort, we can always start from basic features and expand
> >> from there. I have already started working on the parser, AST and
> logical
> >> plan builder for basic Datalog without recursion. I am happy to create a
> >> JIRA ticket to track this effort there.
> >>
> >> Thanks,
> >> Walaa.
> >>
> >>
> >> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org> wrote:
> >>
> >> > I've been thinking about Datalog front end to Calcite. How much effort
> >> > would it be? Would there be an audience who would find it useful?
> >> >
> >> > The genesis of the idea is talks by Frank McSherry[1] and Vasia
> >> > Kalavri[2] about graph queries and in particular Timely
> >> > Dataflow[3][4], and a paper about using Datalog for graph processing
> >> > [5].
> >> >
> >> > A few observations:
> >> > * Graph queries require repeated (unbounded) joins, and so are beyond
> >> > SQL:92.
> >> > * Datalog allows recursion, and can therefore compute things like
> >> > transitive closure, and is therefore powerful enough for graph
> >> > queries.
> >> > * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
> >> > queries. However, for a variety of reasons, people tend not to use SQL
> >> > for querying graph databases.
> >> > * Datalog is more than just recursive queries. For instance, it is
> >> > popular with academics analyzing the power/complexity of languages.
> >> > And it can do deductive queries.
> >> > * There are different "strengths" of Datalog. Some variants support
> >> > only linearizable recursion; most variants only support queries whose
> >> > results are stratified (i.e. can be defined using well-founded
> >> > induction, necessary when negations are involved).
> >> > * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
> >> > discussed how they can handle graph queries and iterative computation.
> >> > However they have mainly focused on improvements to their engine and
> >> > physical algebra, not looked at logical algebra or query language.
> >> > * If Calcite's algebra could express deductive query / recursive query
> >> > / iteration, then not only would Datalog be possible, but also SQL's
> >> > WITH RECURSIVE, and also SPARQL.
> >> >
> >> > So, questions on my mind:
> >> > * What new operator(s) would we add to Calcite's algebra to enable
> >> > recursive query?
> >> > * What optimization rules are possible/necessary for graph queries?
> >> > * How much effort would it be to add a Datalog parser to Calcite and
> >> > translate to Calcite's algebra?
> >> >
> >> > Julian
> >> >
> >> > [1] http://www.dataengconf.com/scalability-but-at-what-cost
> >> >
> >> > [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
> >> >
> >> > [3] https://github.com/frankmcsherry/timely-dataflow
> >> >
> >> > [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
> >> >
> >> > [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
> >> >
> >>
>

Re: Recursive query, graph query, Datalog

Posted by Julian Hyde <jh...@apache.org>.

(Moving Carl, Shrikanth, Vasanth to bcc.)

Regarding optimizations. One one hand, it is daunting that there so
many optimizations are required to make graph queries run efficiently,
but on the other hand, it is good news for the project if those can be
expressed in relational algebra.

Looking at the previous research, some of the optimizations applied
are genuinely only possible at run-time, but others should be thought
of as logical rewrites. Semi-naive evaluation, which Walaa mentions,
can be expressed as a logical operation (very similar to incremental
view maintenance and streaming, by the way).

(Untangling the capabilities of a particular engine from algebraic
rewrites is Calcite's gift to the world!)

Another very important logical rewrite is "magic sets"[1], which can
be modeled as semi-join push-down and done entirely at planning
time[2] or (if the runtime supports it) as side-ways information
passing of bloom filters or similar. Magic sets are important for
graph queries but also very useful for star-schema queries with a
fixed number of joins.

Julian

[1] http://db.cs.berkeley.edu/papers/sigmod96-magic.pdf

[2] https://issues.apache.org/jira/browse/CALCITE-468


On Fri, Dec 15, 2017 at 11:21 AM, Edmon Begoli <eb...@gmail.com> wrote:
> Great initiative.
>
> I will also share some comparative performance studies we did at ORNL on
> different graph processing engines. Could be useful.
>
> On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <wa...@gmail.com>
> wrote:
>
>> Hi Julian,
>>
>> Thanks for referencing our Datalog query processing paper [5]. I have been
>> thinking about the same idea for a while now too :) I think Calcite is very
>> well positioned as a generic query optimizer to add Datalog/recursive query
>> support. Also, it makes a lot of sense since it opens a completely new
>> dimension for the kind of logic and queries that Calcite can handle,
>> including but not limited to graph queries, and that can be immediately
>> available to engines talking to Calcite.
>>
>> To answer your questions, we probably need to add a transitive closure
>> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/> by
>> Rakesh Agrawal proposes the notion of alpha relations, and defines an alpha
>> operator on top of them which computes the transitive closure of alpha
>> relations. The operator fits well with the rest of Cod's relational algebra
>> operators.
>>
>> For query optimizations, one of the commonly used Datalog optimizations is
>> Semi-naive evaluation, where instead of re-evaluating the recursive program
>> using all existing facts, only new facts inferred since last iteration are
>> used. Datalog optimizations become much more interesting when introducing
>> aggregation and negation, and it is still an open research question, but
>> there is already some tangible progress. Otherwise, as you mentioned
>> transitive closure is repeated joins, so pretty much many of the join
>> optimizations apply such as predicate pushdown, and aggregation
>> pushdown/pull up.
>>
>> Regarding the effort, we can always start from basic features and expand
>> from there. I have already started working on the parser, AST and logical
>> plan builder for basic Datalog without recursion. I am happy to create a
>> JIRA ticket to track this effort there.
>>
>> Thanks,
>> Walaa.
>>
>>
>> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org> wrote:
>>
>> > I've been thinking about Datalog front end to Calcite. How much effort
>> > would it be? Would there be an audience who would find it useful?
>> >
>> > The genesis of the idea is talks by Frank McSherry[1] and Vasia
>> > Kalavri[2] about graph queries and in particular Timely
>> > Dataflow[3][4], and a paper about using Datalog for graph processing
>> > [5].
>> >
>> > A few observations:
>> > * Graph queries require repeated (unbounded) joins, and so are beyond
>> > SQL:92.
>> > * Datalog allows recursion, and can therefore compute things like
>> > transitive closure, and is therefore powerful enough for graph
>> > queries.
>> > * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
>> > queries. However, for a variety of reasons, people tend not to use SQL
>> > for querying graph databases.
>> > * Datalog is more than just recursive queries. For instance, it is
>> > popular with academics analyzing the power/complexity of languages.
>> > And it can do deductive queries.
>> > * There are different "strengths" of Datalog. Some variants support
>> > only linearizable recursion; most variants only support queries whose
>> > results are stratified (i.e. can be defined using well-founded
>> > induction, necessary when negations are involved).
>> > * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
>> > discussed how they can handle graph queries and iterative computation.
>> > However they have mainly focused on improvements to their engine and
>> > physical algebra, not looked at logical algebra or query language.
>> > * If Calcite's algebra could express deductive query / recursive query
>> > / iteration, then not only would Datalog be possible, but also SQL's
>> > WITH RECURSIVE, and also SPARQL.
>> >
>> > So, questions on my mind:
>> > * What new operator(s) would we add to Calcite's algebra to enable
>> > recursive query?
>> > * What optimization rules are possible/necessary for graph queries?
>> > * How much effort would it be to add a Datalog parser to Calcite and
>> > translate to Calcite's algebra?
>> >
>> > Julian
>> >
>> > [1] http://www.dataengconf.com/scalability-but-at-what-cost
>> >
>> > [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>> >
>> > [3] https://github.com/frankmcsherry/timely-dataflow
>> >
>> > [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>> >
>> > [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>> >
>>

Re: Recursive query, graph query, Datalog

Posted by Edmon Begoli <eb...@gmail.com>.

Great initiative.

I will also share some comparative performance studies we did at ORNL on
different graph processing engines. Could be useful.

On Fri, Dec 15, 2017 at 14:11 Walaa Eldin Moustafa <wa...@gmail.com>
wrote:

> Hi Julian,
>
> Thanks for referencing our Datalog query processing paper [5]. I have been
> thinking about the same idea for a while now too :) I think Calcite is very
> well positioned as a generic query optimizer to add Datalog/recursive query
> support. Also, it makes a lot of sense since it opens a completely new
> dimension for the kind of logic and queries that Calcite can handle,
> including but not limited to graph queries, and that can be immediately
> available to engines talking to Calcite.
>
> To answer your questions, we probably need to add a transitive closure
> operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/> by
> Rakesh Agrawal proposes the notion of alpha relations, and defines an alpha
> operator on top of them which computes the transitive closure of alpha
> relations. The operator fits well with the rest of Cod's relational algebra
> operators.
>
> For query optimizations, one of the commonly used Datalog optimizations is
> Semi-naive evaluation, where instead of re-evaluating the recursive program
> using all existing facts, only new facts inferred since last iteration are
> used. Datalog optimizations become much more interesting when introducing
> aggregation and negation, and it is still an open research question, but
> there is already some tangible progress. Otherwise, as you mentioned
> transitive closure is repeated joins, so pretty much many of the join
> optimizations apply such as predicate pushdown, and aggregation
> pushdown/pull up.
>
> Regarding the effort, we can always start from basic features and expand
> from there. I have already started working on the parser, AST and logical
> plan builder for basic Datalog without recursion. I am happy to create a
> JIRA ticket to track this effort there.
>
> Thanks,
> Walaa.
>
>
> On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org> wrote:
>
> > I've been thinking about Datalog front end to Calcite. How much effort
> > would it be? Would there be an audience who would find it useful?
> >
> > The genesis of the idea is talks by Frank McSherry[1] and Vasia
> > Kalavri[2] about graph queries and in particular Timely
> > Dataflow[3][4], and a paper about using Datalog for graph processing
> > [5].
> >
> > A few observations:
> > * Graph queries require repeated (unbounded) joins, and so are beyond
> > SQL:92.
> > * Datalog allows recursion, and can therefore compute things like
> > transitive closure, and is therefore powerful enough for graph
> > queries.
> > * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
> > queries. However, for a variety of reasons, people tend not to use SQL
> > for querying graph databases.
> > * Datalog is more than just recursive queries. For instance, it is
> > popular with academics analyzing the power/complexity of languages.
> > And it can do deductive queries.
> > * There are different "strengths" of Datalog. Some variants support
> > only linearizable recursion; most variants only support queries whose
> > results are stratified (i.e. can be defined using well-founded
> > induction, necessary when negations are involved).
> > * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
> > discussed how they can handle graph queries and iterative computation.
> > However they have mainly focused on improvements to their engine and
> > physical algebra, not looked at logical algebra or query language.
> > * If Calcite's algebra could express deductive query / recursive query
> > / iteration, then not only would Datalog be possible, but also SQL's
> > WITH RECURSIVE, and also SPARQL.
> >
> > So, questions on my mind:
> > * What new operator(s) would we add to Calcite's algebra to enable
> > recursive query?
> > * What optimization rules are possible/necessary for graph queries?
> > * How much effort would it be to add a Datalog parser to Calcite and
> > translate to Calcite's algebra?
> >
> > Julian
> >
> > [1] http://www.dataengconf.com/scalability-but-at-what-cost
> >
> > [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
> >
> > [3] https://github.com/frankmcsherry/timely-dataflow
> >
> > [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
> >
> > [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
> >
>

Re: Recursive query, graph query, Datalog

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.

Hi Julian,

Thanks for referencing our Datalog query processing paper [5]. I have been
thinking about the same idea for a while now too :) I think Calcite is very
well positioned as a generic query optimizer to add Datalog/recursive query
support. Also, it makes a lot of sense since it opens a completely new
dimension for the kind of logic and queries that Calcite can handle,
including but not limited to graph queries, and that can be immediately
available to engines talking to Calcite.

To answer your questions, we probably need to add a transitive closure
operator. This 1988 paper <http://ieeexplore.ieee.org/document/42731/> by
Rakesh Agrawal proposes the notion of alpha relations, and defines an alpha
operator on top of them which computes the transitive closure of alpha
relations. The operator fits well with the rest of Cod's relational algebra
operators.

For query optimizations, one of the commonly used Datalog optimizations is
Semi-naive evaluation, where instead of re-evaluating the recursive program
using all existing facts, only new facts inferred since last iteration are
used. Datalog optimizations become much more interesting when introducing
aggregation and negation, and it is still an open research question, but
there is already some tangible progress. Otherwise, as you mentioned
transitive closure is repeated joins, so pretty much many of the join
optimizations apply such as predicate pushdown, and aggregation
pushdown/pull up.

Regarding the effort, we can always start from basic features and expand
from there. I have already started working on the parser, AST and logical
plan builder for basic Datalog without recursion. I am happy to create a
JIRA ticket to track this effort there.

Thanks,
Walaa.

On Fri, Dec 15, 2017 at 10:26 AM, Julian Hyde <jh...@apache.org> wrote:

> I've been thinking about Datalog front end to Calcite. How much effort
> would it be? Would there be an audience who would find it useful?
>
> The genesis of the idea is talks by Frank McSherry[1] and Vasia
> Kalavri[2] about graph queries and in particular Timely
> Dataflow[3][4], and a paper about using Datalog for graph processing
> [5].
>
> A few observations:
> * Graph queries require repeated (unbounded) joins, and so are beyond
> SQL:92.
> * Datalog allows recursion, and can therefore compute things like
> transitive closure, and is therefore powerful enough for graph
> queries.
> * SQL:99 added 'WITH RECURSIVE' so can handle a pretty useful class of
> queries. However, for a variety of reasons, people tend not to use SQL
> for querying graph databases.
> * Datalog is more than just recursive queries. For instance, it is
> popular with academics analyzing the power/complexity of languages.
> And it can do deductive queries.
> * There are different "strengths" of Datalog. Some variants support
> only linearizable recursion; most variants only support queries whose
> results are stratified (i.e. can be defined using well-founded
> induction, necessary when negations are involved).
> * The "big data" languages (Hadoop, Spark, Flink, even Pig) have all
> discussed how they can handle graph queries and iterative computation.
> However they have mainly focused on improvements to their engine and
> physical algebra, not looked at logical algebra or query language.
> * If Calcite's algebra could express deductive query / recursive query
> / iteration, then not only would Datalog be possible, but also SQL's
> WITH RECURSIVE, and also SPARQL.
>
> So, questions on my mind:
> * What new operator(s) would we add to Calcite's algebra to enable
> recursive query?
> * What optimization rules are possible/necessary for graph queries?
> * How much effort would it be to add a Datalog parser to Calcite and
> translate to Calcite's algebra?
>
> Julian
>
> [1] http://www.dataengconf.com/scalability-but-at-what-cost
>
> [2] https://qconsf.com/sf2017/speakers/vasia-kalavri
>
> [3] https://github.com/frankmcsherry/timely-dataflow
>
> [4] http://sigops.org/sosp/sosp13/papers/p439-murray.pdf
>
> [5] http://www.sysnet.ucsd.edu/sysnet/miscpapers/datalog-icbd16.pdf
>