You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Ted Dunning <te...@gmail.com> on 2012/10/13 04:37:39 UTC

updates to logical plan spec

I talked to Jason some more.  He had some very good suggestions.

a) some operators need to have multiple outputs.  For instance, the group
operator needs to output the main data stream and a reference to the
grouped field

b) what Julian was calling nest/unnest is more naturally called explode and
flatten.  The idea is that some field has a list-like value and the output
will be each of those values.  Actually, there are two outputs.  One is the
original input and the second is the explode sequence.  This can be the
input to a DAG which does whatever we want to that exploded sequence,
typically aggregating it, but really doing whatever we want.  Then the
flatten operator handles splicing the output of the sub-DAG into the
original record that had the list-like value.  There are two outputs of the
flatten operator as well, which are the main data flow and a reference to
the output of the DAG in the main data.

This style handles all of the normal grouping/aggregating type of things we
want to do and it also handles all of Dremel's within syntax.

I also realized a few things as well

1) the bind needs to be rooted in some data source so that we can
understand scoping relative to schemas

2) there is an important difference between two separate outputs of a DAG
element and a single output that goes two places.

3) everywhere I was wanting to inject an output field name can be handled
by multiple outputs



I think that the logical plan spec is ready for two things, both of which
can be done by somebody other than me:


A - We can now start trying to convert an abstract syntax tree from
Dremel-ish source into a logical plan

B - We can implement a toy interpreter for the logical plan that transforms
sequences of trees.

Re: updates to logical plan spec

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Oct 13, 2012 at 2:03 AM, Julian Hyde <ju...@gmail.com> wrote:

> I guess I've been thinking about something a little different. I've been
> thinking about what would be an appropriate algebra to internally
> represent, manipulate, and optimize Drill queries.
>

I don't think that we are far apart.

> My conclusion is that the best candidate is relational algebra, augmented
> with data types that allow collections of nested records and with "explode"
> and "implode" operators. ("explode" takes a record with nested records and
> converts it to a sequence of flat records, plus a "location" value that
> indicates how the nested record fits into the parent; "implode" is the
> inverse: it takes a sequence of flat records with a "location" field and
> converts into a nested record.)

This is very close to what I now have in the plan document.

> ...
> The algebra is (probably) higher level than what Ted calls a "logical
> plan". His operators produce two outputs, and while that makes perfect
> sense for physical operators, it is difficult to reason about.
>

The multiple outputs fall into two cases:

1) a secondary output that is simply a reference to a new value that
doesn't yet have a name.

2) explode provides a reference to the original data stream.  If you think
that flatten (aka implode) doesn't need that, then this isn't necessary.

I don't think that cases of (1) affect reasoning about the DAG and I think
that the unique case of (2) may be something we can eliminate.

To fix up DrQL's "where" operator, we convert it to "explode" followed by
> "filter" followed by "implode". To fix up aggregate "within", we apply
> "explode" then aggregate. We find that we never need to operate on trees.
> If we need to operate on a tree, we explode into several records, apply
> relational operators on those records, and if necessary implode back again.
> We're operating in relational algebra.
>

I think that this works with the current plan document.

> Am I over-engineering here? It's possible. Maybe Drill doesn't need query
> optimization. Maybe queries can go straight from a DrQL parse tree to a DAG
> of operators using a straightforward mapping.

If so, the query optimizer will be very simple.  But I am pretty sure that
getting delayed record assembly to work right will require reasoning such
as you suggest.

Re: updates to logical plan spec

Posted by Julian Hyde <ju...@gmail.com>.

I guess I've been thinking about something a little different. I've been thinking about what would be an appropriate algebra to internally represent, manipulate, and optimize Drill queries. 

My conclusion is that the best candidate is relational algebra, augmented with data types that allow collections of nested records and with "explode" and "implode" operators. ("explode" takes a record with nested records and converts it to a sequence of flat records, plus a "location" value that indicates how the nested record fits into the parent; "implode" is the inverse: it takes a sequence of flat records with a "location" field and converts into a nested record.) Here is how I came to that conclusion.

The algebra is lower level than DrQL: viz, it would have fewer operators, and the operators would be easier to reason about and to write transformation rules for. (Anyone who thinks that DrQL's 'where' operator is straightforward should ponder why BigQuery will sometimes give the error "Cannot query the cross product of repeated fields".)

The algebra is (probably) higher level than what Ted calls a "logical plan". His operators produce two outputs, and while that makes perfect sense for physical operators, it is difficult to reason about.

Here are a few reasons why I consider DrQL to be a less clean model than the relational model. As I've said before, the "where" operator has a much more complex behavior than SQL's where. It is best understood as decomposing records, applying a filtering predicate, then re-composing the fragments of the row that made it through the filter. The "within" clause is a nice shorthand, but is too limited to be considered a full operator. Trees (collections within rows) are similar to relations (collections of rows) but are handled using different operators. If the "tree" model was as powerful as advertised then we wouldn't need the concept of "relation" at all.

That doesn't mean that DrQL is not a good query language. It seems to be concise, and users learn it quickly and like it. Syntactic sugar operators like "within" is totally appropriate (just like syntactic sugar "select distinct" and "having" in SQL).

To fix up DrQL's "where" operator, we convert it to "explode" followed by "filter" followed by "implode". To fix up aggregate "within", we apply "explode" then aggregate. We find that we never need to operate on trees. If we need to operate on a tree, we explode into several records, apply relational operators on those records, and if necessary implode back again. We're operating in relational algebra.

This is good news, because relational algebra is well behaved and well understood.

And by the way, even if the algebra is about exploded sets of flat records, there's no reason that the physical operators can't operate on tree-structured records. We could recognize explode-followed-by-filter-followed-by-implode and implement an operator that does precisely the same as a DrQL "where" clause.

Am I over-engineering here? It's possible. Maybe Drill doesn't need query optimization. Maybe queries can go straight from a DrQL parse tree to a DAG of operators using a straightforward mapping. But I'll argue that many people will come to Drill with SQL queries, or queries very similar to SQL, data sets with minimal nesting, and will be saddened when Drill can't execute their queries. This particular user kicked the tires, was impressed with the speed of the car, but was disappointed that he couldn't drive it where he wanted to go: http://cwebbbi.wordpress.com/2012/05/20/a-look-at-google-bigquery/.

Julian

On Oct 12, 2012, at 7:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I talked to Jason some more.  He had some very good suggestions.
> 
> a) some operators need to have multiple outputs.  For instance, the group
> operator needs to output the main data stream and a reference to the
> grouped field
> 
> b) what Julian was calling nest/unnest is more naturally called explode and
> flatten.  The idea is that some field has a list-like value and the output
> will be each of those values.  Actually, there are two outputs.  One is the
> original input and the second is the explode sequence.  This can be the
> input to a DAG which does whatever we want to that exploded sequence,
> typically aggregating it, but really doing whatever we want.  Then the
> flatten operator handles splicing the output of the sub-DAG into the
> original record that had the list-like value.  There are two outputs of the
> flatten operator as well, which are the main data flow and a reference to
> the output of the DAG in the main data.
> 
> This style handles all of the normal grouping/aggregating type of things we
> want to do and it also handles all of Dremel's within syntax.
> 
> I also realized a few things as well
> 
> 1) the bind needs to be rooted in some data source so that we can
> understand scoping relative to schemas
> 
> 2) there is an important difference between two separate outputs of a DAG
> element and a single output that goes two places.
> 
> 3) everywhere I was wanting to inject an output field name can be handled
> by multiple outputs
> 
> 
> 
> I think that the logical plan spec is ready for two things, both of which
> can be done by somebody other than me:
> 
> 
> A - We can now start trying to convert an abstract syntax tree from
> Dremel-ish source into a logical plan
> 
> B - We can implement a toy interpreter for the logical plan that transforms
> sequences of trees.