You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by Jacques Nadeau <ja...@apache.org> on 2015/02/08 03:47:43 UTC

Incorrect logic in DrillScanRel

Hey Guys,

When reviewing our code I ran across this code:

https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java#L70

The problem with this code is it converts no columns into every column.
This means we waste a bunch of time readers where we don't need to project
any columns.  An example is select count(*) which turns into select
count(0).  In that case, we should avoid projecting any columns other than
a count column.  However, because we translate no columns into every
column, we read way more than we should.  I remember someone (maybe Hanifi)
working through some rules around not allowing a blank list of columns
anywhere.  What was the thinking and how do we correct.  We should treat an
empty list different from a null list different from an all list.

thoughts?

Re: Incorrect logic in DrillScanRel

Posted by Hanifi Gunes <hg...@maprtech.com>.
Filed https://issues.apache.org/jira/browse/DRILL-2192 to track this.

On Sun, Feb 8, 2015 at 3:03 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I think the following makes sense (from your email)
>
> null -> scan-all
> non-empty with star -> scan-all
> non-empty w/o star -> regular / scan-some
> empty -> count(*) / skip-all
>
> The reason I identified it in the DrillScanRel is because the planner
> actually returns an empty set for the count(*) case as that actually gets
> rewritten as count(0) or count(1) if I recall correctly.  As such, the list
> of columns needed to satisfy are the empty set.
>
> Agreed that most (all?) readers don't support currently.  We should address
> this separately and make sure the readers handle what they are given (e.g.
> if they don't support skip-all, they should convert to something they do
> support) rather than removing the capability in the Rel layer.
>
> On Sun, Feb 8, 2015 at 2:44 PM, Hanifi Gunes <hg...@maprtech.com> wrote:
>
> > - We should treat an empty list different from a null list different from
> > an all list.
> > Agreed. What would each mean to DrillScanRel though?
> > null -> scan-all or problem?
> > non-empty with star -> scan-all
> > non-empty w/o star -> regular / scan-some
> > empty -> count(*) / skip-all
> >
> >
> > I have no concrete understanding of what planner outputs as the list of
> > columns in case of count(*) vs select * queries. It would be nice to
> > differentiate these cases without use of "null" to mean anything special
> > though. Earlier, planner -at least in DrillScanRule- used to invoke a
> c'tor
> > that used to pass "null" to mean "all" which led some ambiguity: is it
> that
> > code is broken or callee asks me to scan everything? This was one of the
> > reasons for disallowing nulls. Another one was to wipe out all null
> checks
> > on columns from all of the readers.
> >
> >
> > The other part of the consideration is readers. Readers don't seem to
> > support skip-all queries particularly for count. Looks like the current
> > JsonReader implementation(so does mongo reader) after major refactoring
> > also relies on a non-empty list of columns while ensuring at least one
> > column is scanned for count queries, which I doubt works as intended in
> > case an empty list is used to mean skip-all. Part of the code change to
> > make count(*) more efficient is to make every reader understand skip-all
> > for count queries.
> >
> >
> > Earlier we had some discussions on how to make count(*) more efficient as
> > well. A good starting point could be to resolve ambiguity at the planner
> > side with the help of planning experts perhaps. If reader knows for sure
> > that she executing a count(*) query, she should be able to output a cheap
> > count vector instead of scanning everything without enforcing schema
> check
> > on the count column for readers that is schema-ed.
> >
> > How does this sound?
> >
> >
> > Thanks.
> > -Hanifi
> >
> >
> > On Sat, Feb 7, 2015 at 6:47 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> >
> > > Hey Guys,
> > >
> > > When reviewing our code I ran across this code:
> > >
> > >
> > >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java#L70
> > >
> > > The problem with this code is it converts no columns into every column.
> > > This means we waste a bunch of time readers where we don't need to
> > project
> > > any columns.  An example is select count(*) which turns into select
> > > count(0).  In that case, we should avoid projecting any columns other
> > than
> > > a count column.  However, because we translate no columns into every
> > > column, we read way more than we should.  I remember someone (maybe
> > Hanifi)
> > > working through some rules around not allowing a blank list of columns
> > > anywhere.  What was the thinking and how do we correct.  We should
> treat
> > an
> > > empty list different from a null list different from an all list.
> > >
> > > thoughts?
> > >
> >
>

Re: Incorrect logic in DrillScanRel

Posted by Jacques Nadeau <ja...@apache.org>.
I think the following makes sense (from your email)

null -> scan-all
non-empty with star -> scan-all
non-empty w/o star -> regular / scan-some
empty -> count(*) / skip-all

The reason I identified it in the DrillScanRel is because the planner
actually returns an empty set for the count(*) case as that actually gets
rewritten as count(0) or count(1) if I recall correctly.  As such, the list
of columns needed to satisfy are the empty set.

Agreed that most (all?) readers don't support currently.  We should address
this separately and make sure the readers handle what they are given (e.g.
if they don't support skip-all, they should convert to something they do
support) rather than removing the capability in the Rel layer.

On Sun, Feb 8, 2015 at 2:44 PM, Hanifi Gunes <hg...@maprtech.com> wrote:

> - We should treat an empty list different from a null list different from
> an all list.
> Agreed. What would each mean to DrillScanRel though?
> null -> scan-all or problem?
> non-empty with star -> scan-all
> non-empty w/o star -> regular / scan-some
> empty -> count(*) / skip-all
>
>
> I have no concrete understanding of what planner outputs as the list of
> columns in case of count(*) vs select * queries. It would be nice to
> differentiate these cases without use of "null" to mean anything special
> though. Earlier, planner -at least in DrillScanRule- used to invoke a c'tor
> that used to pass "null" to mean "all" which led some ambiguity: is it that
> code is broken or callee asks me to scan everything? This was one of the
> reasons for disallowing nulls. Another one was to wipe out all null checks
> on columns from all of the readers.
>
>
> The other part of the consideration is readers. Readers don't seem to
> support skip-all queries particularly for count. Looks like the current
> JsonReader implementation(so does mongo reader) after major refactoring
> also relies on a non-empty list of columns while ensuring at least one
> column is scanned for count queries, which I doubt works as intended in
> case an empty list is used to mean skip-all. Part of the code change to
> make count(*) more efficient is to make every reader understand skip-all
> for count queries.
>
>
> Earlier we had some discussions on how to make count(*) more efficient as
> well. A good starting point could be to resolve ambiguity at the planner
> side with the help of planning experts perhaps. If reader knows for sure
> that she executing a count(*) query, she should be able to output a cheap
> count vector instead of scanning everything without enforcing schema check
> on the count column for readers that is schema-ed.
>
> How does this sound?
>
>
> Thanks.
> -Hanifi
>
>
> On Sat, Feb 7, 2015 at 6:47 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > Hey Guys,
> >
> > When reviewing our code I ran across this code:
> >
> >
> >
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java#L70
> >
> > The problem with this code is it converts no columns into every column.
> > This means we waste a bunch of time readers where we don't need to
> project
> > any columns.  An example is select count(*) which turns into select
> > count(0).  In that case, we should avoid projecting any columns other
> than
> > a count column.  However, because we translate no columns into every
> > column, we read way more than we should.  I remember someone (maybe
> Hanifi)
> > working through some rules around not allowing a blank list of columns
> > anywhere.  What was the thinking and how do we correct.  We should treat
> an
> > empty list different from a null list different from an all list.
> >
> > thoughts?
> >
>

Re: Incorrect logic in DrillScanRel

Posted by Hanifi Gunes <hg...@maprtech.com>.
- We should treat an empty list different from a null list different from
an all list.
Agreed. What would each mean to DrillScanRel though?
null -> scan-all or problem?
non-empty with star -> scan-all
non-empty w/o star -> regular / scan-some
empty -> count(*) / skip-all


I have no concrete understanding of what planner outputs as the list of
columns in case of count(*) vs select * queries. It would be nice to
differentiate these cases without use of "null" to mean anything special
though. Earlier, planner -at least in DrillScanRule- used to invoke a c'tor
that used to pass "null" to mean "all" which led some ambiguity: is it that
code is broken or callee asks me to scan everything? This was one of the
reasons for disallowing nulls. Another one was to wipe out all null checks
on columns from all of the readers.


The other part of the consideration is readers. Readers don't seem to
support skip-all queries particularly for count. Looks like the current
JsonReader implementation(so does mongo reader) after major refactoring
also relies on a non-empty list of columns while ensuring at least one
column is scanned for count queries, which I doubt works as intended in
case an empty list is used to mean skip-all. Part of the code change to
make count(*) more efficient is to make every reader understand skip-all
for count queries.


Earlier we had some discussions on how to make count(*) more efficient as
well. A good starting point could be to resolve ambiguity at the planner
side with the help of planning experts perhaps. If reader knows for sure
that she executing a count(*) query, she should be able to output a cheap
count vector instead of scanning everything without enforcing schema check
on the count column for readers that is schema-ed.

How does this sound?


Thanks.
-Hanifi


On Sat, Feb 7, 2015 at 6:47 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Hey Guys,
>
> When reviewing our code I ran across this code:
>
>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillScanRel.java#L70
>
> The problem with this code is it converts no columns into every column.
> This means we waste a bunch of time readers where we don't need to project
> any columns.  An example is select count(*) which turns into select
> count(0).  In that case, we should avoid projecting any columns other than
> a count column.  However, because we translate no columns into every
> column, we read way more than we should.  I remember someone (maybe Hanifi)
> working through some rules around not allowing a blank list of columns
> anywhere.  What was the thinking and how do we correct.  We should treat an
> empty list different from a null list different from an all list.
>
> thoughts?
>