You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by weijie tong <to...@gmail.com> on 2017/07/26 11:14:57 UTC

Drill query planning error

HI all:

  I materialize the count distinct query result to a cache, then when user
query the count distinct , a specific rule will translate the query to the
cache. It turns out right when the query has only one count (distinct )
operator ,but when it has two count (distinct ) ,it causes error .The error
info is here:
https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269


Best Regards.

Re: Drill query planning error

Posted by weijie tong <to...@gmail.com>.

here is the jira issue link:
https://issues.apache.org/jira/browse/DRILL-5691

On Thu, Jul 27, 2017 at 8:32 AM, weijie tong <to...@gmail.com>
wrote:

> another tips ,the coun distinct query examples mentioned above are all
> transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to
> Project,Scan.    Jira will be logged soon later as I am on road.
>
> On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <jh...@apache.org> wrote:
>
>> I think the “multiple distinct count” code path also gets triggered if
>> there’s a mixture of distinct and non-distinct aggregates.
>>
>> > On Jul 26, 2017, at 5:16 PM, Aman Sinha <am...@apache.org> wrote:
>> >
>> > It sounds like you have narrowed it down further and it could be an
>> issue
>> > with the scalar check when 2 or more cartesian joins are present.
>> > Please file a JIRA with relevant details.
>> >
>> > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie178@gmail.com
>> <ma...@gmail.com>>
>> > wrote:
>> >
>> >> Thanks for pointing out the possible reasons @Aman @Julian . I am not
>> sure
>> >> that's the real problem .
>> >>
>> >> As I initially mentioned,sorry for not clearly described ,queries like
>> >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
>> >> right.but "select count(*),sum(a),count (distinct b),count(distinct c)
>> from
>> >> t where dt=xxx"will be wrong.  The only one count(distinct) scenario
>> does
>> >> do the Cartesian join check too.It will and a broadcast operator on my
>> >> transferred relnode. So I think I have not loose the scalar property.
>> But
>> >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>> >> check for satisfying the Cartesian conditions happened only once.It
>> sounds
>> >> should happen twice as there are two nested loop join operators at the
>> two
>> >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>> >>
>> >> Best Regards
>> >>
>> >>
>> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansinha@apache.org
>> <ma...@apache.org>> wrote:
>> >>
>> >>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks
>> for
>> >>> pointing that out.   I'll see if we can leverage that.  The grouping
>> sets
>> >>> is not yet supported in Drill, but that would be a better solution
>> since
>> >> it
>> >>> avoids extra scans.
>> >>>
>> >>> -Aman
>> >>>
>> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jhyde@apache.org
>> <ma...@apache.org>> wrote:
>> >>>
>> >>>> Aman,
>> >>>>
>> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
>> >> question.
>> >>>>
>> >>>> A blanket restriction on cartesian joins is a blunt instrument.
>> >> Sometimes
>> >>>> cartesian joins are valid, safe, and the best plan for a query. This
>> >> is a
>> >>>> case in point. Users shouldn’t have to change config parameters to
>> get
>> >> it
>> >>>> to work.
>> >>>>
>> >>>> (Actually I don’t know the query, but
>> >>>>
>> >>>>  select count(distinct deptno), count(distinct gender) from emp
>> >>>>
>> >>>> is equivalent.)
>> >>>>
>> >>>> Drill should detect that a relational expression can return at most
>> one
>> >>>> row, and allow a cartesian join if one side is such. Calcite has a
>> >>>> RelMdMaxRowCount statistic for this. This was added as part of
>> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>> http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
>> >>>> safe. No config parameters required.
>> >>>>
>> >>>> Also, Calcite has an alternative way of handling multiple distinct
>> >>>> aggregates that rewrites to use grouping sets. It doesn’t generate
>> >>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
>> <http://issues.apache.org/jira/>
>> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732
>> <http://issues.apache.org/jira/browse/CALCITE-732>>.
>> >>>>
>> >>>> Julian
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org>
>> >> wrote:
>> >>>>>
>> >>>>> [Since this is Drill specific, I put dev@calcite on BCC].
>> >>>>>
>> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b),
>> the
>> >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
>> >> each
>> >>> of
>> >>>>> which first does a group-by on the distinct column followed by a
>> >> count
>> >>>>> aggregate.   By default,  Drill only processes cartesian join if one
>> >>>> input
>> >>>>> of the join is known to be scalar (single row).  It sounds like
>> after
>> >>> you
>> >>>>> did the transformation to use the cache, that scalar property
>> somehow
>> >>> did
>> >>>>> not get propagated.
>> >>>>> You can override this behavior by a session configuration:  (this
>> >> will
>> >>>> use
>> >>>>> a nested loop join even if inputs are not provably scalar, but it
>> >>> should
>> >>>> be
>> >>>>> used for specific query only).    For a more general solution, I
>> >>> believe
>> >>>>> you may have to create an enhancement JIRA with appropriate details.
>> >>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
>> >>>>>
>> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
>> >> tongweijie178@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> HI all:
>> >>>>>>
>> >>>>>> I materialize the count distinct query result to a cache, then when
>> >>>> user
>> >>>>>> query the count distinct , a specific rule will translate the query
>> >> to
>> >>>> the
>> >>>>>> cache. It turns out right when the query has only one count
>> >> (distinct
>> >>> )
>> >>>>>> operator ,but when it has two count (distinct ) ,it causes error
>> >> .The
>> >>>> error
>> >>>>>> info is here:
>> >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee522
>> 69
>> >>>>>>
>> >>>>>>
>> >>>>>> Best Regards.
>>
>>

Re: Drill query planning error

Posted by weijie tong <to...@gmail.com>.

another tips ,the coun distinct query examples mentioned above are all
transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to
Project,Scan.    Jira will be logged soon later as I am on road.

On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <jh...@apache.org> wrote:

> I think the “multiple distinct count” code path also gets triggered if
> there’s a mixture of distinct and non-distinct aggregates.
>
> > On Jul 26, 2017, at 5:16 PM, Aman Sinha <am...@apache.org> wrote:
> >
> > It sounds like you have narrowed it down further and it could be an issue
> > with the scalar check when 2 or more cartesian joins are present.
> > Please file a JIRA with relevant details.
> >
> > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie178@gmail.com
> <ma...@gmail.com>>
> > wrote:
> >
> >> Thanks for pointing out the possible reasons @Aman @Julian . I am not
> sure
> >> that's the real problem .
> >>
> >> As I initially mentioned,sorry for not clearly described ,queries like
> >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
> >> right.but "select count(*),sum(a),count (distinct b),count(distinct c)
> from
> >> t where dt=xxx"will be wrong.  The only one count(distinct) scenario
> does
> >> do the Cartesian join check too.It will and a broadcast operator on my
> >> transferred relnode. So I think I have not loose the scalar property.
> But
> >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
> >> check for satisfying the Cartesian conditions happened only once.It
> sounds
> >> should happen twice as there are two nested loop join operators at the
> two
> >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
> >>
> >> Best Regards
> >>
> >>
> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansinha@apache.org
> <ma...@apache.org>> wrote:
> >>
> >>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks
> for
> >>> pointing that out.   I'll see if we can leverage that.  The grouping
> sets
> >>> is not yet supported in Drill, but that would be a better solution
> since
> >> it
> >>> avoids extra scans.
> >>>
> >>> -Aman
> >>>
> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jhyde@apache.org
> <ma...@apache.org>> wrote:
> >>>
> >>>> Aman,
> >>>>
> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
> >> question.
> >>>>
> >>>> A blanket restriction on cartesian joins is a blunt instrument.
> >> Sometimes
> >>>> cartesian joins are valid, safe, and the best plan for a query. This
> >> is a
> >>>> case in point. Users shouldn’t have to change config parameters to get
> >> it
> >>>> to work.
> >>>>
> >>>> (Actually I don’t know the query, but
> >>>>
> >>>>  select count(distinct deptno), count(distinct gender) from emp
> >>>>
> >>>> is equivalent.)
> >>>>
> >>>> Drill should detect that a relational expression can return at most
> one
> >>>> row, and allow a cartesian join if one side is such. Calcite has a
> >>>> RelMdMaxRowCount statistic for this. This was added as part of
> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
> http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
> >>>> safe. No config parameters required.
> >>>>
> >>>> Also, Calcite has an alternative way of handling multiple distinct
> >>>> aggregates that rewrites to use grouping sets. It doesn’t generate
> >>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/ <
> http://issues.apache.org/jira/>
> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732
> <http://issues.apache.org/jira/browse/CALCITE-732>>.
> >>>>
> >>>> Julian
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org>
> >> wrote:
> >>>>>
> >>>>> [Since this is Drill specific, I put dev@calcite on BCC].
> >>>>>
> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the
> >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
> >> each
> >>> of
> >>>>> which first does a group-by on the distinct column followed by a
> >> count
> >>>>> aggregate.   By default,  Drill only processes cartesian join if one
> >>>> input
> >>>>> of the join is known to be scalar (single row).  It sounds like after
> >>> you
> >>>>> did the transformation to use the cache, that scalar property somehow
> >>> did
> >>>>> not get propagated.
> >>>>> You can override this behavior by a session configuration:  (this
> >> will
> >>>> use
> >>>>> a nested loop join even if inputs are not provably scalar, but it
> >>> should
> >>>> be
> >>>>> used for specific query only).    For a more general solution, I
> >>> believe
> >>>>> you may have to create an enhancement JIRA with appropriate details.
> >>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
> >>>>>
> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
> >> tongweijie178@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> HI all:
> >>>>>>
> >>>>>> I materialize the count distinct query result to a cache, then when
> >>>> user
> >>>>>> query the count distinct , a specific rule will translate the query
> >> to
> >>>> the
> >>>>>> cache. It turns out right when the query has only one count
> >> (distinct
> >>> )
> >>>>>> operator ,but when it has two count (distinct ) ,it causes error
> >> .The
> >>>> error
> >>>>>> info is here:
> >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> >>>>>>
> >>>>>>
> >>>>>> Best Regards.
>
>

Re: Drill query planning error

Posted by Julian Hyde <jh...@apache.org>.

I think the “multiple distinct count” code path also gets triggered if there’s a mixture of distinct and non-distinct aggregates.

> On Jul 26, 2017, at 5:16 PM, Aman Sinha <am...@apache.org> wrote:
> 
> It sounds like you have narrowed it down further and it could be an issue
> with the scalar check when 2 or more cartesian joins are present.
> Please file a JIRA with relevant details.
> 
> On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie178@gmail.com <ma...@gmail.com>>
> wrote:
> 
>> Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
>> that's the real problem .
>> 
>> As I initially mentioned,sorry for not clearly described ,queries like
>> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
>> right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
>> t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
>> do the Cartesian join check too.It will and a broadcast operator on my
>> transferred relnode. So I think I have not loose the scalar property. But
>> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>> check for satisfying the Cartesian conditions happened only once.It sounds
>> should happen twice as there are two nested loop join operators at the two
>> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>> 
>> Best Regards
>> 
>> 
>> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansinha@apache.org <ma...@apache.org>> wrote:
>> 
>>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
>>> pointing that out.   I'll see if we can leverage that.  The grouping sets
>>> is not yet supported in Drill, but that would be a better solution since
>> it
>>> avoids extra scans.
>>> 
>>> -Aman
>>> 
>>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jhyde@apache.org <ma...@apache.org>> wrote:
>>> 
>>>> Aman,
>>>> 
>>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
>> question.
>>>> 
>>>> A blanket restriction on cartesian joins is a blunt instrument.
>> Sometimes
>>>> cartesian joins are valid, safe, and the best plan for a query. This
>> is a
>>>> case in point. Users shouldn’t have to change config parameters to get
>> it
>>>> to work.
>>>> 
>>>> (Actually I don’t know the query, but
>>>> 
>>>>  select count(distinct deptno), count(distinct gender) from emp
>>>> 
>>>> is equivalent.)
>>>> 
>>>> Drill should detect that a relational expression can return at most one
>>>> row, and allow a cartesian join if one side is such. Calcite has a
>>>> RelMdMaxRowCount statistic for this. This was added as part of
>>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>>>> http://issues.apache.org/jira/browse/CALCITE-604 <http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
>>>> safe. No config parameters required.
>>>> 
>>>> Also, Calcite has an alternative way of handling multiple distinct
>>>> aggregates that rewrites to use grouping sets. It doesn’t generate
>>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/ <http://issues.apache.org/jira/>
>>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>>.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org>
>> wrote:
>>>>> 
>>>>> [Since this is Drill specific, I put dev@calcite on BCC].
>>>>> 
>>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the
>>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
>> each
>>> of
>>>>> which first does a group-by on the distinct column followed by a
>> count
>>>>> aggregate.   By default,  Drill only processes cartesian join if one
>>>> input
>>>>> of the join is known to be scalar (single row).  It sounds like after
>>> you
>>>>> did the transformation to use the cache, that scalar property somehow
>>> did
>>>>> not get propagated.
>>>>> You can override this behavior by a session configuration:  (this
>> will
>>>> use
>>>>> a nested loop join even if inputs are not provably scalar, but it
>>> should
>>>> be
>>>>> used for specific query only).    For a more general solution, I
>>> believe
>>>>> you may have to create an enhancement JIRA with appropriate details.
>>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
>>>>> 
>>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
>> tongweijie178@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> HI all:
>>>>>> 
>>>>>> I materialize the count distinct query result to a cache, then when
>>>> user
>>>>>> query the count distinct , a specific rule will translate the query
>> to
>>>> the
>>>>>> cache. It turns out right when the query has only one count
>> (distinct
>>> )
>>>>>> operator ,but when it has two count (distinct ) ,it causes error
>> .The
>>>> error
>>>>>> info is here:
>>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>>>>>> 
>>>>>> 
>>>>>> Best Regards.

Re: Drill query planning error

Posted by Aman Sinha <am...@apache.org>.

It sounds like you have narrowed it down further and it could be an issue
with the scalar check when 2 or more cartesian joins are present.
Please file a JIRA with relevant details.

On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <to...@gmail.com>
wrote:

> Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
> that's the real problem .
>
> As I initially mentioned,sorry for not clearly described ,queries like
> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
> right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
> t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
> do the Cartesian join check too.It will and a broadcast operator on my
> transferred relnode. So I think I have not loose the scalar property. But
> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>  check for satisfying the Cartesian conditions happened only once.It sounds
> should happen twice as there are two nested loop join operators at the two
> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>
> Best Regards
>
>
> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <am...@apache.org> wrote:
>
> > Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
> > pointing that out.   I'll see if we can leverage that.  The grouping sets
> > is not yet supported in Drill, but that would be a better solution since
> it
> > avoids extra scans.
> >
> > -Aman
> >
> > On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org> wrote:
> >
> > > Aman,
> > >
> > > Thanks for moving dev@calcite to Bcc. This is properly a Drill
> question.
> > >
> > > A blanket restriction on cartesian joins is a blunt instrument.
> Sometimes
> > > cartesian joins are valid, safe, and the best plan for a query. This
> is a
> > > case in point. Users shouldn’t have to change config parameters to get
> it
> > > to work.
> > >
> > > (Actually I don’t know the query, but
> > >
> > >   select count(distinct deptno), count(distinct gender) from emp
> > >
> > > is equivalent.)
> > >
> > > Drill should detect that a relational expression can return at most one
> > > row, and allow a cartesian join if one side is such. Calcite has a
> > > RelMdMaxRowCount statistic for this. This was added as part of
> > > http://issues.apache.org/jira/browse/CALCITE-604 <
> > > http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100%
> > > safe. No config parameters required.
> > >
> > > Also, Calcite has an alternative way of handling multiple distinct
> > > aggregates that rewrites to use grouping sets. It doesn’t generate
> > > self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
> > > browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>.
> > >
> > > Julian
> > >
> > >
> > >
> > >
> > >
> > >
> > > > On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org>
> wrote:
> > > >
> > > > [Since this is Drill specific, I put dev@calcite on BCC].
> > > >
> > > > If you have two aggregates: Count(distinct a), Count(distinct b), the
> > > > Calcite logical plan consists of a cartesian join of 2 subqueries
> each
> > of
> > > > which first does a group-by on the distinct column followed by a
> count
> > > > aggregate.   By default,  Drill only processes cartesian join if one
> > > input
> > > > of the join is known to be scalar (single row).  It sounds like after
> > you
> > > > did the transformation to use the cache, that scalar property somehow
> > did
> > > > not get propagated.
> > > > You can override this behavior by a session configuration:  (this
> will
> > > use
> > > > a nested loop join even if inputs are not provably scalar, but it
> > should
> > > be
> > > > used for specific query only).    For a more general solution, I
> > believe
> > > > you may have to create an enhancement JIRA with appropriate details.
> > > >   'alter session set planner.enable_nljoin_for_scalar_only = false';
> > > >
> > > > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
> tongweijie178@gmail.com>
> > > > wrote:
> > > >
> > > >> HI all:
> > > >>
> > > >>  I materialize the count distinct query result to a cache, then when
> > > user
> > > >> query the count distinct , a specific rule will translate the query
> to
> > > the
> > > >> cache. It turns out right when the query has only one count
> (distinct
> > )
> > > >> operator ,but when it has two count (distinct ) ,it causes error
> .The
> > > error
> > > >> info is here:
> > > >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> > > >>
> > > >>
> > > >> Best Regards.
> > > >>
> > >
> > >
> >
>

Re: Drill query planning error

Posted by weijie tong <to...@gmail.com>.

Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
that's the real problem .

As I initially mentioned,sorry for not clearly described ,queries like
"select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
do the Cartesian join check too.It will and a broadcast operator on my
transferred relnode. So I think I have not loose the scalar property. But
when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
 check for satisfying the Cartesian conditions happened only once.It sounds
should happen twice as there are two nested loop join operators at the two
count(distinct) scenarios .If it turns out a bug ,I will log a Jira .

Best Regards


On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <am...@apache.org> wrote:

> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
> pointing that out.   I'll see if we can leverage that.  The grouping sets
> is not yet supported in Drill, but that would be a better solution since it
> avoids extra scans.
>
> -Aman
>
> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org> wrote:
>
> > Aman,
> >
> > Thanks for moving dev@calcite to Bcc. This is properly a Drill question.
> >
> > A blanket restriction on cartesian joins is a blunt instrument. Sometimes
> > cartesian joins are valid, safe, and the best plan for a query. This is a
> > case in point. Users shouldn’t have to change config parameters to get it
> > to work.
> >
> > (Actually I don’t know the query, but
> >
> >   select count(distinct deptno), count(distinct gender) from emp
> >
> > is equivalent.)
> >
> > Drill should detect that a relational expression can return at most one
> > row, and allow a cartesian join if one side is such. Calcite has a
> > RelMdMaxRowCount statistic for this. This was added as part of
> > http://issues.apache.org/jira/browse/CALCITE-604 <
> > http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100%
> > safe. No config parameters required.
> >
> > Also, Calcite has an alternative way of handling multiple distinct
> > aggregates that rewrites to use grouping sets. It doesn’t generate
> > self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
> > browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>.
> >
> > Julian
> >
> >
> >
> >
> >
> >
> > > On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org> wrote:
> > >
> > > [Since this is Drill specific, I put dev@calcite on BCC].
> > >
> > > If you have two aggregates: Count(distinct a), Count(distinct b), the
> > > Calcite logical plan consists of a cartesian join of 2 subqueries each
> of
> > > which first does a group-by on the distinct column followed by a count
> > > aggregate.   By default,  Drill only processes cartesian join if one
> > input
> > > of the join is known to be scalar (single row).  It sounds like after
> you
> > > did the transformation to use the cache, that scalar property somehow
> did
> > > not get propagated.
> > > You can override this behavior by a session configuration:  (this will
> > use
> > > a nested loop join even if inputs are not provably scalar, but it
> should
> > be
> > > used for specific query only).    For a more general solution, I
> believe
> > > you may have to create an enhancement JIRA with appropriate details.
> > >   'alter session set planner.enable_nljoin_for_scalar_only = false';
> > >
> > > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <to...@gmail.com>
> > > wrote:
> > >
> > >> HI all:
> > >>
> > >>  I materialize the count distinct query result to a cache, then when
> > user
> > >> query the count distinct , a specific rule will translate the query to
> > the
> > >> cache. It turns out right when the query has only one count (distinct
> )
> > >> operator ,but when it has two count (distinct ) ,it causes error .The
> > error
> > >> info is here:
> > >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> > >>
> > >>
> > >> Best Regards.
> > >>
> >
> >
>

Re: Drill query planning error

Posted by Aman Sinha <am...@apache.org>.

Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
pointing that out.   I'll see if we can leverage that.  The grouping sets
is not yet supported in Drill, but that would be a better solution since it
avoids extra scans.

-Aman

On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org> wrote:

> Aman,
>
> Thanks for moving dev@calcite to Bcc. This is properly a Drill question.
>
> A blanket restriction on cartesian joins is a blunt instrument. Sometimes
> cartesian joins are valid, safe, and the best plan for a query. This is a
> case in point. Users shouldn’t have to change config parameters to get it
> to work.
>
> (Actually I don’t know the query, but
>
>   select count(distinct deptno), count(distinct gender) from emp
>
> is equivalent.)
>
> Drill should detect that a relational expression can return at most one
> row, and allow a cartesian join if one side is such. Calcite has a
> RelMdMaxRowCount statistic for this. This was added as part of
> http://issues.apache.org/jira/browse/CALCITE-604 <
> http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100%
> safe. No config parameters required.
>
> Also, Calcite has an alternative way of handling multiple distinct
> aggregates that rewrites to use grouping sets. It doesn’t generate
> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>.
>
> Julian
>
>
>
>
>
>
> > On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org> wrote:
> >
> > [Since this is Drill specific, I put dev@calcite on BCC].
> >
> > If you have two aggregates: Count(distinct a), Count(distinct b), the
> > Calcite logical plan consists of a cartesian join of 2 subqueries each of
> > which first does a group-by on the distinct column followed by a count
> > aggregate.   By default,  Drill only processes cartesian join if one
> input
> > of the join is known to be scalar (single row).  It sounds like after you
> > did the transformation to use the cache, that scalar property somehow did
> > not get propagated.
> > You can override this behavior by a session configuration:  (this will
> use
> > a nested loop join even if inputs are not provably scalar, but it should
> be
> > used for specific query only).    For a more general solution, I believe
> > you may have to create an enhancement JIRA with appropriate details.
> >   'alter session set planner.enable_nljoin_for_scalar_only = false';
> >
> > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <to...@gmail.com>
> > wrote:
> >
> >> HI all:
> >>
> >>  I materialize the count distinct query result to a cache, then when
> user
> >> query the count distinct , a specific rule will translate the query to
> the
> >> cache. It turns out right when the query has only one count (distinct )
> >> operator ,but when it has two count (distinct ) ,it causes error .The
> error
> >> info is here:
> >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> >>
> >>
> >> Best Regards.
> >>
>
>

Re: Drill query planning error

Posted by Julian Hyde <jh...@apache.org>.

Aman,

Thanks for moving dev@calcite to Bcc. This is properly a Drill question.

A blanket restriction on cartesian joins is a blunt instrument. Sometimes cartesian joins are valid, safe, and the best plan for a query. This is a case in point. Users shouldn’t have to change config parameters to get it to work.

(Actually I don’t know the query, but

  select count(distinct deptno), count(distinct gender) from emp 

is equivalent.)

Drill should detect that a relational expression can return at most one row, and allow a cartesian join if one side is such. Calcite has a RelMdMaxRowCount statistic for this. This was added as part of http://issues.apache.org/jira/browse/CALCITE-604 <http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100% safe. No config parameters required.

Also, Calcite has an alternative way of handling multiple distinct aggregates that rewrites to use grouping sets. It doesn’t generate self-joins, cartesian or otherwise.  http://issues.apache.org/jira/browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>. 

Julian

> On Jul 26, 2017, at 9:20 AM, Aman Sinha <am...@apache.org> wrote:
> 
> [Since this is Drill specific, I put dev@calcite on BCC].
> 
> If you have two aggregates: Count(distinct a), Count(distinct b), the
> Calcite logical plan consists of a cartesian join of 2 subqueries each of
> which first does a group-by on the distinct column followed by a count
> aggregate.   By default,  Drill only processes cartesian join if one input
> of the join is known to be scalar (single row).  It sounds like after you
> did the transformation to use the cache, that scalar property somehow did
> not get propagated.
> You can override this behavior by a session configuration:  (this will use
> a nested loop join even if inputs are not provably scalar, but it should be
> used for specific query only).    For a more general solution, I believe
> you may have to create an enhancement JIRA with appropriate details.
>   'alter session set planner.enable_nljoin_for_scalar_only = false';
> 
> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <to...@gmail.com>
> wrote:
> 
>> HI all:
>> 
>>  I materialize the count distinct query result to a cache, then when user
>> query the count distinct , a specific rule will translate the query to the
>> cache. It turns out right when the query has only one count (distinct )
>> operator ,but when it has two count (distinct ) ,it causes error .The error
>> info is here:
>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>> 
>> 
>> Best Regards.
>>

Re: Drill query planning error

Posted by Aman Sinha <am...@apache.org>.

[Since this is Drill specific, I put dev@calcite on BCC].

If you have two aggregates: Count(distinct a), Count(distinct b), the
Calcite logical plan consists of a cartesian join of 2 subqueries each of
which first does a group-by on the distinct column followed by a count
aggregate.   By default,  Drill only processes cartesian join if one input
of the join is known to be scalar (single row).  It sounds like after you
did the transformation to use the cache, that scalar property somehow did
not get propagated.
You can override this behavior by a session configuration:  (this will use
a nested loop join even if inputs are not provably scalar, but it should be
used for specific query only).    For a more general solution, I believe
you may have to create an enhancement JIRA with appropriate details.
   'alter session set planner.enable_nljoin_for_scalar_only = false';

On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <to...@gmail.com>
wrote:

> HI all:
>
>   I materialize the count distinct query result to a cache, then when user
> query the count distinct , a specific rule will translate the query to the
> cache. It turns out right when the query has only one count (distinct )
> operator ,but when it has two count (distinct ) ,it causes error .The error
> info is here:
> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>
>
> Best Regards.
>

Re: Drill query planning error

Posted by Aman Sinha <am...@apache.org>.

[Since this is Drill specific, I put dev@calcite on BCC].

If you have two aggregates: Count(distinct a), Count(distinct b), the
Calcite logical plan consists of a cartesian join of 2 subqueries each of
which first does a group-by on the distinct column followed by a count
aggregate.   By default,  Drill only processes cartesian join if one input
of the join is known to be scalar (single row).  It sounds like after you
did the transformation to use the cache, that scalar property somehow did
not get propagated.
You can override this behavior by a session configuration:  (this will use
a nested loop join even if inputs are not provably scalar, but it should be
used for specific query only).    For a more general solution, I believe
you may have to create an enhancement JIRA with appropriate details.
   'alter session set planner.enable_nljoin_for_scalar_only = false';

On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <to...@gmail.com>
wrote:

> HI all:
>
>   I materialize the count distinct query result to a cache, then when user
> query the count distinct , a specific rule will translate the query to the
> cache. It turns out right when the query has only one count (distinct )
> operator ,but when it has two count (distinct ) ,it causes error .The error
> info is here:
> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>
>
> Best Regards.
>