
Posted to dev@calcite.apache.org by Krzysztof Zarzycki <k....@gmail.com> on 2019/06/26 13:52:41 UTC

Modify Calcite Planner in Hive to remove GROUP BY <constant>

Hello,

While my question might look like it concerns Hive, I believe it is
more about Calcite. I need to add a Calcite planner rule to Hive that removes
the "GROUP BY" clause when it groups by a constant value (GROUP BY TRUE, to be
precise). As far as I can tell, the query is semantically the same without it.
Could anyone on this mailing list help me do this properly? While I'm
an experienced Java engineer, I have no clue how to achieve this.
I tried to modify the Hive code to do this myself, but unfortunately I got
only NullPointerExceptions.


More context below:
I want to use JdbcStorageHandler in Hive to connect to Apache Kylin and
forward queries there. Then I put Tableau on top of Hive. Unfortunately,
the queries that Tableau sends to Hive, and that the Calcite planner then
regenerates for Kylin, cannot be handled by Kylin (which, BTW, uses Calcite
as well). I disabled some of the Hive optimizations, which fixed some of my
queries, but I'm stuck on one I cannot disable. Tableau generates a query
with "GROUP BY 1.000000...01", which Hive/Calcite translates to
"GROUP BY TRUE". Kylin can handle neither of those. My idea was to remove
the GROUP BY completely because, as I understand it, it is unnecessary.

I will be very grateful for your help,
Kind Regards,
Krzysztof

Re: Modify Calcite Planner in Hive to remove GROUP BY <constant>

Posted by Krzysztof Zarzycki <k....@gmail.com>.

Thanks for your explanation, it helps a lot!
I was badly mistaken in thinking that the result should be the same.
Possibly, then, Tableau adds "group by <constant>" intentionally, to receive
zero rows.

But that means it is an even bigger bug/omission that Apache Kylin does not
handle grouping by a constant. So I'm afraid I cannot do anything at the
Calcite level (like a rewrite); I need to work on Kylin. (Or does someone
have a different idea?)
I will raise an issue in the Kylin Jira then.

Krzysztof




Re: Modify Calcite Planner in Hive to remove GROUP BY <constant>

Posted by Vineet Garg <vg...@apache.org>.

Hi Julian,

You are right, it should produce zero rows, not NULL. Thanks for the
correction.

Vineet



Re: Modify Calcite Planner in Hive to remove GROUP BY <constant>

Posted by Julian Hyde <jh...@apache.org>.

> Select count(*) from empty_table group by <constant> will produce NULL

Really? I thought it should produce zero rows.

Hsqldb:

> select count(*) from "foodmart"."days" where false group by true;
+-----------------+
|       C1        |
+-----------------+
+-----------------+
No rows selected (0.001 seconds)


Julian
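The same difference can be reproduced without Hive, Kylin, or Hsqldb. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in engine and a hypothetical empty table named empty_table. A string constant is used as the grouping key because SQLite reads a bare integer in GROUP BY as a column position rather than as a constant.

```python
import sqlite3


def run_on_empty_table(sql):
    """Run `sql` against a fresh in-memory database holding one empty table."""
    conn = sqlite3.connect(":memory:")
    try:
        # empty_table is a hypothetical name used only for this illustration.
        conn.execute("CREATE TABLE empty_table (x INTEGER)")
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Grouped aggregate: no input rows means no groups, so no output rows.
with_group_by = run_on_empty_table(
    "SELECT count(*) FROM empty_table GROUP BY 'k'"
)

# Ungrouped aggregate: always exactly one output row, here holding 0.
without_group_by = run_on_empty_table("SELECT count(*) FROM empty_table")

print("with GROUP BY <constant>:", with_group_by)
print("without GROUP BY:", without_group_by)
```

The two result sets differ in row count, not just in value, which is why dropping the GROUP BY is not a safe rewrite when the constant is the only key.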



Re: Modify Calcite Planner in Hive to remove GROUP BY <constant>

Posted by Vineet Garg <vg...@apache.org>.

Hello Krzysztof,

The rewrite you mention in Hive was done in HIVE-19674
<https://issues.apache.org/jira/browse/HIVE-19674> to make it possible to
push such a GROUP BY down to Druid. Currently there is no way to disable
this rewrite.

As for removing GROUP BY <constant>: there are rules/rewrites that can
reduce the grouping keys by removing constants, but removing the whole
GROUP BY is not safe, since it can lead to a semantically different query.
e.g. Select count(*) from empty_table group by <constant> will produce NULL
but Select count(*) from empty_table will produce 0.

P.S. There was a bug in the HIVE-19674 patch, which was later fixed by
HIVE-21539 <https://issues.apache.org/jira/browse/HIVE-21539>.

Regards,
Vineet Garg


Re: Modify Calcite Planner in Hive to remove GROUP BY <constant>

Posted by Haisheng Yuan <h....@alibaba-inc.com>.

Calcite has a rule that does this work. But you can't remove the GROUP BY clause if the constant is the only group key: the semantics are different without a group key. Try it on an empty relation and you will see the difference.

Thanks~
Haisheng Yuan
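The constraint described in this thread (constant keys may be dropped, but never the last remaining key) can be illustrated with a toy model. This is a sketch only: it is plain Python over strings, not Calcite's API, and the function name, the CONSTANTS set, and the key names are all hypothetical. The rule referred to is probably Calcite's AggregateProjectPullUpConstantsRule, but the thread does not name it, so treat that as an assumption.

```python
def prune_constant_keys(group_keys, is_constant):
    """Drop constant grouping keys, but never leave the key list empty.

    An aggregate with no keys returns one row even on empty input, while an
    aggregate with at least one key returns zero rows, so the last key must
    stay even if it is a constant.
    """
    kept = [key for key in group_keys if not is_constant(key)]
    if not kept and group_keys:
        # Every key was constant: keep one so the query stays "grouped".
        kept = [group_keys[0]]
    return kept


# Hypothetical notion of "constant" for this illustration.
CONSTANTS = {"TRUE", "1", "'k'"}


def is_const(key):
    return key in CONSTANTS


mixed = prune_constant_keys(["dept", "TRUE"], is_const)
only_constant = prune_constant_keys(["TRUE"], is_const)
all_constants = prune_constant_keys(["TRUE", "1"], is_const)

print(mixed)
print(only_constant)
print(all_constants)
```

In the mixed case the constant is removed safely; in the other two cases one constant key survives, which matches the "GROUP BY TRUE" that Kylin receives.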