You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@calcite.apache.org by hongbin ma <ma...@apache.org> on 2015/08/07 15:51:38 UTC

sum of integer overflow

hi,

Suppose I have a table column called "price", its data type is integer.
it seems that the sum aggregator in "select sum(price) from table" will
return integer type, too.

When I have millions of rows in the table,
"select sum(price) from table" might overflow, is it a bug?
Or may I how do you look into this problem?

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: sum of integer overflow

Posted by Li Yang <li...@apache.org>.

https://issues.apache.org/jira/browse/CALCITE-845  created

On Thu, Aug 13, 2015 at 4:23 AM, Julian Hyde <jh...@apache.org> wrote:

> OK, if you want this feature please log a JIRA case to track it.
>
> > On Aug 11, 2015, at 2:17 AM, Li Yang <li...@apache.org> wrote:
> >
> > A customizable policy is great since there's no standard.
> >
> > On Tue, Aug 11, 2015 at 5:54 AM, Jinfeng Ni <ji...@gmail.com>
> wrote:
> >
> >> I think adding a policy in Calcite to derive SUM's return type makes
> sense,
> >> as SQL standard seems not specify exactly what the return type of SUM
> >> should be. As a result, each system could choose to use different
> policy in
> >> the implementation.
> >>
> >> Oracle seems to use "the same data type as the numeric data type of the
> >> argument" [1], while DB2 uses the policy "The result is a large integer
> if
> >> the argument values are small integers" [2].
> >>
> >>
> >> [1] http://docs.oracle.com/database/121/SQLRF/functions196.htm#i89126
> >> [2]
> >>
> >>
> http://public.dhe.ibm.com/ps/products/db2/info/vr101/pdf/en_US/DB2SQLRefVol1-db2s1e1011.pdf
> >>
> >> On Mon, Aug 10, 2015 at 2:05 PM, Julian Hyde <jh...@apache.org> wrote:
> >>
> >>> Postgres return type is "bigint for smallint or int arguments, numeric
> >> for
> >>> bigint arguments, otherwise the same as the argument data type”[1]
> >>>
> >>> SQL Server return type is int for tinyint, smallint or int; bigint for
> >>> bigint[2].
> >>>
> >>> I can see your point that “User demands the correct sum result”. But
> I’d
> >>> also be pissed with Postgres if it returned a numeric (arbitrary
> >> precision)
> >>> result when I am summing a bigint value. So I don’t think we’re going
> to
> >>> please everyone.
> >>>
> >>> I think the solution is to add the policy to derive SUM’s return type
> to
> >>> as a new method to RelDataTypeSystem. Then Kylin can supply its own.
> >>>
> >>> Julian
> >>>
> >>> [1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html
> <
> >>> http://www.postgresql.org/docs/9.1/static/functions-aggregate.html>
> >>>
> >>> [2] https://msdn.microsoft.com/en-us/library/ms187810.aspx <
> >>> https://msdn.microsoft.com/en-us/library/ms187810.aspx>
> >>>
> >>>> On Aug 9, 2015, at 8:09 PM, Li Yang <li...@apache.org> wrote:
> >>>>
> >>>>> 1. If x is an integer, what is the type of sum(x)?
> >>>>
> >>>> This is the key question.  If calcite believes sum(int) = int, then
> >> Kylin
> >>>> have to find solution else where.  User demands the correct sum result
> >>>> anyway.  It's very hard to explain and justify the behavior to user,
> >>>> because other SQL engines like Postgres seem simply works...
> >>>>
> >>>> One workaround maybe let all initial values be bigint.
> >>>>
> >>>> On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:
> >>>>
> >>>>> I would approach it a different way. There are 2 separate questions:
> >>>>>
> >>>>> 1. If x is an integer, what is the type of sum(x)?
> >>>>>
> >>>>> This concerns SQL query validation. Calcite’s answer is that if x has
> >>> type
> >>>>> T, then sum(x) has type T. Not perfect, but simple. If x is an int
> and
> >>> you
> >>>>> want the result to be a bigint, just write sum(cast(x as bigint)).
> >>>>>
> >>>>> 2. Do we detect overflow while calculating sum, and if so, how?
> >>>>>
> >>>>> This is an implementation question, and needs to be solved in each
> >>> engine.
> >>>>> Drill is one such engine, and Enumerable is another. Enumerable does
> >> not
> >>>>> currently detect overflow.
> >>>>>
> >>>>> One strategy would be to use a  higher precision data type internally
> >>> (but
> >>>>> this strategy works only if you have an upper bound on the number of
> >>> input
> >>>>> rows). Another is to use a method such as
> java.lang.Math.addExact(int,
> >>> int).
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com>
> >> wrote:
> >>>>>>
> >>>>>> I think it makes sense to use bigint as the result type for
> >>> sum(integer).
> >>>>>>
> >>>>>> Postgres seems to work in this way.
> >>>>>>
> >>>>>> mydb=# \d+ emp
> >>>>>>                                Table "public.emp"
> >>>>>> Column  |         Type          | Modifiers | Storage  | Stats
> >> target |
> >>>>>> Description
> >>>>>>
> >>>>>
> >>>
> >>
> ----------+-----------------------+-----------+----------+--------------+-------------
> >>>>>> empno    | integer               |           | plain    |
> >>> |
> >>>>>> .....
> >>>>>>
> >>>>>> create table tmp as select sum(empno) sum_eno from emp;
> >>>>>>
> >>>>>> \d+ tmp;
> >>>>>>                       Table "public.tmp"
> >>>>>> Column  |  Type  | Modifiers | Storage | Stats target | Description
> >>>>>>
> ---------+--------+-----------+---------+--------------+-------------
> >>>>>> sum_eno | bigint |           | plain   |              |
> >>>>>>
> >>>>>>
> >>>>>> As we can see, the column sum_eno in 'tmp' table after the CTAS
> >>> statement
> >>>>>> has bigint type.
> >>>>>>
> >>>>>> In Drill, we also use bigint for sum(integer).  Drill has to put
> >>>>> additional
> >>>>>> logic, since Calcite by default will use int as the result type for
> >>>>>> sum(int).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org>
> >>> wrote:
> >>>>>>
> >>>>>>> hi,
> >>>>>>>
> >>>>>>> Suppose I have a table column called "price", its data type is
> >>> integer.
> >>>>>>> it seems that the sum aggregator in "select sum(price) from table"
> >>> will
> >>>>>>> return integer type, too.
> >>>>>>>
> >>>>>>> When I have millions of rows in the table,
> >>>>>>> "select sum(price) from table" might overflow, is it a bug?
> >>>>>>> Or may I how do you look into this problem?
> >>>>>>>
> >>>>>>> --
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> *Bin Mahone | 马洪宾*
> >>>>>>> Apache Kylin: http://kylin.io
> >>>>>>> Github: https://github.com/binmahone
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: sum of integer overflow

Posted by Julian Hyde <jh...@apache.org>.

OK, if you want this feature please log a JIRA case to track it.

> On Aug 11, 2015, at 2:17 AM, Li Yang <li...@apache.org> wrote:
> 
> A customizable policy is great since there's no standard.
> 
> On Tue, Aug 11, 2015 at 5:54 AM, Jinfeng Ni <ji...@gmail.com> wrote:
> 
>> I think adding a policy in Calcite to derive SUM's return type makes sense,
>> as SQL standard seems not specify exactly what the return type of SUM
>> should be. As a result, each system could choose to use different policy in
>> the implementation.
>> 
>> Oracle seems to use "the same data type as the numeric data type of the
>> argument" [1], while DB2 uses the policy "The result is a large integer if
>> the argument values are small integers" [2].
>> 
>> 
>> [1] http://docs.oracle.com/database/121/SQLRF/functions196.htm#i89126
>> [2]
>> 
>> http://public.dhe.ibm.com/ps/products/db2/info/vr101/pdf/en_US/DB2SQLRefVol1-db2s1e1011.pdf
>> 
>> On Mon, Aug 10, 2015 at 2:05 PM, Julian Hyde <jh...@apache.org> wrote:
>> 
>>> Postgres return type is "bigint for smallint or int arguments, numeric
>> for
>>> bigint arguments, otherwise the same as the argument data type”[1]
>>> 
>>> SQL Server return type is int for tinyint, smallint or int; bigint for
>>> bigint[2].
>>> 
>>> I can see your point that “User demands the correct sum result”. But I’d
>>> also be pissed with Postgres if it returned a numeric (arbitrary
>> precision)
>>> result when I am summing a bigint value. So I don’t think we’re going to
>>> please everyone.
>>> 
>>> I think the solution is to add the policy to derive SUM’s return type to
>>> as a new method to RelDataTypeSystem. Then Kylin can supply its own.
>>> 
>>> Julian
>>> 
>>> [1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html <
>>> http://www.postgresql.org/docs/9.1/static/functions-aggregate.html>
>>> 
>>> [2] https://msdn.microsoft.com/en-us/library/ms187810.aspx <
>>> https://msdn.microsoft.com/en-us/library/ms187810.aspx>
>>> 
>>>> On Aug 9, 2015, at 8:09 PM, Li Yang <li...@apache.org> wrote:
>>>> 
>>>>> 1. If x is an integer, what is the type of sum(x)?
>>>> 
>>>> This is the key question.  If calcite believes sum(int) = int, then
>> Kylin
>>>> have to find solution else where.  User demands the correct sum result
>>>> anyway.  It's very hard to explain and justify the behavior to user,
>>>> because other SQL engines like Postgres seem simply works...
>>>> 
>>>> One workaround maybe let all initial values be bigint.
>>>> 
>>>> On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:
>>>> 
>>>>> I would approach it a different way. There are 2 separate questions:
>>>>> 
>>>>> 1. If x is an integer, what is the type of sum(x)?
>>>>> 
>>>>> This concerns SQL query validation. Calcite’s answer is that if x has
>>> type
>>>>> T, then sum(x) has type T. Not perfect, but simple. If x is an int and
>>> you
>>>>> want the result to be a bigint, just write sum(cast(x as bigint)).
>>>>> 
>>>>> 2. Do we detect overflow while calculating sum, and if so, how?
>>>>> 
>>>>> This is an implementation question, and needs to be solved in each
>>> engine.
>>>>> Drill is one such engine, and Enumerable is another. Enumerable does
>> not
>>>>> currently detect overflow.
>>>>> 
>>>>> One strategy would be to use a  higher precision data type internally
>>> (but
>>>>> this strategy works only if you have an upper bound on the number of
>>> input
>>>>> rows). Another is to use a method such as java.lang.Math.addExact(int,
>>> int).
>>>>> 
>>>>> Julian
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com>
>> wrote:
>>>>>> 
>>>>>> I think it makes sense to use bigint as the result type for
>>> sum(integer).
>>>>>> 
>>>>>> Postgres seems to work in this way.
>>>>>> 
>>>>>> mydb=# \d+ emp
>>>>>>                                Table "public.emp"
>>>>>> Column  |         Type          | Modifiers | Storage  | Stats
>> target |
>>>>>> Description
>>>>>> 
>>>>> 
>>> 
>> ----------+-----------------------+-----------+----------+--------------+-------------
>>>>>> empno    | integer               |           | plain    |
>>> |
>>>>>> .....
>>>>>> 
>>>>>> create table tmp as select sum(empno) sum_eno from emp;
>>>>>> 
>>>>>> \d+ tmp;
>>>>>>                       Table "public.tmp"
>>>>>> Column  |  Type  | Modifiers | Storage | Stats target | Description
>>>>>> ---------+--------+-----------+---------+--------------+-------------
>>>>>> sum_eno | bigint |           | plain   |              |
>>>>>> 
>>>>>> 
>>>>>> As we can see, the column sum_eno in 'tmp' table after the CTAS
>>> statement
>>>>>> has bigint type.
>>>>>> 
>>>>>> In Drill, we also use bigint for sum(integer).  Drill has to put
>>>>> additional
>>>>>> logic, since Calcite by default will use int as the result type for
>>>>>> sum(int).
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org>
>>> wrote:
>>>>>> 
>>>>>>> hi,
>>>>>>> 
>>>>>>> Suppose I have a table column called "price", its data type is
>>> integer.
>>>>>>> it seems that the sum aggregator in "select sum(price) from table"
>>> will
>>>>>>> return integer type, too.
>>>>>>> 
>>>>>>> When I have millions of rows in the table,
>>>>>>> "select sum(price) from table" might overflow, is it a bug?
>>>>>>> Or may I how do you look into this problem?
>>>>>>> 
>>>>>>> --
>>>>>>> Regards,
>>>>>>> 
>>>>>>> *Bin Mahone | 马洪宾*
>>>>>>> Apache Kylin: http://kylin.io
>>>>>>> Github: https://github.com/binmahone
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>>

Re: sum of integer overflow

Posted by Li Yang <li...@apache.org>.

A customizable policy is great since there's no standard.

On Tue, Aug 11, 2015 at 5:54 AM, Jinfeng Ni <ji...@gmail.com> wrote:

> I think adding a policy in Calcite to derive SUM's return type makes sense,
> as SQL standard seems not specify exactly what the return type of SUM
> should be. As a result, each system could choose to use different policy in
> the implementation.
>
> Oracle seems to use "the same data type as the numeric data type of the
> argument" [1], while DB2 uses the policy "The result is a large integer if
> the argument values are small integers" [2].
>
>
> [1] http://docs.oracle.com/database/121/SQLRF/functions196.htm#i89126
> [2]
>
> http://public.dhe.ibm.com/ps/products/db2/info/vr101/pdf/en_US/DB2SQLRefVol1-db2s1e1011.pdf
>
> On Mon, Aug 10, 2015 at 2:05 PM, Julian Hyde <jh...@apache.org> wrote:
>
> > Postgres return type is "bigint for smallint or int arguments, numeric
> for
> > bigint arguments, otherwise the same as the argument data type”[1]
> >
> > SQL Server return type is int for tinyint, smallint or int; bigint for
> > bigint[2].
> >
> > I can see your point that “User demands the correct sum result”. But I’d
> > also be pissed with Postgres if it returned a numeric (arbitrary
> precision)
> > result when I am summing a bigint value. So I don’t think we’re going to
> > please everyone.
> >
> > I think the solution is to add the policy to derive SUM’s return type to
> > as a new method to RelDataTypeSystem. Then Kylin can supply its own.
> >
> > Julian
> >
> > [1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html <
> > http://www.postgresql.org/docs/9.1/static/functions-aggregate.html>
> >
> > [2] https://msdn.microsoft.com/en-us/library/ms187810.aspx <
> > https://msdn.microsoft.com/en-us/library/ms187810.aspx>
> >
> > > On Aug 9, 2015, at 8:09 PM, Li Yang <li...@apache.org> wrote:
> > >
> > >> 1. If x is an integer, what is the type of sum(x)?
> > >
> > > This is the key question.  If calcite believes sum(int) = int, then
> Kylin
> > > have to find solution else where.  User demands the correct sum result
> > > anyway.  It's very hard to explain and justify the behavior to user,
> > > because other SQL engines like Postgres seem simply works...
> > >
> > > One workaround maybe let all initial values be bigint.
> > >
> > > On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:
> > >
> > >> I would approach it a different way. There are 2 separate questions:
> > >>
> > >> 1. If x is an integer, what is the type of sum(x)?
> > >>
> > >> This concerns SQL query validation. Calcite’s answer is that if x has
> > type
> > >> T, then sum(x) has type T. Not perfect, but simple. If x is an int and
> > you
> > >> want the result to be a bigint, just write sum(cast(x as bigint)).
> > >>
> > >> 2. Do we detect overflow while calculating sum, and if so, how?
> > >>
> > >> This is an implementation question, and needs to be solved in each
> > engine.
> > >> Drill is one such engine, and Enumerable is another. Enumerable does
> not
> > >> currently detect overflow.
> > >>
> > >> One strategy would be to use a  higher precision data type internally
> > (but
> > >> this strategy works only if you have an upper bound on the number of
> > input
> > >> rows). Another is to use a method such as java.lang.Math.addExact(int,
> > int).
> > >>
> > >> Julian
> > >>
> > >>
> > >>
> > >>> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com>
> wrote:
> > >>>
> > >>> I think it makes sense to use bigint as the result type for
> > sum(integer).
> > >>>
> > >>> Postgres seems to work in this way.
> > >>>
> > >>> mydb=# \d+ emp
> > >>>                                 Table "public.emp"
> > >>> Column  |         Type          | Modifiers | Storage  | Stats
> target |
> > >>> Description
> > >>>
> > >>
> >
> ----------+-----------------------+-----------+----------+--------------+-------------
> > >>> empno    | integer               |           | plain    |
> > |
> > >>> .....
> > >>>
> > >>> create table tmp as select sum(empno) sum_eno from emp;
> > >>>
> > >>> \d+ tmp;
> > >>>                        Table "public.tmp"
> > >>> Column  |  Type  | Modifiers | Storage | Stats target | Description
> > >>> ---------+--------+-----------+---------+--------------+-------------
> > >>> sum_eno | bigint |           | plain   |              |
> > >>>
> > >>>
> > >>> As we can see, the column sum_eno in 'tmp' table after the CTAS
> > statement
> > >>> has bigint type.
> > >>>
> > >>> In Drill, we also use bigint for sum(integer).  Drill has to put
> > >> additional
> > >>> logic, since Calcite by default will use int as the result type for
> > >>> sum(int).
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org>
> > wrote:
> > >>>
> > >>>> hi,
> > >>>>
> > >>>> Suppose I have a table column called "price", its data type is
> > integer.
> > >>>> it seems that the sum aggregator in "select sum(price) from table"
> > will
> > >>>> return integer type, too.
> > >>>>
> > >>>> When I have millions of rows in the table,
> > >>>> "select sum(price) from table" might overflow, is it a bug?
> > >>>> Or may I how do you look into this problem?
> > >>>>
> > >>>> --
> > >>>> Regards,
> > >>>>
> > >>>> *Bin Mahone | 马洪宾*
> > >>>> Apache Kylin: http://kylin.io
> > >>>> Github: https://github.com/binmahone
> > >>>>
> > >>
> > >>
> >
> >
>

Re: sum of integer overflow

Posted by Jinfeng Ni <ji...@gmail.com>.

I think adding a policy in Calcite to derive SUM's return type makes sense,
as SQL standard seems not specify exactly what the return type of SUM
should be. As a result, each system could choose to use different policy in
the implementation.

Oracle seems to use "the same data type as the numeric data type of the
argument" [1], while DB2 uses the policy "The result is a large integer if
the argument values are small integers" [2].


[1] http://docs.oracle.com/database/121/SQLRF/functions196.htm#i89126
[2]
http://public.dhe.ibm.com/ps/products/db2/info/vr101/pdf/en_US/DB2SQLRefVol1-db2s1e1011.pdf

On Mon, Aug 10, 2015 at 2:05 PM, Julian Hyde <jh...@apache.org> wrote:

> Postgres return type is "bigint for smallint or int arguments, numeric for
> bigint arguments, otherwise the same as the argument data type”[1]
>
> SQL Server return type is int for tinyint, smallint or int; bigint for
> bigint[2].
>
> I can see your point that “User demands the correct sum result”. But I’d
> also be pissed with Postgres if it returned a numeric (arbitrary precision)
> result when I am summing a bigint value. So I don’t think we’re going to
> please everyone.
>
> I think the solution is to add the policy to derive SUM’s return type to
> as a new method to RelDataTypeSystem. Then Kylin can supply its own.
>
> Julian
>
> [1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html <
> http://www.postgresql.org/docs/9.1/static/functions-aggregate.html>
>
> [2] https://msdn.microsoft.com/en-us/library/ms187810.aspx <
> https://msdn.microsoft.com/en-us/library/ms187810.aspx>
>
> > On Aug 9, 2015, at 8:09 PM, Li Yang <li...@apache.org> wrote:
> >
> >> 1. If x is an integer, what is the type of sum(x)?
> >
> > This is the key question.  If calcite believes sum(int) = int, then Kylin
> > have to find solution else where.  User demands the correct sum result
> > anyway.  It's very hard to explain and justify the behavior to user,
> > because other SQL engines like Postgres seem simply works...
> >
> > One workaround maybe let all initial values be bigint.
> >
> > On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:
> >
> >> I would approach it a different way. There are 2 separate questions:
> >>
> >> 1. If x is an integer, what is the type of sum(x)?
> >>
> >> This concerns SQL query validation. Calcite’s answer is that if x has
> type
> >> T, then sum(x) has type T. Not perfect, but simple. If x is an int and
> you
> >> want the result to be a bigint, just write sum(cast(x as bigint)).
> >>
> >> 2. Do we detect overflow while calculating sum, and if so, how?
> >>
> >> This is an implementation question, and needs to be solved in each
> engine.
> >> Drill is one such engine, and Enumerable is another. Enumerable does not
> >> currently detect overflow.
> >>
> >> One strategy would be to use a  higher precision data type internally
> (but
> >> this strategy works only if you have an upper bound on the number of
> input
> >> rows). Another is to use a method such as java.lang.Math.addExact(int,
> int).
> >>
> >> Julian
> >>
> >>
> >>
> >>> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com> wrote:
> >>>
> >>> I think it makes sense to use bigint as the result type for
> sum(integer).
> >>>
> >>> Postgres seems to work in this way.
> >>>
> >>> mydb=# \d+ emp
> >>>                                 Table "public.emp"
> >>> Column  |         Type          | Modifiers | Storage  | Stats target |
> >>> Description
> >>>
> >>
> ----------+-----------------------+-----------+----------+--------------+-------------
> >>> empno    | integer               |           | plain    |
> |
> >>> .....
> >>>
> >>> create table tmp as select sum(empno) sum_eno from emp;
> >>>
> >>> \d+ tmp;
> >>>                        Table "public.tmp"
> >>> Column  |  Type  | Modifiers | Storage | Stats target | Description
> >>> ---------+--------+-----------+---------+--------------+-------------
> >>> sum_eno | bigint |           | plain   |              |
> >>>
> >>>
> >>> As we can see, the column sum_eno in 'tmp' table after the CTAS
> statement
> >>> has bigint type.
> >>>
> >>> In Drill, we also use bigint for sum(integer).  Drill has to put
> >> additional
> >>> logic, since Calcite by default will use int as the result type for
> >>> sum(int).
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org>
> wrote:
> >>>
> >>>> hi,
> >>>>
> >>>> Suppose I have a table column called "price", its data type is
> integer.
> >>>> it seems that the sum aggregator in "select sum(price) from table"
> will
> >>>> return integer type, too.
> >>>>
> >>>> When I have millions of rows in the table,
> >>>> "select sum(price) from table" might overflow, is it a bug?
> >>>> Or may I how do you look into this problem?
> >>>>
> >>>> --
> >>>> Regards,
> >>>>
> >>>> *Bin Mahone | 马洪宾*
> >>>> Apache Kylin: http://kylin.io
> >>>> Github: https://github.com/binmahone
> >>>>
> >>
> >>
>
>

Re: sum of integer overflow

Posted by Julian Hyde <jh...@apache.org>.

Postgres return type is "bigint for smallint or int arguments, numeric for bigint arguments, otherwise the same as the argument data type”[1]

SQL Server return type is int for tinyint, smallint or int; bigint for bigint[2].

I can see your point that “User demands the correct sum result”. But I’d also be pissed with Postgres if it returned a numeric (arbitrary precision) result when I am summing a bigint value. So I don’t think we’re going to please everyone.

I think the solution is to add the policy to derive SUM’s return type to as a new method to RelDataTypeSystem. Then Kylin can supply its own.

Julian

[1] http://www.postgresql.org/docs/9.1/static/functions-aggregate.html <http://www.postgresql.org/docs/9.1/static/functions-aggregate.html>

[2] https://msdn.microsoft.com/en-us/library/ms187810.aspx <https://msdn.microsoft.com/en-us/library/ms187810.aspx>

> On Aug 9, 2015, at 8:09 PM, Li Yang <li...@apache.org> wrote:
> 
>> 1. If x is an integer, what is the type of sum(x)?
> 
> This is the key question.  If calcite believes sum(int) = int, then Kylin
> have to find solution else where.  User demands the correct sum result
> anyway.  It's very hard to explain and justify the behavior to user,
> because other SQL engines like Postgres seem simply works...
> 
> One workaround maybe let all initial values be bigint.
> 
> On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:
> 
>> I would approach it a different way. There are 2 separate questions:
>> 
>> 1. If x is an integer, what is the type of sum(x)?
>> 
>> This concerns SQL query validation. Calcite’s answer is that if x has type
>> T, then sum(x) has type T. Not perfect, but simple. If x is an int and you
>> want the result to be a bigint, just write sum(cast(x as bigint)).
>> 
>> 2. Do we detect overflow while calculating sum, and if so, how?
>> 
>> This is an implementation question, and needs to be solved in each engine.
>> Drill is one such engine, and Enumerable is another. Enumerable does not
>> currently detect overflow.
>> 
>> One strategy would be to use a  higher precision data type internally (but
>> this strategy works only if you have an upper bound on the number of input
>> rows). Another is to use a method such as java.lang.Math.addExact(int, int).
>> 
>> Julian
>> 
>> 
>> 
>>> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com> wrote:
>>> 
>>> I think it makes sense to use bigint as the result type for sum(integer).
>>> 
>>> Postgres seems to work in this way.
>>> 
>>> mydb=# \d+ emp
>>>                                 Table "public.emp"
>>> Column  |         Type          | Modifiers | Storage  | Stats target |
>>> Description
>>> 
>> ----------+-----------------------+-----------+----------+--------------+-------------
>>> empno    | integer               |           | plain    |              |
>>> .....
>>> 
>>> create table tmp as select sum(empno) sum_eno from emp;
>>> 
>>> \d+ tmp;
>>>                        Table "public.tmp"
>>> Column  |  Type  | Modifiers | Storage | Stats target | Description
>>> ---------+--------+-----------+---------+--------------+-------------
>>> sum_eno | bigint |           | plain   |              |
>>> 
>>> 
>>> As we can see, the column sum_eno in 'tmp' table after the CTAS statement
>>> has bigint type.
>>> 
>>> In Drill, we also use bigint for sum(integer).  Drill has to put
>> additional
>>> logic, since Calcite by default will use int as the result type for
>>> sum(int).
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org> wrote:
>>> 
>>>> hi,
>>>> 
>>>> Suppose I have a table column called "price", its data type is integer.
>>>> it seems that the sum aggregator in "select sum(price) from table" will
>>>> return integer type, too.
>>>> 
>>>> When I have millions of rows in the table,
>>>> "select sum(price) from table" might overflow, is it a bug?
>>>> Or may I how do you look into this problem?
>>>> 
>>>> --
>>>> Regards,
>>>> 
>>>> *Bin Mahone | 马洪宾*
>>>> Apache Kylin: http://kylin.io
>>>> Github: https://github.com/binmahone
>>>> 
>> 
>>

Re: sum of integer overflow

Posted by Li Yang <li...@apache.org>.

> 1. If x is an integer, what is the type of sum(x)?

This is the key question.  If calcite believes sum(int) = int, then Kylin
have to find solution else where.  User demands the correct sum result
anyway.  It's very hard to explain and justify the behavior to user,
because other SQL engines like Postgres seem simply works...

One workaround maybe let all initial values be bigint.

On Sat, Aug 8, 2015 at 9:02 AM, Julian Hyde <jh...@apache.org> wrote:

> I would approach it a different way. There are 2 separate questions:
>
> 1. If x is an integer, what is the type of sum(x)?
>
> This concerns SQL query validation. Calcite’s answer is that if x has type
> T, then sum(x) has type T. Not perfect, but simple. If x is an int and you
> want the result to be a bigint, just write sum(cast(x as bigint)).
>
> 2. Do we detect overflow while calculating sum, and if so, how?
>
> This is an implementation question, and needs to be solved in each engine.
> Drill is one such engine, and Enumerable is another. Enumerable does not
> currently detect overflow.
>
> One strategy would be to use a  higher precision data type internally (but
> this strategy works only if you have an upper bound on the number of input
> rows). Another is to use a method such as java.lang.Math.addExact(int, int).
>
> Julian
>
>
>
> > On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com> wrote:
> >
> > I think it makes sense to use bigint as the result type for sum(integer).
> >
> > Postgres seems to work in this way.
> >
> > mydb=# \d+ emp
> >                                  Table "public.emp"
> >  Column  |         Type          | Modifiers | Storage  | Stats target |
> > Description
> >
> ----------+-----------------------+-----------+----------+--------------+-------------
> > empno    | integer               |           | plain    |              |
> > .....
> >
> > create table tmp as select sum(empno) sum_eno from emp;
> >
> > \d+ tmp;
> >                         Table "public.tmp"
> > Column  |  Type  | Modifiers | Storage | Stats target | Description
> > ---------+--------+-----------+---------+--------------+-------------
> > sum_eno | bigint |           | plain   |              |
> >
> >
> > As we can see, the column sum_eno in 'tmp' table after the CTAS statement
> > has bigint type.
> >
> > In Drill, we also use bigint for sum(integer).  Drill has to put
> additional
> > logic, since Calcite by default will use int as the result type for
> > sum(int).
> >
> >
> >
> >
> > On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org> wrote:
> >
> >> hi,
> >>
> >> Suppose I have a table column called "price", its data type is integer.
> >> it seems that the sum aggregator in "select sum(price) from table" will
> >> return integer type, too.
> >>
> >> When I have millions of rows in the table,
> >> "select sum(price) from table" might overflow, is it a bug?
> >> Or may I how do you look into this problem?
> >>
> >> --
> >> Regards,
> >>
> >> *Bin Mahone | 马洪宾*
> >> Apache Kylin: http://kylin.io
> >> Github: https://github.com/binmahone
> >>
>
>

Re: sum of integer overflow

Posted by Julian Hyde <jh...@apache.org>.

I would approach it a different way. There are 2 separate questions:

1. If x is an integer, what is the type of sum(x)?

This concerns SQL query validation. Calcite’s answer is that if x has type T, then sum(x) has type T. Not perfect, but simple. If x is an int and you want the result to be a bigint, just write sum(cast(x as bigint)).

2. Do we detect overflow while calculating sum, and if so, how?

This is an implementation question, and needs to be solved in each engine. Drill is one such engine, and Enumerable is another. Enumerable does not currently detect overflow.

One strategy would be to use a  higher precision data type internally (but this strategy works only if you have an upper bound on the number of input rows). Another is to use a method such as java.lang.Math.addExact(int, int).

Julian

> On Aug 7, 2015, at 10:38 AM, Jinfeng Ni <ji...@gmail.com> wrote:
> 
> I think it makes sense to use bigint as the result type for sum(integer).
> 
> Postgres seems to work in this way.
> 
> mydb=# \d+ emp
>                                  Table "public.emp"
>  Column  |         Type          | Modifiers | Storage  | Stats target |
> Description
> ----------+-----------------------+-----------+----------+--------------+-------------
> empno    | integer               |           | plain    |              |
> .....
> 
> create table tmp as select sum(empno) sum_eno from emp;
> 
> \d+ tmp;
>                         Table "public.tmp"
> Column  |  Type  | Modifiers | Storage | Stats target | Description
> ---------+--------+-----------+---------+--------------+-------------
> sum_eno | bigint |           | plain   |              |
> 
> 
> As we can see, the column sum_eno in 'tmp' table after the CTAS statement
> has bigint type.
> 
> In Drill, we also use bigint for sum(integer).  Drill has to put additional
> logic, since Calcite by default will use int as the result type for
> sum(int).
> 
> 
> 
> 
> On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org> wrote:
> 
>> hi,
>> 
>> Suppose I have a table column called "price", its data type is integer.
>> it seems that the sum aggregator in "select sum(price) from table" will
>> return integer type, too.
>> 
>> When I have millions of rows in the table,
>> "select sum(price) from table" might overflow, is it a bug?
>> Or may I how do you look into this problem?
>> 
>> --
>> Regards,
>> 
>> *Bin Mahone | 马洪宾*
>> Apache Kylin: http://kylin.io
>> Github: https://github.com/binmahone
>>

Re: sum of integer overflow

Posted by Jinfeng Ni <ji...@gmail.com>.

I think it makes sense to use bigint as the result type for sum(integer).

Postgres seems to work in this way.

mydb=# \d+ emp
                                  Table "public.emp"
  Column  |         Type          | Modifiers | Storage  | Stats target |
Description
----------+-----------------------+-----------+----------+--------------+-------------
 empno    | integer               |           | plain    |              |
.....

create table tmp as select sum(empno) sum_eno from emp;

\d+ tmp;
                         Table "public.tmp"
 Column  |  Type  | Modifiers | Storage | Stats target | Description
---------+--------+-----------+---------+--------------+-------------
 sum_eno | bigint |           | plain   |              |


As we can see, the column sum_eno in 'tmp' table after the CTAS statement
has bigint type.

In Drill, we also use bigint for sum(integer).  Drill has to put additional
logic, since Calcite by default will use int as the result type for
sum(int).




On Fri, Aug 7, 2015 at 6:51 AM, hongbin ma <ma...@apache.org> wrote:

> hi,
>
> Suppose I have a table column called "price", its data type is integer.
> it seems that the sum aggregator in "select sum(price) from table" will
> return integer type, too.
>
> When I have millions of rows in the table,
> "select sum(price) from table" might overflow, is it a bug?
> Or may I how do you look into this problem?
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>