Posted to dev@hive.apache.org by Remus Rusanu <re...@microsoft.com> on 2014/02/14 12:55:27 UTC

A question about the derivation of intermediate sum field for decimal average aggregates

Hi,

With HIVE-5872 the intermediate sum field for decimal aggregates was changed to increase the scale by 4. I understand the reasoning for having an accurate precision/scale for the aggregate output. However, for the intermediate sum field of AVG, I believe we should increase the precision without increasing the scale. The sum can grow large, but it cannot gain digits in the fractional part, so we should increase the precision of the sum, but not its scale. When the sum is divided by the count to produce the average on the reduce side, we should indeed project the value with a higher scale.
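
A quick sketch of the point in Java with BigDecimal (the decimal(5,2) column and the sample values here are made up purely for illustration): adding values never grows the fractional part, only the integer part.

    import java.math.BigDecimal;

    public class SumScaleDemo {
        public static void main(String[] args) {
            // Three values from a hypothetical decimal(5,2) column.
            BigDecimal[] rows = {
                new BigDecimal("999.99"),
                new BigDecimal("999.99"),
                new BigDecimal("999.99")
            };

            BigDecimal sum = BigDecimal.ZERO;
            for (BigDecimal v : rows) {
                sum = sum.add(v);   // addition never adds fractional digits
            }

            // Prints "2999.97  scale=2": the scale is still 2, but the integer
            // part now needs 4 digits instead of 3 -- the sum needs more
            // precision, not more scale.
            System.out.println(sum + "  scale=" + sum.scale());
        }
    }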

Opinions?

Thanks,
~Remus


Re: A question about the derivation of intermediate sum field for decimal average aggregates

Posted by Xuefu Zhang <xz...@cloudera.com>.
An increase of 10 was chosen in Hive. While rather arbitrary, it seems
reasonable. The problem with going all the way to the max is the lack of
room to grow the fractional part in subsequent operations. For instance,
if sum(a) has type (38, 2), then the expression sum(a)/count(*) can only
have precision/scale (38, 2), which may not be an accurate average.
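
As a rough illustration, here is the kind of division typing rule involved,
sketched in Java with a 38-digit cap. This is an approximation for the sake
of the example, not the exact formula Hive implements: once the dividend is
already at (38, 2), the clamp leaves the quotient no room to gain fractional
digits.

    public class DivideTypeDemo {
        static final int MAX_PRECISION = 38;

        // Approximate decimal division typing: the quotient's scale wants to
        // grow to max(6, s1 + p2 + 1), but the total number of digits is
        // clamped to MAX_PRECISION, giving up fractional digits first.
        static int[] divideType(int p1, int s1, int p2, int s2) {
            int scale = Math.max(6, s1 + p2 + 1);
            int intDigits = (p1 - s1) + s2;   // integer digits the quotient may need
            int precision = intDigits + scale;
            if (precision > MAX_PRECISION) {
                scale = Math.max(0, scale - (precision - MAX_PRECISION));
                precision = MAX_PRECISION;
            }
            return new int[] {precision, scale};
        }

        public static void main(String[] args) {
            // sum(a) already at the maximum (38,2); count(*) treated as
            // roughly decimal(19,0) for the sketch.
            int[] t = divideType(38, 2, 19, 0);
            // Prints "avg type: (38,2)" -- the average cannot gain any
            // fractional digits beyond the original scale of 2.
            System.out.println("avg type: (" + t[0] + "," + t[1] + ")");
        }
    }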

I'm not sure what other databases do, but MySQL doesn't go all the way
to the max, and there is no standard on this.

--Xuefu


RE: A question about the derivation of intermediate sum field for decimal average aggregates

Posted by Remus Rusanu <re...@microsoft.com>.
Sure, sounds good. But why +10, why not extend to the max? Use p=38 for the intermediate sum, with whatever the original scale is. This way, if it does overflow, we at least made a best effort to prevent it.


Re: A question about the derivation of intermediate sum field for decimal average aggregates

Posted by Xuefu Zhang <xz...@cloudera.com>.
Hi Remus,

Thank you for the info. I now realize that it might be good to address
this, as the object inspector for the sum field might be problematic: it
has the same number of integer digits as the summed field, so it may
overflow easily. What if we give the sum field a precision/scale of
(p+10, s), which is consistent with the sum UDF, and use (p+4, s+4) for
the avg result only?
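
In other words, something along these lines (a sketch only; the helper
names and the way the 38-digit cap is applied here are mine, for
illustration):

    public class AvgTypesDemo {
        static final int MAX_PRECISION = 38;

        // Proposed typing for avg over a decimal(p, s) column:
        //   intermediate sum field : (p + 10, s)  -- consistent with the sum UDF
        //   final avg result       : (p + 4, s + 4)
        static String sumFieldType(int p, int s) {
            return "decimal(" + Math.min(MAX_PRECISION, p + 10) + "," + s + ")";
        }

        static String avgResultType(int p, int s) {
            return "decimal(" + Math.min(MAX_PRECISION, p + 4) + "," + (s + 4) + ")";
        }

        public static void main(String[] args) {
            // For a decimal(10,2) column:
            System.out.println("sum field : " + sumFieldType(10, 2));   // decimal(20,2)
            System.out.println("avg result: " + avgResultType(10, 2));  // decimal(14,6)
        }
    }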

If you agree, I can make the change.

Thanks,
Xuefu


RE: A question about the derivation of intermediate sum field for decimal average aggregates

Posted by Remus Rusanu <re...@microsoft.com>.
Hi Xuefu,

I do not have any particular use case in mind. I noticed the problem when I implemented the vectorized AVG for decimal, which must match your implementation (since we vectorized only the map-side operator, it had better produce the output expected by the reduce side). I thought that since we alter the precision/scale for the result, we might as well alter it for the intermediate sum field. But if this complicates the use of object inspectors and introduces maintenance risks, then it is probably not worth it.
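
For context, the shape the two sides have to agree on is roughly a (count, sum) partial that the map side emits and the reduce side merges and divides. The sketch below is a simplified stand-in for that, not the actual GenericUDAFAverage or vectorized code; it just shows why the decimal type of the intermediate sum field matters on both sides, while extra scale only helps at the final division.

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public class AvgPartialDemo {
        // Simplified stand-in for the AVG partial aggregation buffer: the map
        // side emits (count, sum); the reduce side merges and divides.
        static class Partial {
            long count = 0;
            BigDecimal sum = BigDecimal.ZERO;  // the field whose decimal type
                                               // map and reduce must agree on
            void iterate(BigDecimal v) { count++; sum = sum.add(v); }
            void merge(Partial other)  { count += other.count; sum = sum.add(other.sum); }
            BigDecimal terminate(int resultScale) {
                return sum.divide(BigDecimal.valueOf(count), resultScale, RoundingMode.HALF_UP);
            }
        }

        public static void main(String[] args) {
            Partial mapper1 = new Partial();
            mapper1.iterate(new BigDecimal("10.25"));
            mapper1.iterate(new BigDecimal("3.75"));

            Partial mapper2 = new Partial();
            mapper2.iterate(new BigDecimal("6.00"));

            Partial reducer = new Partial();
            reducer.merge(mapper1);
            reducer.merge(mapper2);

            // Only here, at the final division, does extra scale help:
            // prints 6.666667 (20.00 / 3 at scale 6).
            System.out.println(reducer.terminate(6));
        }
    }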

Thanks,
~Remus


Re: A question about the derivation of intermediate sum field for decimal average aggregates

Posted by Xuefu Zhang <xz...@cloudera.com>.
Remus,

Thanks for looking into this. You're right that the sum() result doesn't
increase the scale, but have you seen the sum UDF return a wrong scale?

As to the implementation of the avg UDF, the object inspector for the sum
field is initialized with scale + 4, which might not be necessary but is
perhaps harmless. The same object inspector is also used for the average
result, which gives the correct type. I guess it's possible to separate
this into two object inspectors, one for the sum field and one for the avg
result, but the difference might be subtle and questionable. This is
because the data may not comply with the metadata specified for Hive
tables. Thus, I'm not sure whether truncating data before it is summed is
the right behavior.
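
To illustrate the concern (the values and scales below are made up): if the
stored data carries more fractional digits than the declared column scale,
rounding each value to the declared scale before summing changes the total.

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public class TruncateBeforeSumDemo {
        public static void main(String[] args) {
            // Data that does not comply with the declared metadata: the column
            // is declared decimal(10,2) but the stored values have 4 fractional
            // digits.
            BigDecimal[] rows = { new BigDecimal("1.0049"), new BigDecimal("2.0049") };

            BigDecimal exact = BigDecimal.ZERO;
            BigDecimal truncated = BigDecimal.ZERO;
            for (BigDecimal v : rows) {
                exact = exact.add(v);
                truncated = truncated.add(v.setScale(2, RoundingMode.HALF_UP));
            }

            System.out.println("summed as stored        : " + exact);      // 3.0098
            System.out.println("rounded to scale 2 first: " + truncated);  // 3.00
        }
    }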

Do you have a use case that suggests one is better than the other?

--Xuefu

