Posted to dev@kylin.apache.org by alex schufo <al...@gmail.com> on 2015/08/05 16:57:57 UTC

Negative number in SUM result and Kylin results not matching exactly Hive results

I have a simple cube with a Fact Table and a Lookup Table. The Fact Table
includes a counter column. The Lookup Table includes a region property, which
is the highest level of the hierarchy and can therefore accumulate quite a
lot of data when summing the counter by region.

I am doing a query like this:

select sum(MY_COUNTER) as tot_count, REGION
from FACT_TABLE
join LOOKUP_TABLE
on FACT_TABLE.FK = LOOKUP_TABLE.PK
where date = '2015-01-01'
group by REGION


I get the following result:

-917,164,421 EMEA  --> negative result!
777,795,808  AMER
514,879,134  APAC
117,814,485  LATAM

I ran the exact same query in Hive and got the following result:

3,381,591,228  EMEA  --> big number, but smaller than the BIGINT limit
778,718,563    AMER  --> 922,755 difference from the Kylin result
520,253,610    APAC  --> 5,374,476 difference from the Kylin result
117,913,857    LATAM --> 99,372 difference from the Kylin result

Based on that, it seems the limit being hit is the int limit 2^31-1
(2,147,483,647) and that my negative result comes from an overflow.
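The overflow hypothesis is easy to check with quick arithmetic: a signed 32-bit accumulator wraps modulo 2^32. A minimal sketch in plain Python (not Kylin code), applied to the Hive EMEA total:

```python
# Reinterpret an arbitrarily large integer as a signed 32-bit value,
# the way a two's-complement 32-bit accumulator would wrap on overflow.
def wrap_int32(n: int) -> int:
    return (n + 2**31) % 2**32 - 2**31

hive_emea = 3_381_591_228      # the EMEA total reported by Hive
print(wrap_int32(hive_emea))   # -913376068
```

The wrapped value, -913,376,068, is negative and in the same ballpark as the -917,164,421 Kylin returned; the remaining gap is consistent with the smaller per-region discrepancies, which look like a separate data issue.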

In my cube I defined this measure as a bigint:

Expression    Param Type    Param Value       Return Type

SUM              column            AVAIL_COUNT    bigint

The other thing that worries me a bit is that the other numbers are not
100% accurate either (up to a 5,374,476 difference!)

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Li Yang <li...@apache.org>.
> Not possible to export and share I'm afraid as it's sensitive information

Understood. Anyway, we will try to reproduce this with big sums in local
test data, and also keep an eye on other reports of incorrect sums.

You may also try count(*) to compare Kylin and Hive. If that is also
different, then most likely the data sets are different.
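The suggested count(*) reconciliation can be sketched as follows: run the same grouped count in both systems and diff the per-region numbers. Illustrative Python with made-up counts, not actual output from this thread:

```python
# Compare per-region row counts from two query engines to see whether
# the underlying data sets differ. The numbers below are placeholders.
def diff_counts(kylin: dict, hive: dict) -> dict:
    """Return region -> (kylin_count, hive_count) for mismatched regions."""
    regions = set(kylin) | set(hive)
    return {r: (kylin.get(r), hive.get(r))
            for r in regions if kylin.get(r) != hive.get(r)}

kylin_counts = {"EMEA": 1000, "AMER": 800, "APAC": 500}
hive_counts  = {"EMEA": 1000, "AMER": 801, "APAC": 500}
print(diff_counts(kylin_counts, hive_counts))  # {'AMER': (800, 801)}
```

If this diff is empty but the SUMs still disagree, the discrepancy is in the aggregation rather than in the row sets.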

On Thu, Aug 6, 2015 at 6:14 PM, alex schufo <al...@gmail.com> wrote:

> The test table has billions of records and is in the hundreds of GB. For
> this test query, however, I am restricting to only one day of data, which
> makes it some millions of records.
>
> Not possible to export and share, I'm afraid, as it's sensitive
> information.
>
> On Thu, Aug 6, 2015 at 10:42 AM, Li Yang <li...@apache.org> wrote:
>
> > Interesting... I'm thinking about how to reproduce your problem so we
> > can help.
> >
> > How big is the test hive table? Possible to export and share it with us?
> >
> > On Thu, Aug 6, 2015 at 5:23 PM, alex schufo <al...@gmail.com> wrote:
> >
> > > This is a QA environment where I fully control the Hive tables. I
> > > loaded this dataset manually and haven't made any change since. The
> > > cube itself was created yesterday, so no, I don't see any Hive update
> > > after the cube build.
> > >
> > > I haven't had the chance to modify the Hive table to use BIGINT as
> > > well, but will do when I have time.
> > >
> > > On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <li...@apache.org> wrote:
> > >
> > > > Integer overflow is possible. I do remember some FIXME there.
> > > >
> > > > An incorrect SUM is quite unlikely; a regression test reconciles the
> > > > SUM result with an H2 database every day. Maybe the hive table was
> > > > updated after the cube build? Try refreshing the cube to ensure it
> > > > captures the latest hive data.
> > > >
> > > > On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <ma...@apache.org> wrote:
> > > >
> > > > > I'll try to reproduce the issue in my own environment.
> > > > >
> > > > > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <lu...@gmail.com> wrote:
> > > > >
> > > > > > That's fine. Please change your hive schema first to make sure
> > > > > > they are the same, then try again.
> > > > > >
> > > > > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <alexschufo@gmail.com> wrote:
> > > > > >
> > > > > > > I am using 0.7.2.
> > > > > > >
> > > > > > > The Hive column is int. Does that mean I should modify the
> > > > > > > existing Hive table to move to BIGINT as well, to have the
> > > > > > > Kylin count as a BIGINT?
> > > > > > >
> > > > > > > I probably won't be allowed to share this kind of data outside
> > > > > > > my organization, unfortunately.
> > > > > > >
> > > > > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <mahongbin@apache.org> wrote:
> > > > > > >
> > > > > > > > In addition to what Luke asked, can you also share which
> > > > > > > > version you are using? We'll look into this.
> > > > > > > >
> > > > > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <luke.hq@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > What's the data type of that column in your hive table?
> > > > > > > > > We recommend using the same type as your measure
> > > > > > > > > definition.
> > > > > > > > >
> > > > > > > > > For data quality, we have an approach to ensure results
> > > > > > > > > 100% match the source data; there's an H2 comparison in
> > > > > > > > > our test cases. But yes, let's double-check the test case
> > > > > > > > > for huge numbers like in your case.
> > > > > > > > >
> > > > > > > > > Is it possible to share your data and model with us, if
> > > > > > > > > not sensitive? It would be better to use that for
> > > > > > > > > verification and testing here.

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by alex schufo <al...@gmail.com>.
The test table has billions of records and is in the hundreds of GB. For
this test query, however, I am restricting to only one day of data, which
makes it some millions of records.

Not possible to export and share, I'm afraid, as it's sensitive information.


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Li Yang <li...@apache.org>.
Interesting... I'm thinking about how to reproduce your problem so we can help.

How big is the test hive table? Possible to export and share it with us?


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by hongbin ma <ma...@apache.org>.
awesome!




-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Luke Han <lu...@gmail.com>.
Glad to see the issue is resolved. Please feel free to report any other
issues.

Thank you very much, and enjoy Kylin :)


Best Regards!
---------------------

Luke Han


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by alex schufo <al...@gmail.com>.
I just installed Kylin 1.0 and I can confirm the negative sum issue is
solved (https://issues.apache.org/jira/browse/KYLIN-934), thanks a lot!


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Luke Han <lu...@gmail.com>.
Soon, in this month.

Thanks.


Best Regards!
---------------------

Luke Han


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by alex schufo <al...@gmail.com>.
Cool! Thanks!

I will try to test this. When is the release for 0.7.3 planned?


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by hongbin ma <ma...@apache.org>.
Both on 0.7 and 0.8; the issue is already updated.




-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Luke Han <lu...@gmail.com>.
Cool!!!

Both 0.7 and 0.8 branches? Could you please help update the details under
KYLIN-934?

Thanks.


Best Regards!
---------------------

Luke Han

On Tue, Aug 11, 2015 at 5:30 PM, hongbin ma <ma...@apache.org> wrote:

> the issue is fixed on https://issues.apache.org/jira/browse/KYLIN-934
>
> On Tue, Aug 11, 2015 at 4:37 PM, Li Yang <li...@apache.org> wrote:
>
> > Have discussed with Mahone, we will enlarge the data type (e.g. from int
> to
> > bigint) when declaring the table to calcite. So the table descriptor
> remain
> > in sync with Hive table.
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by hongbin ma <ma...@apache.org>.
The issue is fixed: https://issues.apache.org/jira/browse/KYLIN-934

On Tue, Aug 11, 2015 at 4:37 PM, Li Yang <li...@apache.org> wrote:

> Have discussed with Mahone, we will enlarge the data type (e.g. from int to
> bigint) when declaring the table to calcite. So the table descriptor remain
> in sync with Hive table.
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Li Yang <li...@apache.org>.
Have discussed with Mahone; we will enlarge the data type (e.g. from int to
bigint) when declaring the table to Calcite, so the table descriptor
remains in sync with the Hive table.
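The widening described above can be sketched roughly as follows. This is an illustrative Java snippet only, not the actual Kylin or Calcite code; the class and method names are hypothetical:

```java
// Illustrative sketch only: widen narrow Hive numeric types before
// declaring the table to the query engine, so SUM aggregates cannot
// overflow the declared column type. Not the actual Kylin implementation.
import java.util.Map;

public class TypeWidener {
    private static final Map<String, String> WIDEN = Map.of(
            "tinyint", "bigint",
            "smallint", "bigint",
            "int", "bigint",
            "float", "double");

    /** Returns the widened SQL type for a declared Hive type. */
    static String widen(String hiveType) {
        return WIDEN.getOrDefault(hiveType.toLowerCase(), hiveType);
    }

    public static void main(String[] args) {
        System.out.println(widen("int"));    // bigint
        System.out.println(widen("bigint")); // bigint (already wide enough)
    }
}
```

The point of the design is that the Hive table definition itself stays untouched; only the type reported to the query layer is widened.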

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
I handled several such cases in the eBay deployment, and usually it is a
type overflow. After changing the type in Hive (INT -> BIGINT, or FLOAT ->
DOUBLE), you also need to sync the table schema to Kylin again. Just give
it a try and let us know whether it solves your issue.

@yang, I do think Kylin should always use BIGINT or DOUBLE for the measure
columns, instead of using the definition in Hive; was there any concern
about this?

On 8/6/15, 5:23 PM, "alex schufo" <al...@gmail.com> wrote:

>This is a QA environment where I fully control the Hive tables and I
>loaded
>this dataset manually and didn't make any change since. The cube itself
>has
>been created yesterday so no I don't see any Hive update after the cube
>build.
>
>I didn't have the chance to modify the Hive table use BIGINT as well but
>will do when I have time.
>
>On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <li...@apache.org> wrote:
>
>> Integer overflow is possible. I do remember some FIXME there.
>>
>> Incorrect SUM is quite impossible. Regression test reconciles SUM result
>> with H2 database every day. Maybe the hive table is updated after cube
>> build? Try refresh cube to ensure it captures latest hive data.
>>
>>
>>
>> On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <ma...@apache.org> wrote:
>>
>> > I'll try to reproduce the issue in my own environment
>> >
>> > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <lu...@gmail.com> wrote:
>> >
>> > > That's fine, please change your hive schema first to make sure they
>>are
>> > > same,
>> > > then try again.
>> > >
>> > > Thanks.
>> > >
>> > >
>> > >
>> > > Best Regards!
>> > > ---------------------
>> > >
>> > > Luke Han
>> > >
>> > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <al...@gmail.com>
>> > wrote:
>> > >
>> > > > I am using 0.7.2.
>> > > >
>> > > > The Hive column is int, does that mean I should try to modify the
>> > > existing
>> > > > Hive table to move to Bigint as well to have the Kylin count as a
>> > Bigint?
>> > > >
>> > > > I probably won't be allowed to share this kind of data outside my
>> > > > organization unfortunately.
>> > > >
>> > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org>
>> > wrote:
>> > > >
>> > > > > except for what Luke asked, can you also share which version are
>> you
>> > > > using?
>> > > > > We'll look into this
>> > > > >
>> > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com>
>> wrote:
>> > > > >
>> > > > > > What's the data type in your hive table of that column?
>> > > > > > We recommend to use same as your measure definition.
>> > > > > >
>> > > > > > For data quality, we have approach to ensure it will 100%
>> > > > > > match with source data, there's H2 comparison in our testing
>> case.
>> > > > > > But yes, let's double check the test case about huge number
>>like
>> > your
>> > > > > case
>> > > > > >
>> > > > > > Is it possible to share your data and model to us if not
>> sensitive?
>> > > > > > It will be better to using that to do the verify and testing
>> here.
>> > > > > >
>> > > > > > Thanks.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > Best Regards!
>> > > > > > ---------------------
>> > > > > >
>> > > > > > Luke Han
>> > > > > >
>> > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <
>> alexschufo@gmail.com
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > > > I have a simple cube with a Fact Table and a Lookup Table.
>>The
>> > Fact
>> > > > > Table
>> > > > > > > includes some counter. The Lookup Table includes a region
>> > property
>> > > > > which
>> > > > > > is
>> > > > > > > the highest level of the hierarchy and therefore can
>>accumulate
>> > > > quite a
>> > > > > > lot
>> > > > > > > of data when summing the counter by region.
>> > > > > > >
>> > > > > > > I am doing a query like this:
>> > > > > > >
>> > > > > > > select sum(MY_COUNTER) as tot_count, REGION
>> > > > > > >
>> > > > > > > from FACT_TABLE
>> > > > > > >
>> > > > > > > join LOOKUP_TABLE
>> > > > > > >
>> > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
>> > > > > > >
>> > > > > > > where date = '2015-01-01'
>> > > > > > >
>> > > > > > > group by REGION
>> > > > > > >
>> > > > > > >
>> > > > > > > I get the following result:
>> > > > > > >
>> > > > > > > -917,164,421 EMEA --> negative result!
>> > > > > > >
>> > > > > > > 777,795,808 AMER
>> > > > > > >
>> > > > > > > 514,879,134 APAC
>> > > > > > >
>> > > > > > > 117,814,485 LATAM
>> > > > > > >
>> > > > > > > I ran the exact same query in Hive and got the following
>> result:
>> > > > > > >
>> > > > > > > 3,381,591,228  EMEA  --> big number but smaller than the
>>BIGINT
>> > > limit
>> > > > > > >
>> > > > > > > 778,718,563    AMER  --> 922,755 difference with the Kylin
>> result
>> > > > > > >
>> > > > > > > 520,253,610    APAC   --> 5,374,476 difference with the
>>Kylin
>> > > result
>> > > > > > >
>> > > > > > > 117,913,857    LATAM --> 99,372 difference with the Kylin
>> result
>> > > > > > >
>> > > > > > > Based on that it seems that the limit is the int limit
>>2^31-1
>> > > > > > > (2,147,483,647) and that my negative result comes from an
>> > overflow.
>> > > > > > >
>> > > > > > > In my cube I defined this measure as a bigint:
>> > > > > > >
>> > > > > > > Expression    Param Type    Param Value       Return Type
>> > > > > > >
>> > > > > > > SUM              column            AVAIL_COUNT    bigint
>> > > > > > >
>> > > > > > > The other thing that worries me a bit is that the other
>>numbers
>> > are
>> > > > not
>> > > > > > > 100% accurate either (>100,000,000 difference!)
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Regards,
>> > > > >
>> > > > > *Bin Mahone | 马洪宾*
>> > > > > Apache Kylin: http://kylin.io
>> > > > > Github: https://github.com/binmahone
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Regards,
>> >
>> > *Bin Mahone | 马洪宾*
>> > Apache Kylin: http://kylin.io
>> > Github: https://github.com/binmahone
>> >
>>


Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by alex schufo <al...@gmail.com>.
This is a QA environment where I fully control the Hive tables; I loaded
this dataset manually and haven't made any changes since. The cube itself
was created yesterday, so no, there was no Hive update after the cube
build.

I haven't had the chance to modify the Hive table to use BIGINT yet, but
will do so when I have time.

On Thu, Aug 6, 2015 at 10:17 AM, Li Yang <li...@apache.org> wrote:

> Integer overflow is possible. I do remember some FIXME there.
>
> Incorrect SUM is quite impossible. Regression test reconciles SUM result
> with H2 database every day. Maybe the hive table is updated after cube
> build? Try refresh cube to ensure it captures latest hive data.
>
>
>
> On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <ma...@apache.org> wrote:
>
> > I'll try to reproduce the issue in my own environment
> >
> > On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <lu...@gmail.com> wrote:
> >
> > > That's fine, please change your hive schema first to make sure they are
> > > same,
> > > then try again.
> > >
> > > Thanks.
> > >
> > >
> > >
> > > Best Regards!
> > > ---------------------
> > >
> > > Luke Han
> > >
> > > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <al...@gmail.com>
> > wrote:
> > >
> > > > I am using 0.7.2.
> > > >
> > > > The Hive column is int, does that mean I should try to modify the
> > > existing
> > > > Hive table to move to Bigint as well to have the Kylin count as a
> > Bigint?
> > > >
> > > > I probably won't be allowed to share this kind of data outside my
> > > > organization unfortunately.
> > > >
> > > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org>
> > wrote:
> > > >
> > > > > except for what Luke asked, can you also share which version are
> you
> > > > using?
> > > > > We'll look into this
> > > > >
> > > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com>
> wrote:
> > > > >
> > > > > > What's the data type in your hive table of that column?
> > > > > > We recommend to use same as your measure definition.
> > > > > >
> > > > > > For data quality, we have approach to ensure it will 100%
> > > > > > match with source data, there's H2 comparison in our testing
> case.
> > > > > > But yes, let's double check the test case about huge number like
> > your
> > > > > case
> > > > > >
> > > > > > Is it possible to share your data and model to us if not
> sensitive?
> > > > > > It will be better to using that to do the verify and testing
> here.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best Regards!
> > > > > > ---------------------
> > > > > >
> > > > > > Luke Han
> > > > > >
> > > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <
> alexschufo@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > I have a simple cube with a Fact Table and a Lookup Table. The
> > Fact
> > > > > Table
> > > > > > > includes some counter. The Lookup Table includes a region
> > property
> > > > > which
> > > > > > is
> > > > > > > the highest level of the hierarchy and therefore can accumulate
> > > > quite a
> > > > > > lot
> > > > > > > of data when summing the counter by region.
> > > > > > >
> > > > > > > I am doing a query like this:
> > > > > > >
> > > > > > > select sum(MY_COUNTER) as tot_count, REGION
> > > > > > >
> > > > > > > from FACT_TABLE
> > > > > > >
> > > > > > > join LOOKUP_TABLE
> > > > > > >
> > > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > > > > >
> > > > > > > where date = '2015-01-01'
> > > > > > >
> > > > > > > group by REGION
> > > > > > >
> > > > > > >
> > > > > > > I get the following result:
> > > > > > >
> > > > > > > -917,164,421 EMEA --> negative result!
> > > > > > >
> > > > > > > 777,795,808 AMER
> > > > > > >
> > > > > > > 514,879,134 APAC
> > > > > > >
> > > > > > > 117,814,485 LATAM
> > > > > > >
> > > > > > > I ran the exact same query in Hive and got the following
> result:
> > > > > > >
> > > > > > > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT
> > > limit
> > > > > > >
> > > > > > > 778,718,563    AMER  --> 922,755 difference with the Kylin
> result
> > > > > > >
> > > > > > > 520,253,610    APAC   --> 5,374,476 difference with the Kylin
> > > result
> > > > > > >
> > > > > > > 117,913,857    LATAM --> 99,372 difference with the Kylin
> result
> > > > > > >
> > > > > > > Based on that it seems that the limit is the int limit 2^31-1
> > > > > > > (2,147,483,647) and that my negative result comes from an
> > overflow.
> > > > > > >
> > > > > > > In my cube I defined this measure as a bigint:
> > > > > > >
> > > > > > > Expression    Param Type    Param Value       Return Type
> > > > > > >
> > > > > > > SUM              column            AVAIL_COUNT    bigint
> > > > > > >
> > > > > > > The other thing that worries me a bit is that the other numbers
> > are
> > > > not
> > > > > > > 100% accurate either (>100,000,000 difference!)
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > *Bin Mahone | 马洪宾*
> > > > > Apache Kylin: http://kylin.io
> > > > > Github: https://github.com/binmahone
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Li Yang <li...@apache.org>.
Integer overflow is possible. I do remember some FIXME there.

An incorrect SUM is very unlikely, though. Our regression tests reconcile
SUM results with an H2 database every day. Maybe the Hive table was updated
after the cube build? Try refreshing the cube to ensure it captures the
latest Hive data.
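The int-accumulator overflow suspected in this thread is easy to reproduce in plain Java. This is a minimal standalone sketch, unrelated to Kylin's actual aggregation code; the per-row counter value is hypothetical:

```java
// Minimal demo of 32-bit accumulator overflow: four rows of one billion
// exceed Integer.MAX_VALUE (2,147,483,647), so an int sum wraps negative
// while a long (bigint) sum stays exact.
public class SumOverflowDemo {
    public static void main(String[] args) {
        int intSum = 0;
        long longSum = 0L;
        for (int row = 0; row < 4; row++) {
            int counter = 1_000_000_000; // hypothetical per-row counter value
            intSum += counter;           // wraps past 2^31 - 1
            longSum += counter;          // exact
        }
        System.out.println(intSum);  // -294967296 (overflowed)
        System.out.println(longSum); // 4000000000
    }
}
```

This mirrors the reported symptom: the EMEA sum (~3.4 billion in Hive) exceeds 2^31 - 1, so a 32-bit accumulator wraps to a negative value.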



On Thu, Aug 6, 2015 at 9:23 AM, hongbin ma <ma...@apache.org> wrote:

> I'll try to reproduce the issue in my own environment
>
> On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <lu...@gmail.com> wrote:
>
> > That's fine, please change your hive schema first to make sure they are
> > same,
> > then try again.
> >
> > Thanks.
> >
> >
> >
> > Best Regards!
> > ---------------------
> >
> > Luke Han
> >
> > On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <al...@gmail.com>
> wrote:
> >
> > > I am using 0.7.2.
> > >
> > > The Hive column is int, does that mean I should try to modify the
> > existing
> > > Hive table to move to Bigint as well to have the Kylin count as a
> Bigint?
> > >
> > > I probably won't be allowed to share this kind of data outside my
> > > organization unfortunately.
> > >
> > > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org>
> wrote:
> > >
> > > > except for what Luke asked, can you also share which version are you
> > > using?
> > > > We'll look into this
> > > >
> > > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com> wrote:
> > > >
> > > > > What's the data type in your hive table of that column?
> > > > > We recommend to use same as your measure definition.
> > > > >
> > > > > For data quality, we have approach to ensure it will 100%
> > > > > match with source data, there's H2 comparison in our testing case.
> > > > > But yes, let's double check the test case about huge number like
> your
> > > > case
> > > > >
> > > > > Is it possible to share your data and model to us if not sensitive?
> > > > > It will be better to using that to do the verify and testing here.
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Best Regards!
> > > > > ---------------------
> > > > >
> > > > > Luke Han
> > > > >
> > > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <alexschufo@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > I have a simple cube with a Fact Table and a Lookup Table. The
> Fact
> > > > Table
> > > > > > includes some counter. The Lookup Table includes a region
> property
> > > > which
> > > > > is
> > > > > > the highest level of the hierarchy and therefore can accumulate
> > > quite a
> > > > > lot
> > > > > > of data when summing the counter by region.
> > > > > >
> > > > > > I am doing a query like this:
> > > > > >
> > > > > > select sum(MY_COUNTER) as tot_count, REGION
> > > > > >
> > > > > > from FACT_TABLE
> > > > > >
> > > > > > join LOOKUP_TABLE
> > > > > >
> > > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > > > >
> > > > > > where date = '2015-01-01'
> > > > > >
> > > > > > group by REGION
> > > > > >
> > > > > >
> > > > > > I get the following result:
> > > > > >
> > > > > > -917,164,421 EMEA --> negative result!
> > > > > >
> > > > > > 777,795,808 AMER
> > > > > >
> > > > > > 514,879,134 APAC
> > > > > >
> > > > > > 117,814,485 LATAM
> > > > > >
> > > > > > I ran the exact same query in Hive and got the following result:
> > > > > >
> > > > > > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT
> > limit
> > > > > >
> > > > > > 778,718,563    AMER  --> 922,755 difference with the Kylin result
> > > > > >
> > > > > > 520,253,610    APAC   --> 5,374,476 difference with the Kylin
> > result
> > > > > >
> > > > > > 117,913,857    LATAM --> 99,372 difference with the Kylin result
> > > > > >
> > > > > > Based on that it seems that the limit is the int limit 2^31-1
> > > > > > (2,147,483,647) and that my negative result comes from an
> overflow.
> > > > > >
> > > > > > In my cube I defined this measure as a bigint:
> > > > > >
> > > > > > Expression    Param Type    Param Value       Return Type
> > > > > >
> > > > > > SUM              column            AVAIL_COUNT    bigint
> > > > > >
> > > > > > The other thing that worries me a bit is that the other numbers
> are
> > > not
> > > > > > 100% accurate either (>100,000,000 difference!)
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > *Bin Mahone | 马洪宾*
> > > > Apache Kylin: http://kylin.io
> > > > Github: https://github.com/binmahone
> > > >
> > >
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by hongbin ma <ma...@apache.org>.
I'll try to reproduce the issue in my own environment

On Wed, Aug 5, 2015 at 11:51 PM, Luke Han <lu...@gmail.com> wrote:

> That's fine, please change your hive schema first to make sure they are
> same,
> then try again.
>
> Thanks.
>
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <al...@gmail.com> wrote:
>
> > I am using 0.7.2.
> >
> > The Hive column is int, does that mean I should try to modify the
> existing
> > Hive table to move to Bigint as well to have the Kylin count as a Bigint?
> >
> > I probably won't be allowed to share this kind of data outside my
> > organization unfortunately.
> >
> > On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org> wrote:
> >
> > > except for what Luke asked, can you also share which version are you
> > using?
> > > We'll look into this
> > >
> > > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com> wrote:
> > >
> > > > What's the data type in your hive table of that column?
> > > > We recommend to use same as your measure definition.
> > > >
> > > > For data quality, we have approach to ensure it will 100%
> > > > match with source data, there's H2 comparison in our testing case.
> > > > But yes, let's double check the test case about huge number like your
> > > case
> > > >
> > > > Is it possible to share your data and model to us if not sensitive?
> > > > It will be better to using that to do the verify and testing here.
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Best Regards!
> > > > ---------------------
> > > >
> > > > Luke Han
> > > >
> > > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <al...@gmail.com>
> > > wrote:
> > > >
> > > > > I have a simple cube with a Fact Table and a Lookup Table. The Fact
> > > Table
> > > > > includes some counter. The Lookup Table includes a region property
> > > which
> > > > is
> > > > > the highest level of the hierarchy and therefore can accumulate
> > quite a
> > > > lot
> > > > > of data when summing the counter by region.
> > > > >
> > > > > I am doing a query like this:
> > > > >
> > > > > select sum(MY_COUNTER) as tot_count, REGION
> > > > >
> > > > > from FACT_TABLE
> > > > >
> > > > > join LOOKUP_TABLE
> > > > >
> > > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > > >
> > > > > where date = '2015-01-01'
> > > > >
> > > > > group by REGION
> > > > >
> > > > >
> > > > > I get the following result:
> > > > >
> > > > > -917,164,421 EMEA --> negative result!
> > > > >
> > > > > 777,795,808 AMER
> > > > >
> > > > > 514,879,134 APAC
> > > > >
> > > > > 117,814,485 LATAM
> > > > >
> > > > > I ran the exact same query in Hive and got the following result:
> > > > >
> > > > > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT
> limit
> > > > >
> > > > > 778,718,563    AMER  --> 922,755 difference with the Kylin result
> > > > >
> > > > > 520,253,610    APAC   --> 5,374,476 difference with the Kylin
> result
> > > > >
> > > > > 117,913,857    LATAM --> 99,372 difference with the Kylin result
> > > > >
> > > > > Based on that it seems that the limit is the int limit 2^31-1
> > > > > (2,147,483,647) and that my negative result comes from an overflow.
> > > > >
> > > > > In my cube I defined this measure as a bigint:
> > > > >
> > > > > Expression    Param Type    Param Value       Return Type
> > > > >
> > > > > SUM              column            AVAIL_COUNT    bigint
> > > > >
> > > > > The other thing that worries me a bit is that the other numbers are
> > not
> > > > > 100% accurate either (>100,000,000 difference!)
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > *Bin Mahone | 马洪宾*
> > > Apache Kylin: http://kylin.io
> > > Github: https://github.com/binmahone
> > >
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Luke Han <lu...@gmail.com>.
That's fine. Please change your Hive schema first to make sure the types
match, then try again.

Thanks.



Best Regards!
---------------------

Luke Han

On Wed, Aug 5, 2015 at 11:29 PM, alex schufo <al...@gmail.com> wrote:

> I am using 0.7.2.
>
> The Hive column is int, does that mean I should try to modify the existing
> Hive table to move to Bigint as well to have the Kylin count as a Bigint?
>
> I probably won't be allowed to share this kind of data outside my
> organization unfortunately.
>
> On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org> wrote:
>
> > except for what Luke asked, can you also share which version are you
> using?
> > We'll look into this
> >
> > On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com> wrote:
> >
> > > What's the data type in your hive table of that column?
> > > We recommend to use same as your measure definition.
> > >
> > > For data quality, we have approach to ensure it will 100%
> > > match with source data, there's H2 comparison in our testing case.
> > > But yes, let's double check the test case about huge number like your
> > case
> > >
> > > Is it possible to share your data and model to us if not sensitive?
> > > It will be better to using that to do the verify and testing here.
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Best Regards!
> > > ---------------------
> > >
> > > Luke Han
> > >
> > > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <al...@gmail.com>
> > wrote:
> > >
> > > > I have a simple cube with a Fact Table and a Lookup Table. The Fact
> > Table
> > > > includes some counter. The Lookup Table includes a region property
> > which
> > > is
> > > > the highest level of the hierarchy and therefore can accumulate
> quite a
> > > lot
> > > > of data when summing the counter by region.
> > > >
> > > > I am doing a query like this:
> > > >
> > > > select sum(MY_COUNTER) as tot_count, REGION
> > > >
> > > > from FACT_TABLE
> > > >
> > > > join LOOKUP_TABLE
> > > >
> > > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > > >
> > > > where date = '2015-01-01'
> > > >
> > > > group by REGION
> > > >
> > > >
> > > > I get the following result:
> > > >
> > > > -917,164,421 EMEA --> negative result!
> > > >
> > > > 777,795,808 AMER
> > > >
> > > > 514,879,134 APAC
> > > >
> > > > 117,814,485 LATAM
> > > >
> > > > I ran the exact same query in Hive and got the following result:
> > > >
> > > > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT limit
> > > >
> > > > 778,718,563    AMER  --> 922,755 difference with the Kylin result
> > > >
> > > > 520,253,610    APAC   --> 5,374,476 difference with the Kylin result
> > > >
> > > > 117,913,857    LATAM --> 99,372 difference with the Kylin result
> > > >
> > > > Based on that it seems that the limit is the int limit 2^31-1
> > > > (2,147,483,647) and that my negative result comes from an overflow.
> > > >
> > > > In my cube I defined this measure as a bigint:
> > > >
> > > > Expression    Param Type    Param Value       Return Type
> > > >
> > > > SUM              column            AVAIL_COUNT    bigint
> > > >
> > > > The other thing that worries me a bit is that the other numbers are
> not
> > > > 100% accurate either (>100,000,000 difference!)
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by alex schufo <al...@gmail.com>.
I am using 0.7.2.

The Hive column is int. Does that mean I should modify the existing Hive
table to use BIGINT as well, so that Kylin sums as a BIGINT?

I probably won't be allowed to share this kind of data outside my
organization unfortunately.

On Wed, Aug 5, 2015 at 4:22 PM, hongbin ma <ma...@apache.org> wrote:

> except for what Luke asked, can you also share which version are you using?
> We'll look into this
>
> On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com> wrote:
>
> > What's the data type in your hive table of that column?
> > We recommend to use same as your measure definition.
> >
> > For data quality, we have approach to ensure it will 100%
> > match with source data, there's H2 comparison in our testing case.
> > But yes, let's double check the test case about huge number like your
> case
> >
> > Is it possible to share your data and model to us if not sensitive?
> > It will be better to using that to do the verify and testing here.
> >
> > Thanks.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Best Regards!
> > ---------------------
> >
> > Luke Han
> >
> > On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <al...@gmail.com>
> wrote:
> >
> > > I have a simple cube with a Fact Table and a Lookup Table. The Fact
> Table
> > > includes some counter. The Lookup Table includes a region property
> which
> > is
> > > the highest level of the hierarchy and therefore can accumulate quite a
> > lot
> > > of data when summing the counter by region.
> > >
> > > I am doing a query like this:
> > >
> > > select sum(MY_COUNTER) as tot_count, REGION
> > >
> > > from FACT_TABLE
> > >
> > > join LOOKUP_TABLE
> > >
> > > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> > >
> > > where date = '2015-01-01'
> > >
> > > group by REGION
> > >
> > >
> > > I get the following result:
> > >
> > > -917,164,421 EMEA --> negative result!
> > >
> > > 777,795,808 AMER
> > >
> > > 514,879,134 APAC
> > >
> > > 117,814,485 LATAM
> > >
> > > I ran the exact same query in Hive and got the following result:
> > >
> > > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT limit
> > >
> > > 778,718,563    AMER  --> 922,755 difference with the Kylin result
> > >
> > > 520,253,610    APAC   --> 5,374,476 difference with the Kylin result
> > >
> > > 117,913,857    LATAM --> 99,372 difference with the Kylin result
> > >
> > > Based on that it seems that the limit is the int limit 2^31-1
> > > (2,147,483,647) and that my negative result comes from an overflow.
> > >
> > > In my cube I defined this measure as a bigint:
> > >
> > > Expression    Param Type    Param Value       Return Type
> > >
> > > SUM              column            AVAIL_COUNT    bigint
> > >
> > > The other thing that worries me a bit is that the other numbers are not
> > > 100% accurate either (>100,000,000 difference!)
> > >
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by hongbin ma <ma...@apache.org>.
In addition to what Luke asked, can you also share which version you are
using? We'll look into this.

On Wed, Aug 5, 2015 at 11:08 PM, Luke Han <lu...@gmail.com> wrote:

> What's the data type in your hive table of that column?
> We recommend to use same as your measure definition.
>
> For data quality, we have approach to ensure it will 100%
> match with source data, there's H2 comparison in our testing case.
> But yes, let's double check the test case about huge number like your case
>
> Is it possible to share your data and model to us if not sensitive?
> It will be better to using that to do the verify and testing here.
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <al...@gmail.com> wrote:
>
> > I have a simple cube with a Fact Table and a Lookup Table. The Fact Table
> > includes some counter. The Lookup Table includes a region property which
> is
> > the highest level of the hierarchy and therefore can accumulate quite a
> lot
> > of data when summing the counter by region.
> >
> > I am doing a query like this:
> >
> > select sum(MY_COUNTER) as tot_count, REGION
> >
> > from FACT_TABLE
> >
> > join LOOKUP_TABLE
> >
> > on FACT_TABLE.FK = LOOKUP_TABLE.PK
> >
> > where date = '2015-01-01'
> >
> > group by REGION
> >
> >
> > I get the following result:
> >
> > -917,164,421 EMEA --> negative result!
> >
> > 777,795,808 AMER
> >
> > 514,879,134 APAC
> >
> > 117,814,485 LATAM
> >
> > I ran the exact same query in Hive and got the following result:
> >
> > 3,381,591,228  EMEA  --> big number but smaller than the BIGINT limit
> >
> > 778,718,563    AMER  --> 922,755 difference with the Kylin result
> >
> > 520,253,610    APAC   --> 5,374,476 difference with the Kylin result
> >
> > 117,913,857    LATAM --> 99,372 difference with the Kylin result
> >
> > Based on that it seems that the limit is the int limit 2^31-1
> > (2,147,483,647) and that my negative result comes from an overflow.
> >
> > In my cube I defined this measure as a bigint:
> >
> > Expression    Param Type    Param Value       Return Type
> >
> > SUM              column            AVAIL_COUNT    bigint
> >
> > The other thing that worries me a bit is that the other numbers are not
> > 100% accurate either (>100,000,000 difference!)
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Negative number in SUM result and Kylin results not matching exactly Hive results

Posted by Luke Han <lu...@gmail.com>.
What's the data type of that column in your Hive table?
We recommend using the same type as your measure definition.

For data quality, we have an approach to ensure results match the source
data 100%: there is an H2 comparison in our test cases.
But yes, let's double-check the test cases for huge numbers like in your case.

Is it possible to share your data and model with us, if it's not sensitive?
It would be better to use that for verification and testing here.

Thanks.










Best Regards!
---------------------

Luke Han

On Wed, Aug 5, 2015 at 10:57 PM, alex schufo <al...@gmail.com> wrote:

> I have a simple cube with a Fact Table and a Lookup Table. The Fact Table
> includes a counter. The Lookup Table includes a region property, which is
> the highest level of the hierarchy and can therefore accumulate quite a lot
> of data when summing the counter by region.
>
> I am doing a query like this:
>
> select sum(MY_COUNTER) as tot_count, REGION
>
> from FACT_TABLE
>
> join LOOKUP_TABLE
>
> on FACT_TABLE.FK = LOOKUP_TABLE.PK
>
> where date = '2015-01-01'
>
> group by REGION
>
>
> I get the following result:
>
> -917,164,421 EMEA --> negative result!
>
> 777,795,808 AMER
>
> 514,879,134 APAC
>
> 117,814,485 LATAM
>
> I ran the exact same query in Hive and got the following result:
>
> 3,381,591,228  EMEA  --> big number but smaller than the BIGINT limit
>
> 778,718,563    AMER  --> 922,755 difference with the Kylin result
>
> 520,253,610    APAC   --> 5,374,476 difference with the Kylin result
>
> 117,913,857    LATAM --> 99,372 difference with the Kylin result
>
> Based on that it seems that the limit is the int limit 2^31-1
> (2,147,483,647) and that my negative result comes from an overflow.
>
> In my cube I defined this measure as a bigint:
>
> Expression    Param Type    Param Value       Return Type
>
> SUM              column            AVAIL_COUNT    bigint
>
> The other thing that worries me a bit is that the other numbers are not
> 100% accurate either (up to a 5,374,476 difference!)
>