You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Alok Singh <al...@cloudability.com> on 2016/03/22 07:29:02 UTC

Bug in GROUP BY with timestamp column

Environment:
* Phoenix 4.6
* Hbase 1.1.2
* Hadoop 2.7.1
* JDK 1.7

Table:
COST (
     ACCOUNT_ID VARCHAR NOT NULL,
     PRECISION TINYINT NOT NULL,
     START_DATE TIMESTAMP NOT NULL,
     SECONDARY_ACCOUNT_ID VARCHAR NOT NULL,
     TAG VARCHAR NOT NULL,
     VENDOR_ID VARCHAR NOT NULL,
     SERVICE VARCHAR NOT NULL,
....

CONSTRAINT PK PRIMARY KEY (
               ACCOUNT_ID,
               PRECISION,
               START_DATE,
               SECONDARY_ACCOUNT_ID,
               TAG,
               VENDOR_ID,
               SERVICE
) COMPRESSION='SNAPPY', SALT_BUCKETS=16;

When querying this table with a GROUP BY clause that contains the
'START_DATE" column, the results returned are incorrectly aggregated. I
find multiple rows with the same "START_DATE" which should have been
aggregated to a single row. The workaround is to do a TO_CHAR(START_DATE)
which causes the query to return the correct results.

e.g:
----------------
select start_date, account_id,
       sum(quantity) as total,
       sum(discounted_quantity) as disc_total
from COST
where
    start_date >= (to_date('2016-03-01'))
    and precision = 1
    and account_id in ('1234', '5678')
group start_date, account_id

--RESULT--
START_DATE                                ACCOUNT_ID
         TOTAL                                     DISC_TOTAL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-03-01 16:00:00.0                     1234
         312                                       0

....
2016-03-07 16:00:00.0                     1234
         312                                       0

2016-03-08 16:00:00.0                     1234
         312                                       0

2016-03-09 16:00:00.0                     1234
         216                                       0

2016-02-29 16:00:00.0                     5678
         194                                       24

2016-03-01 16:00:00.0                     5678
         262                                       0

....
2016-03-07 16:00:00.0                     5678
         237                                       48

2016-03-08 16:00:00.0                     5678
         178                                       0

2016-03-09 16:00:00.0                     1234
         96                                        0

2016-03-09 16:00:00.0                     5678
         173                                       50

2016-03-10 16:00:00.0                     1234
         324                                       0

....
2016-03-07 16:00:00.0                     5678
         178                                       24

2016-03-08 16:00:00.0                     5678
         218                                       0

2016-03-09 16:00:00.0                     5678
         218                                       48

^^^^^^^^^^^^^^^^^^^^^^
Repeated rows

--------------------
Workaround:
select to_char(start_date), account_id,
       sum(quantity) as total,
       sum(discounted_quantity) as disc_total
from COST
where
   start_date >= (to_date('2016-03-01'))
   and precision = 1
  and account_id in ('1234', '5678')
group to_char(start_date), account_id

--RESULT--
START_DATE                                ACCOUNT_ID
         TOTAL                                     DISC_TOTAL

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2016-03-01 00:00:00.000                   1234
         312                                       0

2016-03-01 00:00:00.000                   5678
         3,465                                     384

2016-03-02 00:00:00.000                   1234
         312                                       0

2016-03-02 00:00:00.000                   5678
         3,238                                     384

2016-03-03 00:00:00.000                   1234
         312                                       0

2016-03-03 00:00:00.000                   5678
         3,243                                     384

2016-03-04 00:00:00.000                   1234
         312                                       0

2016-03-04 00:00:00.000                   5678
         3,212                                     384

2016-03-05 00:00:00.000                   1234
         312                                       0

2016-03-05 00:00:00.000                   5678
         2,907                                     384

2016-03-06 00:00:00.000                   1234
         312                                       0

2016-03-06 00:00:00.000                   5678
         2,907                                     384

2016-03-07 00:00:00.000                   1234
         312                                       0

2016-03-07 00:00:00.000                   5678
         2,919                                     384

2016-03-08 00:00:00.000                   1234
         312                                       0

2016-03-08 00:00:00.000                   5678
         2,955                                     384

2016-03-09 00:00:00.000                   1234
         312                                       0

2016-03-09 00:00:00.000                   5678
         2,955                                     384

2016-03-10 00:00:00.000                   1234
         312                                       0

2016-03-10 00:00:00.000                   5678
         2,978                                     384

2016-03-11 00:00:00.000                   1234
         324                                       0

2016-03-11 00:00:00.000                   5678
         3,125                                     384

2016-03-12 00:00:00.000                   1234
         336                                       0

2016-03-12 00:00:00.000                   5678
         3,102                                     384

2016-03-13 00:00:00.000                   1234
         336                                       0

2016-03-13 00:00:00.000                   5678
         3,117                                     384

2016-03-14 00:00:00.000                   1234
         153                                       0

2016-03-14 00:00:00.000                   5678
         2,187                                     278


-----------------


Alok

alok@cloudability.com

Re: Bug in GROUP BY with timestamp column

Posted by James Taylor <ja...@apache.org>.
Alok,
Please file a JIRA with this info. We need a representative data set that
exhibits this bug - would it be possible to provide that? The smaller the
better.
Thanks,
James

On Monday, March 21, 2016, Alok Singh <al...@cloudability.com> wrote:

> Environment:
> * Phoenix 4.6
> * Hbase 1.1.2
> * Hadoop 2.7.1
> * JDK 1.7
>
> Table:
> COST (
>      ACCOUNT_ID VARCHAR NOT NULL,
>      PRECISION TINYINT NOT NULL,
>      START_DATE TIMESTAMP NOT NULL,
>      SECONDARY_ACCOUNT_ID VARCHAR NOT NULL,
>      TAG VARCHAR NOT NULL,
>      VENDOR_ID VARCHAR NOT NULL,
>      SERVICE VARCHAR NOT NULL,
> ....
>
> CONSTRAINT PK PRIMARY KEY (
>                ACCOUNT_ID,
>                PRECISION,
>                START_DATE,
>                SECONDARY_ACCOUNT_ID,
>                TAG,
>                VENDOR_ID,
>                SERVICE
> ) COMPRESSION='SNAPPY', SALT_BUCKETS=16;
>
> When querying this table with a GROUP BY clause that contains the
> 'START_DATE" column, the results returned are incorrectly aggregated. I
> find multiple rows with the same "START_DATE" which should have been
> aggregated to a single row. The workaround is to do a TO_CHAR(START_DATE)
> which causes the query to return the correct results.
>
> e.g:
> ----------------
> select start_date, account_id,
>        sum(quantity) as total,
>        sum(discounted_quantity) as disc_total
> from COST
> where
>     start_date >= (to_date('2016-03-01'))
>     and precision = 1
>     and account_id in ('1234', '5678')
> group start_date, account_id
>
> --RESULT--
> START_DATE                                ACCOUNT_ID
>          TOTAL                                     DISC_TOTAL
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 2016-03-01 16:00:00.0                     1234
>          312                                       0
>
> ....
> 2016-03-07 16:00:00.0                     1234
>          312                                       0
>
> 2016-03-08 16:00:00.0                     1234
>          312                                       0
>
> 2016-03-09 16:00:00.0                     1234
>          216                                       0
>
> 2016-02-29 16:00:00.0                     5678
>          194                                       24
>
> 2016-03-01 16:00:00.0                     5678
>          262                                       0
>
> ....
> 2016-03-07 16:00:00.0                     5678
>          237                                       48
>
> 2016-03-08 16:00:00.0                     5678
>          178                                       0
>
> 2016-03-09 16:00:00.0                     1234
>          96                                        0
>
> 2016-03-09 16:00:00.0                     5678
>          173                                       50
>
> 2016-03-10 16:00:00.0                     1234
>          324                                       0
>
> ....
> 2016-03-07 16:00:00.0                     5678
>          178                                       24
>
> 2016-03-08 16:00:00.0                     5678
>          218                                       0
>
> 2016-03-09 16:00:00.0                     5678
>          218                                       48
>
> ^^^^^^^^^^^^^^^^^^^^^^
> Repeated rows
>
> --------------------
> Workaround:
> select to_char(start_date), account_id,
>        sum(quantity) as total,
>        sum(discounted_quantity) as disc_total
> from COST
> where
>    start_date >= (to_date('2016-03-01'))
>    and precision = 1
>   and account_id in ('1234', '5678')
> group to_char(start_date), account_id
>
> --RESULT--
> START_DATE                                ACCOUNT_ID
>          TOTAL                                     DISC_TOTAL
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 2016-03-01 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-01 00:00:00.000                   5678
>          3,465                                     384
>
> 2016-03-02 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-02 00:00:00.000                   5678
>          3,238                                     384
>
> 2016-03-03 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-03 00:00:00.000                   5678
>          3,243                                     384
>
> 2016-03-04 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-04 00:00:00.000                   5678
>          3,212                                     384
>
> 2016-03-05 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-05 00:00:00.000                   5678
>          2,907                                     384
>
> 2016-03-06 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-06 00:00:00.000                   5678
>          2,907                                     384
>
> 2016-03-07 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-07 00:00:00.000                   5678
>          2,919                                     384
>
> 2016-03-08 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-08 00:00:00.000                   5678
>          2,955                                     384
>
> 2016-03-09 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-09 00:00:00.000                   5678
>          2,955                                     384
>
> 2016-03-10 00:00:00.000                   1234
>          312                                       0
>
> 2016-03-10 00:00:00.000                   5678
>          2,978                                     384
>
> 2016-03-11 00:00:00.000                   1234
>          324                                       0
>
> 2016-03-11 00:00:00.000                   5678
>          3,125                                     384
>
> 2016-03-12 00:00:00.000                   1234
>          336                                       0
>
> 2016-03-12 00:00:00.000                   5678
>          3,102                                     384
>
> 2016-03-13 00:00:00.000                   1234
>          336                                       0
>
> 2016-03-13 00:00:00.000                   5678
>          3,117                                     384
>
> 2016-03-14 00:00:00.000                   1234
>          153                                       0
>
> 2016-03-14 00:00:00.000                   5678
>          2,187                                     278
>
>
> -----------------
>
>
> Alok
>
> alok@cloudability.com
> <javascript:_e(%7B%7D,'cvml','alok@cloudability.com');>
>