You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Johannes Mayer (JIRA)" <ji...@apache.org> on 2017/09/11 07:59:00 UTC

[jira] [Updated] (HIVE-17499) Hive Cube Operator returns duplicate rows

     [ https://issues.apache.org/jira/browse/HIVE-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johannes Mayer updated HIVE-17499:
----------------------------------
    Description: 
The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below)


{code:sql}
insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01')
select year(ds), month(ds), day(ds), c8, c11, count(*)
from thesis.clickstream_landing
where ds = '2016-04-01'
group by year(ds), month(ds), day(ds), c8, c11
With Cube;
{code}

Then I check for duplicates:

{code:sql}
select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '01'
group by year, month, day, country, city
having count(*) > 1;
{code}

The result is:
year 	month 	day 	country 	city 	_c5
null 	null 	null 	null 	null 	4
null 	null 	1 	null 	null 	4
null 	4 	null 	null 	null 	4
null 	4 	1 	null 	null 	4
2016 	null 	null 	null 	null 	4
2016 	null 	1 	null 	null 	4
2016 	4 	null 	null 	null 	4
2016 	4 	1 	null 	null 	4


When i do the same thing in Pig, everything is fine:
{code:none}
DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader();

FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city;

CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city);

D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count;

STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02');
{code}

Then again I check for duplicates:
{code:sql}
select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '02'
group by year, month, day, country, city
having count(*) > 1;
{code}

And the result is empty as it should be.


  was:
The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below)


{code:sql}
insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01')
select year(ds), month(ds), day(ds), c8, c11, count(*)
from thesis.clickstream_landing
where ds = '2016-04-01'
group by year(ds), month(ds), day(ds), c8, c11
With Cube;
{code}

Then I check for duplicates:

{code:sql}
select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '01'
group by year, month, day, country, city
having count(*) > 1;
{code}

The result is:
year 	month 	day 	country 	city 	_c5
null 	null 	null 	null 	null 	4
null 	null 	1 	null 	null 	4
null 	4 	null 	null 	null 	4
null 	4 	1 	null 	null 	4
2016 	null 	null 	null 	null 	4
2016 	null 	1 	null 	null 	4
2016 	4 	null 	null 	null 	4
2016 	4 	1 	null 	null 	4


When i do the same thing in Pig, everything is fine:
{code:pig}
DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader();

FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city;

CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city);

D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count;

STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02');
{code}

Then again I check for duplicates:
{code:sql}
select year, month, day, country, city, count (*) from thesis.clickstream_export
where ds_year = '2016' and ds_month = '04' and ds_day = '02'
group by year, month, day, country, city
having count(*) > 1;
{code}

And the result is empty as it should be.



> Hive Cube Operator returns duplicate rows
> -----------------------------------------
>
>                 Key: HIVE-17499
>                 URL: https://issues.apache.org/jira/browse/HIVE-17499
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>         Environment: Hortonworks HDP 2.6.0.1
>            Reporter: Johannes Mayer
>            Priority: Critical
>
> The cube Operator returns duplicate rows, when it shouldnt. I ran the same query in Pig and got the correct result. (see the example below)
> {code:sql}
> insert overwrite table thesis.clickstream_export PARTITION (ds_year = '2016' , ds_month = '04' , ds_day = '01')
> select year(ds), month(ds), day(ds), c8, c11, count(*)
> from thesis.clickstream_landing
> where ds = '2016-04-01'
> group by year(ds), month(ds), day(ds), c8, c11
> With Cube;
> {code}
> Then I check for duplicates:
> {code:sql}
> select year, month, day, country, city, count (*) from thesis.clickstream_export
> where ds_year = '2016' and ds_month = '04' and ds_day = '01'
> group by year, month, day, country, city
> having count(*) > 1;
> {code}
> The result is:
> year 	month 	day 	country 	city 	_c5
> null 	null 	null 	null 	null 	4
> null 	null 	1 	null 	null 	4
> null 	4 	null 	null 	null 	4
> null 	4 	1 	null 	null 	4
> 2016 	null 	null 	null 	null 	4
> 2016 	null 	1 	null 	null 	4
> 2016 	4 	null 	null 	null 	4
> 2016 	4 	1 	null 	null 	4
> When i do the same thing in Pig, everything is fine:
> {code:none}
> DATA = LOAD 'thesis.clickstream_landing' USING org.apache.hive.hcatalog.pig.HCatLoader();
> FILTERED = FOREACH DATA GENERATE GetYear(ToDate(ds, 'yyyy-MM-dd')) AS year, GetMonth(ToDate(ds, 'yyyy-MM-dd')) AS month, GetDay(ToDate(ds, 'yyyy-MM-dd')) AS day, c8 AS country, c11 AS city;
> CUBED = CUBE FILTERED BY CUBE(year, month, day, country, city);
> D = FOREACH CUBED GENERATE FLATTEN(group) AS (year, month, day, country, city), COUNT_STAR(cube) As click_count;
> STORE D INTO 'thesis.clickstream_export' USING org.apache.hive.hcatalog.pig.HCatStorer('ds_year=2016, ds_month=04, ds_day=02');
> {code}
> Then again I check for duplicates:
> {code:sql}
> select year, month, day, country, city, count (*) from thesis.clickstream_export
> where ds_year = '2016' and ds_month = '04' and ds_day = '02'
> group by year, month, day, country, city
> having count(*) > 1;
> {code}
> And the result is empty as it should be.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)