You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "dima machlin (JIRA)" <ji...@apache.org> on 2014/05/12 16:10:14 UTC
[jira] [Updated] (HIVE-7045) Wrong results in multi-table insert aggregating without group by clause

     [ https://issues.apache.org/jira/browse/HIVE-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dima machlin updated HIVE-7045:
-------------------------------

    Description: 
The scenario :

CREATE  TABLE t1 (a int, b int);
CREATE  TABLE t2 (cnt int) PARTITIONED BY (var_name string);

insert into table t1 select 1,1 from asd limit 1;
insert into table t1 select 2,2 from asd limit 1;

t1 contains :
1 1
2 2

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt 
insert overwrite table t2 partition(var_name='b') select count(b) cnt ;

select * from t2;
returns : 
2 a
2 b

as expected.

Setting the number of reducers higher than 1 :

set mapred.reduce.tasks=2;

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt
insert overwrite table t2 partition(var_name='b') select count(b) cnt;

select * from t2;
1	a
1	a
1	b
1	b

Wrong results.

This happens when ever t1 is big enough to automatically generate more than 1 reducers and without specifying it directly.

adding "group by 1" in the end of each insert solves the problem :

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt group by 1
insert overwrite table t2 partition(var_name='b') select count(b) cnt group by 1;

generates : 
2 a
2 b

This should work without the group by...
The number of rows for each partition will be the amount of reducers.
Each reducer calculated a sub total of the count.



  was:
The scenario :

CREATE  TABLE t1 (a int, b int);
CREATE  TABLE t2 (cnt int) PARTITIONED BY (var_name string);

insert into table t1 select 1,1 from asd limit 1;
insert into table t1 select 2,2 from asd limit 1;

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt 
insert overwrite table t2 partition(var_name='b') select count(b) cnt ;

select * from t2;
returns : 
2 a
2 b

as expected.

Setting the number of reducers higher than 1 :

set mapred.reduce.tasks=2;

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt
insert overwrite table t2 partition(var_name='b') select count(b) cnt;

select * from t2;
1	a
1	a
1	b
1	b

Wrong results.

This happens when ever t1 is big enough to automatically generate more than 1 reducers and without specifying it directly.

adding "group by 1" in the end of each insert solves the problem :

from  t1
insert overwrite table t2 partition(var_name='a') select count(a) cnt group by 1
insert overwrite table t2 partition(var_name='b') select count(b) cnt group by 1;

generates : 
2 a
2 b

This should work without the group by...
The number of rows for each partition will be the amount of reducers.
Each reducer calculated a sub total of the count.




> Wrong results in multi-table insert aggregating without group by clause
> -----------------------------------------------------------------------
>
>                 Key: HIVE-7045
>                 URL: https://issues.apache.org/jira/browse/HIVE-7045
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.12.0
>            Reporter: dima machlin
>
> The scenario :
> CREATE  TABLE t1 (a int, b int);
> CREATE  TABLE t2 (cnt int) PARTITIONED BY (var_name string);
> insert into table t1 select 1,1 from asd limit 1;
> insert into table t1 select 2,2 from asd limit 1;
> t1 contains :
> 1 1
> 2 2
> from  t1
> insert overwrite table t2 partition(var_name='a') select count(a) cnt 
> insert overwrite table t2 partition(var_name='b') select count(b) cnt ;
> select * from t2;
> returns : 
> 2 a
> 2 b
> as expected.
> Setting the number of reducers higher than 1 :
> set mapred.reduce.tasks=2;
> from  t1
> insert overwrite table t2 partition(var_name='a') select count(a) cnt
> insert overwrite table t2 partition(var_name='b') select count(b) cnt;
> select * from t2;
> 1	a
> 1	a
> 1	b
> 1	b
> Wrong results.
> This happens when ever t1 is big enough to automatically generate more than 1 reducers and without specifying it directly.
> adding "group by 1" in the end of each insert solves the problem :
> from  t1
> insert overwrite table t2 partition(var_name='a') select count(a) cnt group by 1
> insert overwrite table t2 partition(var_name='b') select count(b) cnt group by 1;
> generates : 
> 2 a
> 2 b
> This should work without the group by...
> The number of rows for each partition will be the amount of reducers.
> Each reducer calculated a sub total of the count.



--
This message was sent by Atlassian JIRA
(v6.2#6252)