You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Pavan Gadam Manohar (JIRA)" <ji...@apache.org> on 2013/12/23 20:46:51 UTC

[jira] [Created] (HIVE-6099) Multi insert does not work properly with distinct count

Pavan Gadam Manohar created HIVE-6099:
-----------------------------------------

             Summary: Multi insert does not work properly with distinct count
                 Key: HIVE-6099
                 URL: https://issues.apache.org/jira/browse/HIVE-6099
             Project: Hive
          Issue Type: Bug
    Affects Versions: 0.10.0, 0.9.0
            Reporter: Pavan Gadam Manohar


Need 2 rows to reproduce this Bug. Here are the steps.

Step 1) Create a table Table_A
CREATE EXTERNAL TABLE Table_A
(
user string
, type int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
 STORED AS RCFILE
LOCATION '/hive/<path>/Table_A';

Step 2) Scenario: Lets us say consider user tommy belong to both usertypes 111 and 123. Insert 2 records into the table created above.

select * from  Table_A;

hive>  select * from table_a;
OK
tommy   123     2013-12-02
tommy   111     2013-12-02

Step 3) Create 2 destination tables to simulate multi-insert.
CREATE EXTERNAL TABLE dest_Table_A
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
 STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_A';
 
CREATE EXTERNAL TABLE dest_Table_B
(
p_date string
, Distinct_Users int
, Type111Users int
, Type123Users int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|' 
 STORED AS RCFILE
LOCATION '/hive/<path>/dest_Table_B';

Step 4) Multi insert statement
from Table_A a
INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt
 
INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
select a.dt
,count(distinct a.user) as AllDist
,count(distinct case when a.type = 111 then a.user else null end) as Type111User
,count(distinct case when a.type != 111 then a.user else null end) as Type123User
group by a.dt
;
 
Step 5) Verify results.
hive>  select * from dest_table_a;
OK
2013-12-02      2       1       1       2013-12-02
Time taken: 0.116 seconds
hive>  select * from dest_table_b;
OK
2013-12-02      2       1       1       2013-12-02
Time taken: 0.13 seconds

Conclusion: Hive gives a count of 2 for distinct users although there is 
only one distinct user. After trying many datasets observed that Hive is doing Type111Users + Typoe123Users = DistinctUsers which is wrong.

hive> select count(distinct a.user) from table_a a;
Gives:
Total MapReduce CPU Time Spent: 4 seconds 350 msec
OK
1






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)