You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Rakesh Setty <se...@yahoo-inc.com> on 2009/07/02 21:37:00 UTC

distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Hi Namit,

This is the plan generated.

hive> explain extended
    >
    > select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM (TOK_TABREF page_views pv)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF user) username)))) (TOK_QUERY (TOK_FROM (TOK_TABREF users u)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF name) username))))) ur)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECTDI (TOK_SELEXPR (TOK_COLREF username)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery2:ur-subquery2:u
            Select Operator
              expressions:
                    expr: name
                    type: string
              Select Operator
                expressions:
                      expr: 0
                      type: string
                Union
                  Group By Operator
                    keys:
                          expr: 0
                          type: string
                    mode: hash
                    Reduce Output Operator
                      key expressions:
                            expr: 0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: 0
                            type: string
                      tag: -1
        null-subquery1:ur-subquery1:pv
            Select Operator
              expressions:
                    expr: user
                    type: string
              Select Operator
                expressions:
                      expr: 0
                      type: string
                Union
                  Group By Operator
                    keys:
                          expr: 0
                          type: string
                    mode: hash
                    Reduce Output Operator
                      key expressions:
                            expr: 0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: 0
                            type: string
                      tag: -1
      Needs Tagging: false
      Path -> Alias:
        /user/serakesh/users_hive
        /user/serakesh/page_views_hive
      Path -> Partition:
        /user/serakesh/users_hive
          Partition

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
              properties:
                line.delim

                name users
                field.delim
                columns.types string:string:string:string:string:int
                serialization.ddl struct users { string name, string phone, string address, string city, string state, i32 zip}
                columns name,phone,address,city,state,zip
                serialization.format
                bucket_count -1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                location /user/serakesh/users_hive
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: users
        /user/serakesh/page_views_hive
          Partition

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
              properties:
                line.delim

                name page_views
                field.delim
                columns.types string:int:int:string:bigint:bigint:double
                serialization.ddl struct page_views { string user, i32 action, i32 timespent, string query_term, i64 ip_addr, i64 time_stamp, double estimated_revenue}
                columns user,action,timespent,query_term,ip_addr,time_stamp,estimated_revenue
                serialization.format
                bucket_count -1
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                location /user/serakesh/page_views_hive
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: page_views
      Reduce Operator Tree:
        Group By Operator
          keys:
                expr: KEY.0
                type: string
          mode: mergepartial
          Select Operator
            expressions:
                  expr: 0
                  type: string
            File Output Operator
              compressed: false
              GlobalTableId: 0
              directory: /tmp/hive-serakesh/135690254.10001.insclause-0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                  properties:
                    columns username
                    serialization.format 1

  Stage: Stage-0
    Fetch Operator
      limit: -1


Thanks,
Rakesh
________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 2:24 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Can you do :

explain extended
select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;


and send the plan ?




From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 1:54 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Can you do :

explain extended
select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;


and send the plan ?




From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 1:54 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Your plan looks OK.
As Amr said, try to find the offending users.


From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Can you tell the exact command to generate the data ?
Detailed instructions.

Do I need to install pig, or is the standalone perl file good enough ?

Thanks,
-namit


[njain@dev029 ~/pigmix]$ perl generate_data.pl
Usage: generate_data.pl filetype numrows tablename [nosql]
        Valid filetypes [studenttab, studentcolon, studentusrdef,
                votertab, reg1459894, textdoc, unicode]
Died at generate_data.pl line 145.
[njain@dev029 ~/pigmix]$



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Tuesday, July 07, 2009 1:23 PM
To: Namit Jain
Subject: RE: distinct with union all

Hi Namit,

            The query that I am working on is test L11 in PigMix converted to Hive. To generate the data, you can use the patch attached to https://issues.apache.org/jira/browse/PIG-200.

The exact Hive query I am using is

insert overwrite table L11out
select distinct username from
(select field1 as username from widerow w
union all select user as username from page_views pv) ur;

            The output says 1642777 rows are loaded. If I do

select count(distinct user) from L11out

I get 652384 as the output.

Please let me know if you have any other questions.

Thanks,
Rakesh




________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Tuesday, July 07, 2009 12:01 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Rakesh,

Did the queries work for you ?

Thanks,
-namit

From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 2:22 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Similar queries seem to be working for me.


1.       Which version of hive are using ?

2.       Would it be possible for you to ship the data to us ?


Thanks,
-namit

From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:36 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Following are the counts

select count(distinct user) from Tb1;
976272

select count(1) from Tb1;
1642777

select count(distinct user) from Tb1Debug;
653824

select count(1) from Tb1Debug;
1095933

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 1:29 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Can you send the following counts:

select count(distinct user) from Tb1;
select count(1) from Tb1;
select count(distinct user) from Tb1Debug;
select count(1) from Tb1Debug;



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:20 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Rakesh,

Did the queries work for you ?

Thanks,
-namit

From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 2:22 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Similar queries seem to be working for me.


1.       Which version of hive are using ?

2.       Would it be possible for you to ship the data to us ?


Thanks,
-namit

From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:36 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Following are the counts

select count(distinct user) from Tb1;
976272

select count(1) from Tb1;
1642777

select count(distinct user) from Tb1Debug;
653824

select count(1) from Tb1Debug;
1095933

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 1:29 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Can you send the following counts:

select count(distinct user) from Tb1;
select count(1) from Tb1;
select count(distinct user) from Tb1Debug;
select count(1) from Tb1Debug;



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:20 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Similar queries seem to be working for me.


1.       Which version of hive are using ?

2.       Would it be possible for you to ship the data to us ?


Thanks,
-namit

From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:36 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Following are the counts

select count(distinct user) from Tb1;
976272

select count(1) from Tb1;
1642777

select count(distinct user) from Tb1Debug;
653824

select count(1) from Tb1Debug;
1095933

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 1:29 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Can you send the following counts:

select count(distinct user) from Tb1;
select count(1) from Tb1;
select count(distinct user) from Tb1Debug;
select count(1) from Tb1Debug;



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:20 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Following are the counts

select count(distinct user) from Tb1;
976272

select count(1) from Tb1;
1642777

select count(distinct user) from Tb1Debug;
653824

select count(1) from Tb1Debug;
1095933

Thanks,
Rakesh

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 1:29 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Can you send the following counts:

select count(distinct user) from Tb1;
select count(1) from Tb1;
select count(distinct user) from Tb1Debug;
select count(1) from Tb1Debug;



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:20 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Can you send the following counts:

select count(distinct user) from Tb1;
select count(1) from Tb1;
select count(distinct user) from Tb1Debug;
select count(1) from Tb1Debug;



From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 1:20 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Hi,

>From the counts, I could see that grouping was happening. The output is too big, so I am attaching just partial output here

AAAGd`RH_HaA`AZNgA_WDeq 2
AAAZTMIxhpSjx[Oyb       1
AABIxQR_Un]DLGXNwla\`o  3
AAB[_pudLZeuOe  7
AAClxPCotyLn\qC 2
AAEdZeHVNj\[XECl^Amin   3
AAEkewjwSCDfCgmPbxkQEG` 2
AAFQxFnlnImcPKpvQMM     1
AAGbQojol_EXcPk 1
AAHPZBe^Q`UeMBmlhhcAVV  3
AAHs]A_gvZxfKKCI^\      2
AAIAGwnFPsrGFxaHYgOAiCHe        2
AAIjiO]r`rkrobp_xRn     1
AAIwuSGOmu_L`YGCoHrmvub`        1
AAJG[RPx\RvEYljsBgDdR   3
AAJ^BFCNpTXt]wOEir[     1
AAJvlnBfYXNnXulGOHlAeW  2
AAKwheCincrxm_jkbm      2

The number of rows written is varying between the queries

insert overwrite table Tbl1 select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

and

insert overwrite table Tbl1Debug select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

The query

select count(distinct user) from Tb1;

returns different value compared to the number of rows created by the first query. It is also different from the query

select count(distinct user) from Tb1Debug;

@Ashish - I had attached the explain plan output in an earlier reply which Namit said looks fine.

Please let me know what could be the problem.

Thanks,
Rakesh


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Ashish Thusoo <at...@facebook.com>.
Can you also send the explain plan outputs for the distinct query?

Ashish

________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Monday, July 06, 2009 10:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Nothing special for these characters.

select  username, count(1) from (select user as username from page_views pv union all select name as username from users u) ur group by username;

should return the counts.

When you see the duplicate usernames, can you send them -


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Monday, July 06, 2009 9:45 AM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Hi Amr,

            The group by clause that you suggest shows that grouping is happening as the count column suggests. Perhaps it is not happening across the tables?
            I see characters like \, [, _, ^, etc in username. Are these considered as special characters? If so, is there any workaround to deal with this?

Thanks,
Rakesh

________________________________
From: Amr Awadallah [mailto:aaa@cloudera.com]
Sent: Thursday, July 02, 2009 5:53 PM
To: hive-user@hadoop.apache.org
Subject: Re: distinct with union all


make sure you don't have any leading or trailing spaces (or special characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

Re: distinct with union all

Posted by Amr Awadallah <aa...@cloudera.com>.
make sure you don't have any leading or trailing spaces (or special 
characters) for the usernames being extracted.

also to debug, try to do a select username, count(1) then group by username

-- amr

Rakesh Setty wrote:
>
> Yes, I am getting duplicate usernames.
>
>  
>
>  
>
> ------------------------------------------------------------------------
>
> *From:* Namit Jain [mailto:njain@facebook.com]
> *Sent:* Thursday, July 02, 2009 1:25 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* RE: distinct with union all
>
>  
>
> Are you getting duplicate usernames ?
>
>  
>
>  
>
> *From:* Rakesh Setty [mailto:serakesh@yahoo-inc.com]
> *Sent:* Thursday, July 02, 2009 12:37 PM
> *To:* hive-user@hadoop.apache.org
> *Subject:* distinct with union all
>
>  
>
> Hi,
>
>  
>
> I have a query like
>
>  
>
> select distinct username from (select user as username from page_views 
> pv union all select name as username from users u) ur;
>
>  
>
> But I see that result is not actually distinct. Am I missing something 
> here?
>
>  
>
> Thanks,
>
> Rakesh
>

RE: distinct with union all

Posted by Rakesh Setty <se...@yahoo-inc.com>.
Yes, I am getting duplicate usernames.


________________________________
From: Namit Jain [mailto:njain@facebook.com]
Sent: Thursday, July 02, 2009 1:25 PM
To: hive-user@hadoop.apache.org
Subject: RE: distinct with union all

Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh

RE: distinct with union all

Posted by Namit Jain <nj...@facebook.com>.
Are you getting duplicate usernames ?


From: Rakesh Setty [mailto:serakesh@yahoo-inc.com]
Sent: Thursday, July 02, 2009 12:37 PM
To: hive-user@hadoop.apache.org
Subject: distinct with union all

Hi,

I have a query like

select distinct username from (select user as username from page_views pv union all select name as username from users u) ur;

But I see that result is not actually distinct. Am I missing something here?

Thanks,
Rakesh