You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Mayank Bansal <Ma...@mu-sigma.com> on 2013/11/21 16:13:55 UTC

Difference in number of row observstions from distinct and group by

Hi,

I have a table which has 3 columns combined together to form a primary key. If I do

Select count(distinct col1,col2,col3) from table_name;

And

Select count(a.*) from (select col1,col2,col3,count(*) from table_name group by col1,col2,col3)a ;

While running the first query, the count of rows that I get is 400 less than what I get by the second query.
Can someone please explain to me the difference in number of observations from both the queries?

Thanks
Mayank
This email message may contain proprietary, private and confidential information. The information transmitted is intended only for the person(s) or entities to which it is addressed. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited and may be illegal. If you received this in error, please contact the sender and delete the message from your system. Mu Sigma takes all reasonable steps to ensure that its electronic communications are free from viruses. However, given Internet accessibility, the Company cannot accept liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software.

Re: Difference in number of row observstions from distinct and group by

Posted by David Morel <dm...@gmail.com>.

On 25 Nov 2013, at 9:06, Mayank Bansal wrote:

> Hi,
>
> I was also thinking that this might be the case. For that reason I ran 
> this query
>
> Select * from (select col1,col2,col3,count(*)  as val from table_name
> group by col1,col2,col3)a where a.val>1 ;
>
> The output that I receive from this query is blank, then I ended up 
> doing count(*) and I got the same number of rows as originally in the 
> table. Please help me figure this out.

Instead of going circles, I would put the 2 result sets in 2 tables 
(with a concatenated PK made of your 3 columns, with some separator like 
'-'), and do a left outer join of table 1 on table 2.

you'd be able to identify quickly what went wrong. Sort the result so 
you get unlikely dupes, and all. Just trial and error until you nail it.

David

RE: Difference in number of row observstions from distinct and group by

Posted by Mayank Bansal <Ma...@mu-sigma.com>.

Hi,

I was also thinking that this might be the case. For that reason I ran this query

Select * from (select col1,col2,col3,count(*)  as val from table_name
group by col1,col2,col3)a where a.val>1 ;

The output that I receive from this query is blank, then I ended up doing count(*) and I got the same number of rows as originally in the table. Please help me figure this out.

Thanks,
Mayank




-----Original Message-----
From: Thejas Nair [mailto:thejas@hortonworks.com]
Sent: Friday, November 22, 2013 1:49 AM
To: <us...@hive.apache.org>
Subject: Re: Difference in number of row observstions from distinct and group by

You probably have 400 rows where col1, col2 and col3 have null values.
"count(distinct col1,col2,col3) " will not count those rows.


On Thu, Nov 21, 2013 at 7:13 AM, Mayank Bansal <Ma...@mu-sigma.com>> wrote:
> Hi,
>
>
>
> I have a table which has 3 columns combined together to form a primary key.
> If I do
>
>
>
> Select count(distinct col1,col2,col3) from table_name;
>
>
>
> And
>
>
>
> Select count(a.*) from (select col1,col2,col3,count(*) from table_name
> group by col1,col2,col3)a ;
>
>
>
> While running the first query, the count of rows that I get is 400
> less than what I get by the second query.
>
> Can someone please explain to me the difference in number of
> observations from both the queries?
>
>
>
> Thanks
>
> Mayank
>
> This email message may contain proprietary, private and confidential
> information. The information transmitted is intended only for the
> person(s) or entities to which it is addressed. Any review,
> retransmission, dissemination or other use of, or taking of any action
> in reliance upon, this information by persons or entities other than
> the intended recipient is prohibited and may be illegal. If you
> received this in error, please contact the sender and delete the
> message from your system. Mu Sigma takes all reasonable steps to
> ensure that its electronic communications are free from viruses.
> However, given Internet accessibility, the Company cannot accept
> liability for any virus introduced by this e-mail or any attachment and you are advised to use up-to-date virus checking software.

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.

Disclaimer: http://www.mu-sigma.com/disclaimer.html

Re: Difference in number of row observstions from distinct and group by

Posted by Thejas Nair <th...@hortonworks.com>.

You probably have 400 rows where col1, col2 and col3 have null values.
"count(distinct col1,col2,col3) " will not count those rows.


On Thu, Nov 21, 2013 at 7:13 AM, Mayank Bansal
<Ma...@mu-sigma.com> wrote:
> Hi,
>
>
>
> I have a table which has 3 columns combined together to form a primary key.
> If I do
>
>
>
> Select count(distinct col1,col2,col3) from table_name;
>
>
>
> And
>
>
>
> Select count(a.*) from (select col1,col2,col3,count(*) from table_name group
> by col1,col2,col3)a ;
>
>
>
> While running the first query, the count of rows that I get is 400 less than
> what I get by the second query.
>
> Can someone please explain to me the difference in number of observations
> from both the queries?
>
>
>
> Thanks
>
> Mayank
>
> This email message may contain proprietary, private and confidential
> information. The information transmitted is intended only for the person(s)
> or entities to which it is addressed. Any review, retransmission,
> dissemination or other use of, or taking of any action in reliance upon,
> this information by persons or entities other than the intended recipient is
> prohibited and may be illegal. If you received this in error, please contact
> the sender and delete the message from your system. Mu Sigma takes all
> reasonable steps to ensure that its electronic communications are free from
> viruses. However, given Internet accessibility, the Company cannot accept
> liability for any virus introduced by this e-mail or any attachment and you
> are advised to use up-to-date virus checking software.

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.