You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by MIS <mi...@gmail.com> on 2012/06/22 14:32:13 UTC

count of distinct FROM multiple columns

Hi All,

I have a table in Hive as below:

dummy {
    col1 STRING,
    col2 INT,
    col3 INT
}

And in that there is some sample data as :

*col1  col2 col3 *
ABC    4    5
XYZ    1    2
ABC    1    3
ABC    5    1
XYZ    3    1

What should be my query so as to get the below result:

*ABC 4
XYZ 3*

Basically I'm trying to get a count of distinct elements from *col2 and
col3 combined* and group them against col1.

I tried with a query as:

select col1, count(distinct col2, col3) from dummy group by col1

But didn't get the expected output.

Can anybody point me in the correct direction and suggest a correct query.

Thanks.

Re: count of distinct FROM multiple columns

Posted by MIS <mi...@gmail.com>.

Thanks for your replies guys.
Of the three solutions, the one from Jan worked. The other two solutions
failed because of parse errors.

@Edward,
I don't think the syntax you suggested exits.
Yes Hive supports Multi Column distinct, But it is actually distinct
*ON*multiple columns and i think not
*FROM* multiple columns.

If anybody wants to have a go at it, below are the table creation steps for
them:
{Just changed the data types of Column2 and Column3 to be String instead of
int}

*CREATE TABLE dummy(column1 STRING, column2 STRING, column3 STRING) ROW
FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)",
"output.format.string" = "%1$s %2$s %3$s") STORED AS TEXTFILE;*

have the above mentioned data in the dummy.txt file and then load it with:

*LOAD DATA LOCAL INPATH '/tmp/dummy.txt.gz' OVERWRITE INTO TABLE dummy*

Since the actual query will run against roughly 7 million rows, want to
know the best possible query for the same.

Thanks.

On Fri, Jun 22, 2012 at 11:12 PM, Edward Capriolo <ed...@gmail.com>wrote:

> I think your syntax is wrong. Hive should support multi-column
> distinct and at that point counting should work.
>
> You did:
> select col1, count(distinct col2, col3) from dummy group by col1
>
> I think the correct syntax is:
> select col1, count(distinct (col2, col3)) from dummy group by col1
>
> On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <mg...@oanda.com> wrote:
> > A solution that comes to my mind is to use a union. Something like
> (untested):
> >
> > select
> >   first,
> >   count(distinct second)
> > from
> >   (select
> >      col1 as first,
> >      col2 as second
> >   from
> >      dummy
> >   union all
> >   select
> >      col1 as first,
> >      col3 as second
> >   from
> >      dummy
> >   )t
> > group by
> >   first;
> >
> > Mark
> > ----- Original Message -----
> > From: "Jan Dolinár" <do...@gmail.com>
> > To: user@hive.apache.org
> > Sent: Friday, June 22, 2012 8:52:43 AM
> > Subject: Re: count of distinct FROM multiple columns
> >
> >
> > Hi
> >
> >
> > A quick solution that comes first to my mind is to join the columns you
> want to combine into an array and then use the explode UDTF:
> >
> > SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW
> explode(array(col2, col3)) t AS combined GROUP BY col1;
> >
> > Although I believe there might be simpler and/or better solutions.
> >
> >
> > Jan
> >
> >
> >
> > On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote:
> >
> >
> > Hi All,
> >
> > I have a table in Hive as below:
> >
> > dummy {
> > col1 STRING,
> > col2 INT,
> > col3 INT
> > }
> >
> > And in that there is some sample data as :
> >
> > col1 col2 col3
> > ABC 4 5
> > XYZ 1 2
> > ABC 1 3
> > ABC 5 1
> > XYZ 3 1
> >
> > What should be my query so as to get the below result:
> >
> > ABC 4
> > XYZ 3
> >
> > Basically I'm trying to get a count of distinct elements from col2 and
> col3 combined and group them against col1.
> >
> > I tried with a query as:
> >
> > select col1, count(distinct col2, col3) from dummy group by col1
> >
> > But didn't get the expected output.
> >
> > Can anybody point me in the correct direction and suggest a correct
> query.
> >
> > Thanks.
> >
>

Re: count of distinct FROM multiple columns

Posted by Edward Capriolo <ed...@gmail.com>.

I think your syntax is wrong. Hive should support multi-column
distinct and at that point counting should work.

You did:
select col1, count(distinct col2, col3) from dummy group by col1

I think the correct syntax is:
select col1, count(distinct (col2, col3)) from dummy group by col1

On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <mg...@oanda.com> wrote:
> A solution that comes to my mind is to use a union. Something like (untested):
>
> select
>   first,
>   count(distinct second)
> from
>   (select
>      col1 as first,
>      col2 as second
>   from
>      dummy
>   union all
>   select
>      col1 as first,
>      col3 as second
>   from
>      dummy
>   )t
> group by
>   first;
>
> Mark
> ----- Original Message -----
> From: "Jan Dolinár" <do...@gmail.com>
> To: user@hive.apache.org
> Sent: Friday, June 22, 2012 8:52:43 AM
> Subject: Re: count of distinct FROM multiple columns
>
>
> Hi
>
>
> A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF:
>
> SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1;
>
> Although I believe there might be simpler and/or better solutions.
>
>
> Jan
>
>
>
> On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote:
>
>
> Hi All,
>
> I have a table in Hive as below:
>
> dummy {
> col1 STRING,
> col2 INT,
> col3 INT
> }
>
> And in that there is some sample data as :
>
> col1 col2 col3
> ABC 4 5
> XYZ 1 2
> ABC 1 3
> ABC 5 1
> XYZ 3 1
>
> What should be my query so as to get the below result:
>
> ABC 4
> XYZ 3
>
> Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1.
>
> I tried with a query as:
>
> select col1, count(distinct col2, col3) from dummy group by col1
>
> But didn't get the expected output.
>
> Can anybody point me in the correct direction and suggest a correct query.
>
> Thanks.
>

Re: count of distinct FROM multiple columns

Posted by Mark Grover <mg...@oanda.com>.

A solution that comes to my mind is to use a union. Something like (untested):

select
   first,
   count(distinct second)
from
   (select
      col1 as first,
      col2 as second
   from
      dummy
   union all
   select 
      col1 as first,
      col3 as second
   from
      dummy
   )t
group by
   first;

Mark
----- Original Message -----
From: "Jan Dolinár" <do...@gmail.com>
To: user@hive.apache.org
Sent: Friday, June 22, 2012 8:52:43 AM
Subject: Re: count of distinct FROM multiple columns


Hi 


A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF: 

SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1; 

Although I believe there might be simpler and/or better solutions. 


Jan 



On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote: 


Hi All, 

I have a table in Hive as below: 

dummy { 
col1 STRING, 
col2 INT, 
col3 INT 
} 

And in that there is some sample data as : 

col1 col2 col3 
ABC 4 5 
XYZ 1 2 
ABC 1 3 
ABC 5 1 
XYZ 3 1 

What should be my query so as to get the below result: 

ABC 4 
XYZ 3 

Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1. 

I tried with a query as: 

select col1, count(distinct col2, col3) from dummy group by col1 

But didn't get the expected output. 

Can anybody point me in the correct direction and suggest a correct query. 

Thanks.

Re: count of distinct FROM multiple columns

Posted by Jan Dolinár <do...@gmail.com>.

Hi

A quick solution that comes first to my mind is to join the columns you
want to combine into an array and then use the explode UDTF:

SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW
explode(array(col2, col3)) t AS combined GROUP BY col1;

Although I believe there might be simpler and/or better solutions.

Jan


On Fri, Jun 22, 2012 at 2:32 PM, MIS <mi...@gmail.com> wrote:

> Hi All,
>
> I have a table in Hive as below:
>
> dummy {
>     col1 STRING,
>     col2 INT,
>     col3 INT
> }
>
> And in that there is some sample data as :
>
> *col1  col2 col3 *
> ABC    4    5
> XYZ    1    2
> ABC    1    3
> ABC    5    1
> XYZ    3    1
>
> What should be my query so as to get the below result:
>
> *ABC 4
> XYZ 3*
>
> Basically I'm trying to get a count of distinct elements from *col2 and
> col3 combined* and group them against col1.
>
> I tried with a query as:
>
> select col1, count(distinct col2, col3) from dummy group by col1
>
> But didn't get the expected output.
>
> Can anybody point me in the correct direction and suggest a correct query.
>
> Thanks.
>