You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by MIS <mi...@gmail.com> on 2012/06/22 14:32:13 UTC
count of distinct FROM multiple columns
Hi All,
I have a table in Hive as below:
dummy {
col1 STRING,
col2 INT,
col3 INT
}
And in that there is some sample data as :
*col1 col2 col3 *
ABC 4 5
XYZ 1 2
ABC 1 3
ABC 5 1
XYZ 3 1
What should be my query so as to get the below result:
*ABC 4
XYZ 3*
Basically I'm trying to get a count of distinct elements from *col2 and
col3 combined* and group them against col1.
I tried with a query as:
select col1, count(distinct col2, col3) from dummy group by col1
But didn't get the expected output.
Can anybody point me in the correct direction and suggest a correct query.
Thanks.
Re: count of distinct FROM multiple columns
Posted by MIS <mi...@gmail.com>.
Thanks for your replies guys.
Of the three solutions, the one from Jan worked. The other two solutions
failed because of parse errors.
@Edward,
I don't think the syntax you suggested exits.
Yes Hive supports Multi Column distinct, But it is actually distinct
*ON*multiple columns and i think not
*FROM* multiple columns.
If anybody wants to have a go at it, below are the table creation steps for
them:
{Just changed the data types of Column2 and Column3 to be String instead of
int}
*CREATE TABLE dummy(column1 STRING, column2 STRING, column3 STRING) ROW
FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES("input.regex" = "([^ ]*) ([^ ]*) ([^ ]*)",
"output.format.string" = "%1$s %2$s %3$s") STORED AS TEXTFILE;*
have the above mentioned data in the dummy.txt file and then load it with:
*LOAD DATA LOCAL INPATH '/tmp/dummy.txt.gz' OVERWRITE INTO TABLE dummy*
Since the actual query will run against roughly 7 million rows, want to
know the best possible query for the same.
Thanks.
On Fri, Jun 22, 2012 at 11:12 PM, Edward Capriolo <ed...@gmail.com>wrote:
> I think your syntax is wrong. Hive should support multi-column
> distinct and at that point counting should work.
>
> You did:
> select col1, count(distinct col2, col3) from dummy group by col1
>
> I think the correct syntax is:
> select col1, count(distinct (col2, col3)) from dummy group by col1
>
> On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <mg...@oanda.com> wrote:
> > A solution that comes to my mind is to use a union. Something like
> (untested):
> >
> > select
> > first,
> > count(distinct second)
> > from
> > (select
> > col1 as first,
> > col2 as second
> > from
> > dummy
> > union all
> > select
> > col1 as first,
> > col3 as second
> > from
> > dummy
> > )t
> > group by
> > first;
> >
> > Mark
> > ----- Original Message -----
> > From: "Jan Dolinár" <do...@gmail.com>
> > To: user@hive.apache.org
> > Sent: Friday, June 22, 2012 8:52:43 AM
> > Subject: Re: count of distinct FROM multiple columns
> >
> >
> > Hi
> >
> >
> > A quick solution that comes first to my mind is to join the columns you
> want to combine into an array and then use the explode UDTF:
> >
> > SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW
> explode(array(col2, col3)) t AS combined GROUP BY col1;
> >
> > Although I believe there might be simpler and/or better solutions.
> >
> >
> > Jan
> >
> >
> >
> > On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote:
> >
> >
> > Hi All,
> >
> > I have a table in Hive as below:
> >
> > dummy {
> > col1 STRING,
> > col2 INT,
> > col3 INT
> > }
> >
> > And in that there is some sample data as :
> >
> > col1 col2 col3
> > ABC 4 5
> > XYZ 1 2
> > ABC 1 3
> > ABC 5 1
> > XYZ 3 1
> >
> > What should be my query so as to get the below result:
> >
> > ABC 4
> > XYZ 3
> >
> > Basically I'm trying to get a count of distinct elements from col2 and
> col3 combined and group them against col1.
> >
> > I tried with a query as:
> >
> > select col1, count(distinct col2, col3) from dummy group by col1
> >
> > But didn't get the expected output.
> >
> > Can anybody point me in the correct direction and suggest a correct
> query.
> >
> > Thanks.
> >
>
Re: count of distinct FROM multiple columns
Posted by Edward Capriolo <ed...@gmail.com>.
I think your syntax is wrong. Hive should support multi-column
distinct and at that point counting should work.
You did:
select col1, count(distinct col2, col3) from dummy group by col1
I think the correct syntax is:
select col1, count(distinct (col2, col3)) from dummy group by col1
On Fri, Jun 22, 2012 at 1:29 PM, Mark Grover <mg...@oanda.com> wrote:
> A solution that comes to my mind is to use a union. Something like (untested):
>
> select
> first,
> count(distinct second)
> from
> (select
> col1 as first,
> col2 as second
> from
> dummy
> union all
> select
> col1 as first,
> col3 as second
> from
> dummy
> )t
> group by
> first;
>
> Mark
> ----- Original Message -----
> From: "Jan Dolinár" <do...@gmail.com>
> To: user@hive.apache.org
> Sent: Friday, June 22, 2012 8:52:43 AM
> Subject: Re: count of distinct FROM multiple columns
>
>
> Hi
>
>
> A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF:
>
> SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1;
>
> Although I believe there might be simpler and/or better solutions.
>
>
> Jan
>
>
>
> On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote:
>
>
> Hi All,
>
> I have a table in Hive as below:
>
> dummy {
> col1 STRING,
> col2 INT,
> col3 INT
> }
>
> And in that there is some sample data as :
>
> col1 col2 col3
> ABC 4 5
> XYZ 1 2
> ABC 1 3
> ABC 5 1
> XYZ 3 1
>
> What should be my query so as to get the below result:
>
> ABC 4
> XYZ 3
>
> Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1.
>
> I tried with a query as:
>
> select col1, count(distinct col2, col3) from dummy group by col1
>
> But didn't get the expected output.
>
> Can anybody point me in the correct direction and suggest a correct query.
>
> Thanks.
>
Re: count of distinct FROM multiple columns
Posted by Mark Grover <mg...@oanda.com>.
A solution that comes to my mind is to use a union. Something like (untested):
select
first,
count(distinct second)
from
(select
col1 as first,
col2 as second
from
dummy
union all
select
col1 as first,
col3 as second
from
dummy
)t
group by
first;
Mark
----- Original Message -----
From: "Jan Dolinár" <do...@gmail.com>
To: user@hive.apache.org
Sent: Friday, June 22, 2012 8:52:43 AM
Subject: Re: count of distinct FROM multiple columns
Hi
A quick solution that comes first to my mind is to join the columns you want to combine into an array and then use the explode UDTF:
SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW explode(array(col2, col3)) t AS combined GROUP BY col1;
Although I believe there might be simpler and/or better solutions.
Jan
On Fri, Jun 22, 2012 at 2:32 PM, MIS < misapache@gmail.com > wrote:
Hi All,
I have a table in Hive as below:
dummy {
col1 STRING,
col2 INT,
col3 INT
}
And in that there is some sample data as :
col1 col2 col3
ABC 4 5
XYZ 1 2
ABC 1 3
ABC 5 1
XYZ 3 1
What should be my query so as to get the below result:
ABC 4
XYZ 3
Basically I'm trying to get a count of distinct elements from col2 and col3 combined and group them against col1.
I tried with a query as:
select col1, count(distinct col2, col3) from dummy group by col1
But didn't get the expected output.
Can anybody point me in the correct direction and suggest a correct query.
Thanks.
Re: count of distinct FROM multiple columns
Posted by Jan Dolinár <do...@gmail.com>.
Hi
A quick solution that comes first to my mind is to join the columns you
want to combine into an array and then use the explode UDTF:
SELECT col1, COUNT(distinct combined) FROM dummy LATERAL VIEW
explode(array(col2, col3)) t AS combined GROUP BY col1;
Although I believe there might be simpler and/or better solutions.
Jan
On Fri, Jun 22, 2012 at 2:32 PM, MIS <mi...@gmail.com> wrote:
> Hi All,
>
> I have a table in Hive as below:
>
> dummy {
> col1 STRING,
> col2 INT,
> col3 INT
> }
>
> And in that there is some sample data as :
>
> *col1 col2 col3 *
> ABC 4 5
> XYZ 1 2
> ABC 1 3
> ABC 5 1
> XYZ 3 1
>
> What should be my query so as to get the below result:
>
> *ABC 4
> XYZ 3*
>
> Basically I'm trying to get a count of distinct elements from *col2 and
> col3 combined* and group them against col1.
>
> I tried with a query as:
>
> select col1, count(distinct col2, col3) from dummy group by col1
>
> But didn't get the expected output.
>
> Can anybody point me in the correct direction and suggest a correct query.
>
> Thanks.
>