You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by 周梦想 <ab...@gmail.com> on 2013/03/27 04:34:01 UTC

how to make data statistics efficiency in hive?

hello,
about hsql statistics.

table mytable
date,uid,a,b,c
--------------------
03/13/13       185690475      0       1       1
03/13/13       187270278      0       1       0
03/13/13       185690475      1       1       0
03/13/13       186012530      1       0       1
03/13/13       180286243      0       1       0
03/13/13       185690475      1       1       0
03/13/13       186012530      0       1       0
03/13/13       183256782      1       0       0
03/14/13       185690475      0       0       1

I want to get one day,each user total count,count a=1 ,count b=1, count c=1
the out put should like:

key,total, counta, countb, countc
-----------------------
03/13/13:185690475      3      2       3       1
03/13/13:187270278      1      0       1       0
03/13/13:186012530      2      1       1       1
03/13/13:180286243      1      0       1       0
03/13/13:183256782      1      1       0       0
03/14/13:185690475      1      0       0       1

the hsql i want is:
select concat(date:uid),count(1),count(a=1),count(b=1),count(c=1) from
mytable group by uid,date;

but I have to write ugly and inefficiency hsql like :
select  concat(s1.date:s1.uid),s1.total,s2.ca,s3.cb,s4.cc from
(
select date,uid,count(1) total total from mytable group by uid,date) s1
inner outer join
(select date,uid,count(1) ca total from mytable where a=1 group by
uid,date)s2
inner outer join
(select date,uid,count(1) cb total from mytable where b=1 group by
uid,date)s3
inner outer join
(select date,uid,count(1) cc total from mytable where c=1 group by
uid,date)s4
);

each select sub-clause should run a map-reduce.

if I have to count a very big number of  columns table, this should be a
very long task.
some one have any good ideals?

Thank you!

Best Regards,
Andy Zhou

Re: how to make data statistics efficiency in hive?

Posted by 周梦想 <ab...@gmail.com>.
Hi Jan,

You are so great! This expression is just I'm looking for.

Thank you very much!

Best Regards,
Andy Zhou

2013/3/27 Jan Dolinár <do...@gmail.com>

> Hi Andy,
>
> I'm not sure if I entirely understood your question, but I think you're
> looking for something like this:
>
> select
>     concat(date,':',uid),
>     sum(1) as total,
>     sum(if(a=1,1,0)) AS ca,
>     sum(if(b=1,1,0)) AS cb,
>     sum(if(c=1,1,0)) AS cc
> from mytable
> group by uid, date;
>
> Query like this allows you to efficiently compute sums of multiple columns
> in a single map-reduce.
>
> Best regards,
> Jan
>
>
> On Wed, Mar 27, 2013 at 4:34 AM, 周梦想 <ab...@gmail.com> wrote:
>
>> hello,
>> about hsql statistics.
>>
>> table mytable
>> date,uid,a,b,c
>> --------------------
>> 03/13/13       185690475      0       1       1
>> 03/13/13       187270278      0       1       0
>> 03/13/13       185690475      1       1       0
>> 03/13/13       186012530      1       0       1
>> 03/13/13       180286243      0       1       0
>> 03/13/13       185690475      1       1       0
>> 03/13/13       186012530      0       1       0
>> 03/13/13       183256782      1       0       0
>> 03/14/13       185690475      0       0       1
>>
>> I want to get one day,each user total count,count a=1 ,count b=1, count
>> c=1
>> the out put should like:
>>
>> key,total, counta, countb, countc
>> -----------------------
>> 03/13/13:185690475      3      2       3       1
>> 03/13/13:187270278      1      0       1       0
>> 03/13/13:186012530      2      1       1       1
>> 03/13/13:180286243      1      0       1       0
>> 03/13/13:183256782      1      1       0       0
>> 03/14/13:185690475      1      0       0       1
>>
>> the hsql i want is:
>> select concat(date:uid),count(1),count(a=1),count(b=1),count(c=1) from
>> mytable group by uid,date;
>>
>> but I have to write ugly and inefficiency hsql like :
>> select  concat(s1.date:s1.uid),s1.total,s2.ca,s3.cb,s4.cc from
>> (
>> select date,uid,count(1) total total from mytable group by uid,date) s1
>> inner outer join
>> (select date,uid,count(1) ca total from mytable where a=1 group by
>> uid,date)s2
>> inner outer join
>> (select date,uid,count(1) cb total from mytable where b=1 group by
>> uid,date)s3
>> inner outer join
>> (select date,uid,count(1) cc total from mytable where c=1 group by
>> uid,date)s4
>> );
>>
>> each select sub-clause should run a map-reduce.
>>
>> if I have to count a very big number of  columns table, this should be a
>> very long task.
>> some one have any good ideals?
>>
>> Thank you!
>>
>> Best Regards,
>> Andy Zhou
>>
>
>

Re: how to make data statistics efficiency in hive?

Posted by Jan Dolinár <do...@gmail.com>.
Hi Andy,

I'm not sure if I entirely understood your question, but I think you're
looking for something like this:

select
    concat(date,':',uid),
    sum(1) as total,
    sum(if(a=1,1,0)) AS ca,
    sum(if(b=1,1,0)) AS cb,
    sum(if(c=1,1,0)) AS cc
from mytable
group by uid, date;

Query like this allows you to efficiently compute sums of multiple columns
in a single map-reduce.

Best regards,
Jan


On Wed, Mar 27, 2013 at 4:34 AM, 周梦想 <ab...@gmail.com> wrote:

> hello,
> about hsql statistics.
>
> table mytable
> date,uid,a,b,c
> --------------------
> 03/13/13       185690475      0       1       1
> 03/13/13       187270278      0       1       0
> 03/13/13       185690475      1       1       0
> 03/13/13       186012530      1       0       1
> 03/13/13       180286243      0       1       0
> 03/13/13       185690475      1       1       0
> 03/13/13       186012530      0       1       0
> 03/13/13       183256782      1       0       0
> 03/14/13       185690475      0       0       1
>
> I want to get one day,each user total count,count a=1 ,count b=1, count c=1
> the out put should like:
>
> key,total, counta, countb, countc
> -----------------------
> 03/13/13:185690475      3      2       3       1
> 03/13/13:187270278      1      0       1       0
> 03/13/13:186012530      2      1       1       1
> 03/13/13:180286243      1      0       1       0
> 03/13/13:183256782      1      1       0       0
> 03/14/13:185690475      1      0       0       1
>
> the hsql i want is:
> select concat(date:uid),count(1),count(a=1),count(b=1),count(c=1) from
> mytable group by uid,date;
>
> but I have to write ugly and inefficiency hsql like :
> select  concat(s1.date:s1.uid),s1.total,s2.ca,s3.cb,s4.cc from
> (
> select date,uid,count(1) total total from mytable group by uid,date) s1
> inner outer join
> (select date,uid,count(1) ca total from mytable where a=1 group by
> uid,date)s2
> inner outer join
> (select date,uid,count(1) cb total from mytable where b=1 group by
> uid,date)s3
> inner outer join
> (select date,uid,count(1) cc total from mytable where c=1 group by
> uid,date)s4
> );
>
> each select sub-clause should run a map-reduce.
>
> if I have to count a very big number of  columns table, this should be a
> very long task.
> some one have any good ideals?
>
> Thank you!
>
> Best Regards,
> Andy Zhou
>