You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Awhan Patnaik <aw...@spotzot.com> on 2016/03/07 18:04:25 UTC

count(*) not allowed in order by

I have to take the first 25 IDs ranked by count(*). But the following is
not allowed in Hive

select id from T order by count(*) desc limit 25;

Which yields a "NOt yet supported place for UDAF count". The way around it
is the following

select id, count(*) as cnt from T group by id order by cnt desc limit 25;

However I need to put this query in a subquery like so

select id, XXX from T t join .... where t.id in (select id, count(*) as cnt
from T group by id order by cnt desc limit 25) group by ... ;

which does NOT work because a subquery is allowed to return only one thing.
Here XXX are complex constructs like distance calculation and binning.
These are time consuming and complex operations.

The only way I see this would work is if I use the following sub-sub query

select id, XXX from T t join ... where t.id in (select sub.id from (select
id, count(*) as cnt from T group by id order by cnt desc limit 25)sub)
group by ... ;

The reason I don't want to put the limit in the outermost query is because
those XXX queries are expensive and I don't want them to be performed on
the entire result only to retain the top 25 and throw away the rest of the
results. The count(*) operation of the rest of the IDs is not very
expensive or time consuming.

Is there some other elegant way of handling this without using a
sub-sub-query approach?

Re: count(*) not allowed in order by

Posted by Devopam Mittra <de...@gmail.com>.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

This should help you , try rank/ dense rank as appropriate and mold it to
best use for yourself
Regards
Dev
On Mar 7, 2016 10:35 PM, "Awhan Patnaik" <aw...@spotzot.com> wrote:

> I have to take the first 25 IDs ranked by count(*). But the following is
> not allowed in Hive
>
> select id from T order by count(*) desc limit 25;
>
> Which yields a "NOt yet supported place for UDAF count". The way around it
> is the following
>
> select id, count(*) as cnt from T group by id order by cnt desc limit 25;
>
> However I need to put this query in a subquery like so
>
> select id, XXX from T t join .... where t.id in (select id, count(*) as
> cnt from T group by id order by cnt desc limit 25) group by ... ;
>
> which does NOT work because a subquery is allowed to return only one
> thing. Here XXX are complex constructs like distance calculation and
> binning. These are time consuming and complex operations.
>
> The only way I see this would work is if I use the following sub-sub query
>
> select id, XXX from T t join ... where t.id in (select sub.id from
> (select id, count(*) as cnt from T group by id order by cnt desc limit
> 25)sub) group by ... ;
>
> The reason I don't want to put the limit in the outermost query is because
> those XXX queries are expensive and I don't want them to be performed on
> the entire result only to retain the top 25 and throw away the rest of the
> results. The count(*) operation of the rest of the IDs is not very
> expensive or time consuming.
>
> Is there some other elegant way of handling this without using a
> sub-sub-query approach?
>

Re: count(*) not allowed in order by

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

You arte looking at the top 25 of result set so you will have to get full
result set before looking at top 25

Something like this

 select rs.prod_id, rs.score from
 (
   prod_id, count(prod_id) AS Score from sales GROUP BY prod_id ORDER BY
Score DESC
 )rs
 LIMIT 25;

HTH

Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 7 March 2016 at 17:04, Awhan Patnaik <aw...@spotzot.com> wrote:

> I have to take the first 25 IDs ranked by count(*). But the following is
> not allowed in Hive
>
> select id from T order by count(*) desc limit 25;
>
> Which yields a "NOt yet supported place for UDAF count". The way around it
> is the following
>
> select id, count(*) as cnt from T group by id order by cnt desc limit 25;
>
> However I need to put this query in a subquery like so
>
> select id, XXX from T t join .... where t.id in (select id, count(*) as
> cnt from T group by id order by cnt desc limit 25) group by ... ;
>
> which does NOT work because a subquery is allowed to return only one
> thing. Here XXX are complex constructs like distance calculation and
> binning. These are time consuming and complex operations.
>
> The only way I see this would work is if I use the following sub-sub query
>
> select id, XXX from T t join ... where t.id in (select sub.id from
> (select id, count(*) as cnt from T group by id order by cnt desc limit
> 25)sub) group by ... ;
>
> The reason I don't want to put the limit in the outermost query is because
> those XXX queries are expensive and I don't want them to be performed on
> the entire result only to retain the top 25 and throw away the rest of the
> results. The count(*) operation of the rest of the IDs is not very
> expensive or time consuming.
>
> Is there some other elegant way of handling this without using a
> sub-sub-query approach?
>