You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by "Michael E. Driscoll" <m....@gmail.com> on 2009/07/03 11:35:15 UTC

aggregations over multiple columns?

Hi HIVErs,

I'm trying to perform the following aggregation query in HIVE, which
finds the largest purchase for all combinations of customer and store:

  SELECT customer, store, max(purchasePrice)
  FROM transactions
  GROUP BY customer, store

If aggregation over multiple columns is not currently supported, how
might I reformulate this to work in HIVE, possibly via a simpler
series of queries?

(I will post the exact error and reproducible code if it turns out
this query is valid).

regards,

Mike

b: www.dataspora.com/blog
t: www.twitter.com/dataspora

RE: aggregations over multiple columns?

Posted by Ashish Thusoo <at...@facebook.com>.
Hi Michael,

This is supported and should work.

Ashish 

-----Original Message-----
From: Michael E. Driscoll [mailto:m.e.driscoll@gmail.com] 
Sent: Friday, July 03, 2009 2:35 AM
To: hive-user@hadoop.apache.org
Subject: aggregations over multiple columns?

Hi HIVErs,

I'm trying to perform the following aggregation query in HIVE, which finds the largest purchase for all combinations of customer and store:

  SELECT customer, store, max(purchasePrice)
  FROM transactions
  GROUP BY customer, store

If aggregation over multiple columns is not currently supported, how might I reformulate this to work in HIVE, possibly via a simpler series of queries?

(I will post the exact error and reproducible code if it turns out this query is valid).

regards,

Mike

b: www.dataspora.com/blog
t: www.twitter.com/dataspora

Re: aggregations over multiple columns?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Avram: see https://issues.apache.org/jira/browse/HIVE-474.

On Mon, Jul 6, 2009 at 9:17 AM, Avram Aelony <Av...@eharmony.com>wrote:

>
> The documentation appears to state the following:
>
> "Multiple aggregations can be done at the same time, however, no two
> aggregations can have different DISTINCT columns .e.g while the following is
> possible "
>
> "  INSERT OVERWRITE TABLE pv_gender_agg
>  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1),
> sum(DISTINCT pv_users.userid)
>  FROM pv_users
>  GROUP BY pv_users.gender;
> However, the following query is not allowed. We don't allow multiple
> DISTINCT expressions in the same query.
>
>  INSERT OVERWRITE TABLE pv_gender_agg
>  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT
> pv_users.ip)
>  FROM pv_users
>  GROUP BY pv_users.gender;"
>
>
> http://wiki.apache.org/hadoop/Hive/LanguageManual/GroupBy
>
>
> Is there an effort underway to allow multiple DISTINCT expressions in the
> same query in the (near) future as well?
>
> Thanks & regards,
> Avram
>
>
>
>
> -----Original Message-----
> From: Amr Awadallah [mailto:aaa@cloudera.com]
> Sent: Saturday, July 04, 2009 12:02 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: aggregations over multiple columns?
>
> Mike,
>
>  This is a valid query, group by over multiple columns works in hive.
>
> -- amr
>
> Michael E. Driscoll wrote:
> > Hi HIVErs,
> >
> > I'm trying to perform the following aggregation query in HIVE, which
> > finds the largest purchase for all combinations of customer and store:
> >
> >   SELECT customer, store, max(purchasePrice)
> >   FROM transactions
> >   GROUP BY customer, store
> >
> > If aggregation over multiple columns is not currently supported, how
> > might I reformulate this to work in HIVE, possibly via a simpler
> > series of queries?
> >
> > (I will post the exact error and reproducible code if it turns out
> > this query is valid).
> >
> > regards,
> >
> > Mike
> >
> > b: www.dataspora.com/blog
> > t: www.twitter.com/dataspora
> >
>

RE: aggregations over multiple columns?

Posted by Avram Aelony <Av...@eharmony.com>.
The documentation appears to state the following:

"Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns .e.g while the following is possible "

"  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(1), sum(DISTINCT pv_users.userid)
  FROM pv_users
  GROUP BY pv_users.gender;
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.

  INSERT OVERWRITE TABLE pv_gender_agg
  SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
  FROM pv_users
  GROUP BY pv_users.gender;"


http://wiki.apache.org/hadoop/Hive/LanguageManual/GroupBy


Is there an effort underway to allow multiple DISTINCT expressions in the same query in the (near) future as well?

Thanks & regards,
Avram




-----Original Message-----
From: Amr Awadallah [mailto:aaa@cloudera.com] 
Sent: Saturday, July 04, 2009 12:02 AM
To: hive-user@hadoop.apache.org
Subject: Re: aggregations over multiple columns?

Mike,

  This is a valid query, group by over multiple columns works in hive.

-- amr

Michael E. Driscoll wrote:
> Hi HIVErs,
>
> I'm trying to perform the following aggregation query in HIVE, which
> finds the largest purchase for all combinations of customer and store:
>
>   SELECT customer, store, max(purchasePrice)
>   FROM transactions
>   GROUP BY customer, store
>
> If aggregation over multiple columns is not currently supported, how
> might I reformulate this to work in HIVE, possibly via a simpler
> series of queries?
>
> (I will post the exact error and reproducible code if it turns out
> this query is valid).
>
> regards,
>
> Mike
>
> b: www.dataspora.com/blog
> t: www.twitter.com/dataspora
>   

Re: aggregations over multiple columns?

Posted by Amr Awadallah <aa...@cloudera.com>.
Mike,

  This is a valid query, group by over multiple columns works in hive.

-- amr

Michael E. Driscoll wrote:
> Hi HIVErs,
>
> I'm trying to perform the following aggregation query in HIVE, which
> finds the largest purchase for all combinations of customer and store:
>
>   SELECT customer, store, max(purchasePrice)
>   FROM transactions
>   GROUP BY customer, store
>
> If aggregation over multiple columns is not currently supported, how
> might I reformulate this to work in HIVE, possibly via a simpler
> series of queries?
>
> (I will post the exact error and reproducible code if it turns out
> this query is valid).
>
> regards,
>
> Mike
>
> b: www.dataspora.com/blog
> t: www.twitter.com/dataspora
>