You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Andraz Tori <an...@zemanta.com> on 2009/07/31 13:18:25 UTC

Group by week?

Two questions:

1. How would one group by week (instead of date or time)?

My first idea was the following:

INSERT OVERWRITE TABLE platforms_weekly select platform, count(distinct
apikey), count(1), date_add(pdate, datediff('2007-12-31', pdate) / 7)
from shairlogs group by platform, datediff('2007-12-31', pdate) / 7


but the issue is that Hive can't know there's a subcalculation that is
the same in both group part and output part.

I tried to use AS statement with group expression, but no luck there [or
maybe I didn't know how].


2. When grouping by months there seems to be a problem of getting the
proper 'yyyy-mm' as part of output columns. 

Here my idea was:
INSERT OVERWRITE TABLE platforms_monthly select platform, count(distinct
apikey), count(1), concat(concat(CAST(year(pdate) as STRING), '-'),
CAST(month(pdate) as STRING)) from shairlogs group by platform,
year(pdate), month(pdate)"

But the issue is that output yyyy-mm strings are not zero-padded for
months and lexical sort then breaks on them...









-- 
Andraz Tori, CTO
Zemanta Ltd, New York, London, Ljubljana
www.zemanta.com
mail: andraz@zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test

RE: Group by week?

Posted by Ashish Thusoo <at...@facebook.com>.

Hive expects the expressions in the group by clause to be top level expressions in the select list

so

select platform, datediff('2007-12-31', pdate)/7, count....
from
group by platform, datediff('2007-12-31', pdate)/7

should work. You could also put the select list into a subquery and then do a group by on top of it.

For getting numeric sorting you can do an order by CAST(month(pdate) AS INT). That will probably work for you?

Ashish
________________________________________
From: Andraz Tori [andraz@zemanta.com]
Sent: Friday, July 31, 2009 4:18 AM
To: hive-user@hadoop.apache.org
Subject: Group by week?

Two questions:

1. How would one group by week (instead of date or time)?

My first idea was the following:

INSERT OVERWRITE TABLE platforms_weekly select platform, count(distinct
apikey), count(1), date_add(pdate, datediff('2007-12-31', pdate) / 7)
from shairlogs group by platform, datediff('2007-12-31', pdate) / 7


but the issue is that Hive can't know there's a subcalculation that is
the same in both group part and output part.

I tried to use AS statement with group expression, but no luck there [or
maybe I didn't know how].


2. When grouping by months there seems to be a problem of getting the
proper 'yyyy-mm' as part of output columns.

Here my idea was:
INSERT OVERWRITE TABLE platforms_monthly select platform, count(distinct
apikey), count(1), concat(concat(CAST(year(pdate) as STRING), '-'),
CAST(month(pdate) as STRING)) from shairlogs group by platform,
year(pdate), month(pdate)"

But the issue is that output yyyy-mm strings are not zero-padded for
months and lexical sort then breaks on them...









--
Andraz Tori, CTO
Zemanta Ltd, New York, London, Ljubljana
www.zemanta.com
mail: andraz@zemanta.com
tel: +386 41 515 767
twitter: andraz, skype: minmax_test