You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/05/04 01:43:43 UTC

[Hadoop Wiki] Update of "Hive/LanguageManual/UDF" by PhiloVivero

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/UDF" page has been changed by PhiloVivero.
The comment on this change is: Thanks to Andrew Done for this cool workaround..
http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF?action=diff&rev1=64&rev2=65

--------------------------------------------------

  
  == GROUPing and SORTing on f(column) ==
  
- A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other less granular date windows than by second. So you might want to select concat(year(dt),month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY a column on which you've applied a function, like this:
+ A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other less granular date windows than by second. So you might want to select concat(year(dt),month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY a column on which you've applied a function and alias, like this:
  
  {{{
  select f(col) as fc, count(*) from table_name group by fc
@@ -338, +338 @@

  FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
  }}}
  
- Because you are not able to GROUP BY or SORT BY a column on which a function has been applied. However, you can reformulate this query with subqueries:
+ Because you are not able to GROUP BY or SORT BY a column alias on which a function has been applied. There are two workarounds. First, you can reformulate this query with subqueries, which is somewhat complicated:
- 
- {{{
- select sq.fc,count(*) from (select f(col) as fc from table_name) sq group by sq.fc
- }}}
- 
- You will have to specify all the columns you want along with the f(col) in both the subquery and the outside (which is obvious on retrospect). The general formula for the f(col) reformulation is:
  
  {{{
  select sq.fc,col1,col2,...,colN,count(*) from
@@ -352, +346 @@

   group by sq.fc,col1,col2,...,colN
  }}}
  
+ Or you can make sure not to use a column alias, which is simpler:
+ 
+ {{{
+ select f(col) as fc, count(*) from table_name group by f(col)
+ }}}
+ 
  Contact Tim Ellis (tellis) at RiotGames dot com if you would like to discuss this in further detail.