You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/15 19:31:57 UTC

[Hadoop Wiki] Update of "Hive/LanguageManual/SortBy" by Ning Zhang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/SortBy" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy?action=diff&rev1=5&rev2=6

--------------------------------------------------

  <<TableOfContents>>
+ 
+ == Syntax of Order By ==
+ The ''ORDER BY'' syntax in Hive QL is similar to the syntax of ''ORDER BY'' in SQL language.
+ 
+ {{{
+ colOrder: ( ASC | DESC )
+ orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
+ query: SELECT expression (',' expression)* FROM src orderBy
+ }}}
+ 
+ There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish. 
  
  == Syntax of Sort By ==
  The ''SORT BY'' syntax is similar to the syntax of ''ORDER BY'' in SQL language.
@@ -11, +22 @@

  query: SELECT expression (',' expression)* FROM src sortBy
  }}}
  
- Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a single reducer.  The sort order will be dependent on the column types.  If the column is of numeric type, then the sort order is also in numeric order.  If the column is of string type, then the sort order will be lexicographical order.
+ Hive uses the columns in ''SORT BY'' to sort the rows before feeding the rows to a reducer.  The sort order will be dependent on the column types.  If the column is of numeric type, then the sort order is also in numeric order.  If the column is of string type, then the sort order will be lexicographical order.
  
  
  === Difference between Sort By and Order By ===
+ Hive supports ''SORT BY'' which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows withint a reducer. If there are more than one reducer, "sort by" may give partially ordered final results. 
- Most database systems supports ''ORDER BY'', which Hive does not support directly.
- Hive supports ''SORT BY'' which sorts the data per reducer.
  
  Basically, the data in each reducer will be sorted according to the order that the user specified.  The following example shows 
  
@@ -38, +48 @@

  1   1
  2   5
  }}}
- 
- 
- === Simulating Order By ===
- 
- We can set the number of reducers to 1, to make sure we have the same result as ''ORDER BY''.
- 
- {{{
- set mapred.reduce.tasks=1;
- SELECT key, value FROM src SORT BY key ASC, value DESC;
- }}}
- 
- This sometimes will make the reducer a performance bottleneck.  A lot of cases the user only wants to see the top N rows where N is a small number.  In this case, we can use LIMIT clause.  We don't have an example here but users are encouraged to provide one.
  
  === Setting Types for Sort By ===