Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/02 20:34:25 UTC

[Hadoop Wiki] Update of "Hive/LanguageManual/SortBy" by Ning Zhang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/LanguageManual/SortBy" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/LanguageManual/SortBy?action=diff&rev1=7&rev2=8

--------------------------------------------------

  
  ''Cluster By'' is a short-cut for both ''Distribute By'' and ''Sort By''.
  
- Hive uses the columns in ''Distribute By'' to distribute the rows among reducers.  All rows with the same ''Distribute By'' columns will go to the same reducer.
+ Hive uses the columns in ''Distribute By'' to distribute the rows among reducers.  All rows with the same ''Distribute By'' columns will go to the same reducer. However, ''Distribute By'' does not guarantee clustering or sorting properties on the distributed keys. For example, suppose we distribute 5 rows to 2 reducers by column x, whose values are x1, x1, x2, x3, and x4. Reducer 1 might receive x1, x2, x1 (in that order), and reducer 2 might receive x3 and x4. All rows with the same key x1 are guaranteed to go to the same reducer (reducer 1 in this case), but they are not guaranteed to be adjacent to each other in that reducer's input.
  
  Instead of specifying ''Cluster By'', the user can specify ''Distribute By'' and ''Sort By'', so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
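
  The five-row example can be traced with a small simulation. This is only a sketch in Python, not Hive code: the key-to-reducer assignment and the arrival order are hypothetical values chosen to match the example, and the helper names `distribute_by` and `sort_within_reducers` are made up for illustration. In a real job Hive chooses the assignment itself, and the arrival order is not fixed.

```python
# Hypothetical assignment of keys to reducers, chosen to match the example.
# In a real job the assignment is decided by Hive, not by the user.
ASSIGNMENT = {"x1": 1, "x2": 1, "x3": 2, "x4": 2}


def distribute_by(rows, assignment):
    """Send each row to the reducer assigned to its key, keeping arrival order."""
    reducers = {}
    for key in rows:
        reducers.setdefault(assignment[key], []).append(key)
    return reducers


def sort_within_reducers(reducers):
    """Sort each reducer's rows locally, which also makes equal keys adjacent."""
    return {reducer: sorted(keys) for reducer, keys in reducers.items()}


# Assumed arrival order of the five rows; real arrival order is not fixed.
arrival_order = ["x1", "x2", "x1", "x3", "x4"]

distributed = distribute_by(arrival_order, ASSIGNMENT)
clustered = sort_within_reducers(distributed)

print(distributed)  # {1: ['x1', 'x2', 'x1'], 2: ['x3', 'x4']}
print(clustered)    # {1: ['x1', 'x1', 'x2'], 2: ['x3', 'x4']}
```

  In the first result both x1 rows are on reducer 1 but are separated by x2, which is the behaviour of ''Distribute By'' alone. Sorting within each reducer, as ''Distribute By'' together with ''Sort By'' on the same column (or ''Cluster By'') would, brings them together.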