Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/05/13 00:54:39 UTC

[Hadoop Wiki] Trivial Update of "Hive/Tutorial" by Ning Zhang

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by Ning Zhang.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=24&rev2=25

--------------------------------------------------

    * Currently dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. The reason is that if either merge parameter is set to true, a MapReduce job is launched for each particular partition to merge all of its files into one. In a dynamic partition insert the set of partitions is not known at compile time, so no such merge job can be generated. JIRA HIVE-1307 has been filed for this task. 
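  For example (a sketch; the exact behavior may vary across Hive versions), even if the merge parameters are enabled in the session, a dynamic partition insert ignores them:
{{{
    hive> set hive.merge.mapfiles=true;
    hive> set hive.merge.mapredfiles=true;
    hive> -- a dynamic partition INSERT in this session still skips the merge step,
    hive> -- because the set of target partitions is unknown at compile time
}}}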
  
  Troubleshooting and best practices:
-   * As stated above, there are too many dynamic partitions created by a particular mapper/reducer, a fatal error could be raised and the job will be killed. The error message looks something like:
+   * As stated above, if too many dynamic partitions are created by a particular mapper/reducer, a fatal error is raised and the job is killed. The error message looks something like: {{{
- {{{
      hive> set hive.exec.dynamic.partition.mode=nonstrict;
      hive> FROM page_view_stg pvs
-           INSERT OVERWRITE TABLE page_view PARTITION(dt=, country) 
+           INSERT OVERWRITE TABLE page_view PARTITION(dt, country) 
                   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, 
                         from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
- 
  ...
  2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
  [Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
  Ended Job = job_201005052204_28178 with errors
- ...
- }}}
- The problem of this that one mapper will take a random set of rows and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode.  One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions will be significantly reduced. The above example query could be rewritten to:
+ ...}}} The problem with this is that each mapper takes a random set of rows, so the number of distinct (dt, country) pairs it sees is very likely to exceed the limit of hive.exec.max.dynamic.partitions.pernode.  One way around this is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. This significantly reduces the number of distinct dynamic partitions each node has to create. The above example query could be rewritten to: {{{
- {{{
      hive> set hive.exec.dynamic.partition.mode=nonstrict;
      hive> FROM page_view_stg pvs
-           INSERT OVERWRITE TABLE page_view PARTITION(dt=, country) 
+           INSERT OVERWRITE TABLE page_view PARTITION(dt, country) 
                   SELECT * FROM (
                     SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, 
                           from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country 
                     DISTRIBUTE BY ds, country
                     ) T;
- }}}
- This query will generate a MapReduce job rather than Map-only job. The inner-most SELECT-clause will be converted to a plan to the mappers and the output will be distributed to the reducers based on the value of (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer which writes to the dynamic partitions. 
+ }}} This query will generate a MapReduce job rather than a map-only job. The inner-most SELECT clause will be converted into the plan for the mappers, and its output will be distributed to the reducers based on the value of the (ds, country) pairs. The INSERT clause will be converted into the plan for the reducers, which writes to the dynamic partitions. 
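  If redistributing the rows is not an option, another workaround (a sketch; the default values may differ across Hive versions) is to raise the partition limits mentioned above, at the cost of each mapper/reducer holding more files open at once:
{{{
    hive> set hive.exec.max.dynamic.partitions.pernode=1000;
    hive> set hive.exec.max.dynamic.partitions=2000;
}}}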
  
  == Inserting into local files ==
  In certain situations you may want to write the output to a local file so that you can load it into an Excel spreadsheet. This can be accomplished with the following command: