Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/12 10:53:15 UTC

[Hadoop Wiki] Update of "Hive/GettingStarted" by JoydeepSensarma

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GettingStarted" page has been changed by JoydeepSensarma.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=34&rev2=35

--------------------------------------------------

        this sets the variables x1 and x2 to y1 and y2 respectively
      * By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above
  
+ === Runtime configuration ===
+ 
+   * Hive queries are executed as map-reduce jobs and, therefore, the behavior 
+   of such queries can be controlled by the Hadoop configuration variables.
+ 
+   * The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
+ 
+ {{{
+     hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
+     hive> SET -v;
+ }}}
+ 
+   The latter shows all the current settings. Without the -v option, only the 
+   variables that differ from the base Hadoop configuration are displayed.
+ 
+ === Hive, Map-Reduce and Local-Mode ===
+ 
+ The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:
+ {{{ 
+   mapred.job.tracker
+ }}}
+ 
+ While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful for running queries over small data sets: in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. On the other hand, local mode only runs with one reducer and can be very slow when processing larger data sets. 
+ 
+ Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:
+ {{{
+   hive> SET mapred.job.tracker=local;
+ }}}
+ In addition, mapred.local.dir should point to a path that is valid on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception when allocating local disk space.
+ 
+ Starting with release 0.7, Hive also supports a mode that runs map-reduce jobs in local mode automatically. The relevant option, shown here with its default value, is:
+ {{{
+   hive> SET hive.exec.mode.local.auto=false;
+ }}}
+ 
+ Note that this feature is ''disabled'' by default. To enable it, set hive.exec.mode.local.auto to true. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
+   * The total input size of the job is lower than: ''hive.exec.mode.local.auto.inputbytes.max'' (128MB by default)
+   * The total number of map-tasks is less than: ''hive.exec.mode.local.auto.tasks.max'' (4 by default)
+   * The total number of reduce tasks required is 1 or 0.
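
The three thresholds above can be read as a single predicate. The following is a minimal sketch of that rule only, not Hive's actual implementation: the function name and parameter names are hypothetical, and it assumes that "128MB" means 128 * 1024 * 1024 bytes.

```python
# Illustrative sketch of the auto local-mode rule described above.
# Not Hive's code: names are hypothetical and 128MB is assumed to be 128 * 2**20 bytes.

DEFAULT_INPUTBYTES_MAX = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
DEFAULT_TASKS_MAX = 4                       # hive.exec.mode.local.auto.tasks.max


def may_run_locally(input_bytes, map_tasks, reduce_tasks,
                    inputbytes_max=DEFAULT_INPUTBYTES_MAX,
                    tasks_max=DEFAULT_TASKS_MAX):
    """Return True if a job satisfies all three thresholds listed above."""
    return (
        input_bytes < inputbytes_max    # total input size is lower than the limit
        and map_tasks < tasks_max       # number of map tasks is less than the limit
        and reduce_tasks <= 1           # 1 or 0 reduce tasks required
    )


if __name__ == "__main__":
    ten_mb = 10 * 1024 * 1024
    print(may_run_locally(ten_mb, map_tasks=2, reduce_tasks=1))
    print(may_run_locally(ten_mb, map_tasks=2, reduce_tasks=2))
```

Satisfying the predicate only makes a job eligible: as stated above, Hive "may" run it locally, and the feature must first be enabled.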
+ 
+ So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally. Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode.
+ 
  === Error Logs ===
  Hive uses log4j for logging. By default logs are not emitted to the 
  console by the CLI. The default logging level is WARN and the logs are stored in the folder:
@@ -218, +260 @@

  Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
  
  == SQL Operations ==
- === Runtime configuration ===
- 
-   * Hive queries are executed using map-reduce queries and, therefore, the behavior 
-   of such queries can be controlled by the hadoop configuration variables.
- 
-   * The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
- 
- {{{
-     hive> SET mapred.job.tracker=myhost.mycompany.com:50030
-     hive> SET -v 
- }}}
- 
-   The latter shows all the current settings. Without the -v option only the 
-   variables that differ from the base hadoop configuration are displayed
-   * In particular, the number of reducers should be set to a reasonable number 
-   to get good performance (the default is 1!)
- 
- 
  === Example Queries ===
  
  Some example queries are shown below. They are available in build/dist/examples/queries.