Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2009/05/17 18:22:34 UTC

[Hadoop Wiki] Trivial Update of "Hive/HiveAws" by JoydeepSensarma

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JoydeepSensarma:
http://wiki.apache.org/hadoop/Hive/HiveAws

------------------------------------------------------------------------------
    * If the default Derby database is used, then one has to think about persisting metastore state beyond the lifetime of a single Hadoop cluster. S3 is an obvious choice, but the user must restore the Hive metadata when a new Hadoop cluster is launched and back it up before the cluster is terminated (see the sketch after this list).
  
   2. Run Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution on a personal workstation. The main trick with this option is connecting to the Hadoop cluster, both for submitting jobs and for reading and writing files to HDFS. The section on [[http://wiki.apache.org/hadoop/AmazonEC2#FromRemoteMachine Running jobs from a remote machine]] details how this can be done. [wiki:/HivingS3nRemotely Case Study 1] goes into the setup for this in more detail. This option solves the problems mentioned above:
-   * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired version etc. on EC2 and start running queries.
+   * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired Hadoop version etc. on EC2 and start running queries.
    * Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job submission time and do not need to be copied to the Hadoop machines.
    * Hive Metadata can be stored on local disk painlessly.
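 A minimal sketch of the backup/restore step mentioned in the first option above, assuming the default Derby metastore directory ({{{metastore_db}}}, created in the directory where the Hive CLI is run), a hypothetical bucket named {{{my-hive-bucket}}}, and S3 credentials ({{{fs.s3n.awsAccessKeyId}}}/{{{fs.s3n.awsSecretAccessKey}}}) already configured for the s3n filesystem:
 {{{
 # before terminating the Hadoop cluster: archive the local Derby metastore and push it to S3
 tar czf metastore_db.tar.gz metastore_db
 hadoop fs -put metastore_db.tar.gz s3n://my-hive-bucket/hive/metastore_db.tar.gz

 # after launching a new cluster: pull the archive back and unpack it before starting the Hive CLI
 hadoop fs -get s3n://my-hive-bucket/hive/metastore_db.tar.gz .
 tar xzf metastore_db.tar.gz
 }}}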
  
@@ -56, +56 @@

  
  == Submitting jobs to a Hadoop cluster ==
  This applies particularly when the Hive CLI is run remotely. A single Hive CLI session can switch across different Hadoop clusters (especially as clusters are brought up and terminated). Only two configuration variables:
-  * fs.default.name
+  * {{{fs.default.name}}}
-  * mapred.job.tracker
+  * {{{mapred.job.tracker}}}
  need to be changed to point the CLI from one Hadoop cluster to another (as shown below). Beware though that tables stored in the previous HDFS instance will not be accessible once the CLI switches from one cluster to another. Again, more details can be found in [wiki:/HivingS3nRemotely Case Study 1].
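 For example, a remote CLI session can be repointed at a freshly launched cluster with the {{{set}}} command (the hostname and ports below are placeholders; substitute the master node's actual namenode and jobtracker addresses):
 {{{
 hive> set fs.default.name=hdfs://<ec2-master-public-dns>:<namenode-port>;
 hive> set mapred.job.tracker=<ec2-master-public-dns>:<jobtracker-port>;
 }}}
 The same two variables can instead be set in the workstation's Hadoop configuration file if switching clusters is infrequent.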
  
  == Case Studies ==
   1. [wiki:/HivingS3nRemotely Querying files in S3 using EC2, Hive and Hadoop]
  
  == Appendix ==
- 
  [[Anchor(S3n00b)]]
  === S3 for n00bs ===
- One useful thing to understand is how S3 is normally used as a file system. Each S3 bucket can be considered the root of a file system. Files within this file system become objects stored in S3: the path name of the file (path components joined with '/') becomes the S3 key within the bucket, and the file contents become the value. Tools like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 FileSystem in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in the keys. Not all tools are able to create an empty directory. In particular, S3Fox can (by creating an empty key representing the directory). Other popular tools like [[http://timkay.com/aws/ aws]], [[http://s3tools.org/s3cmd s3cmd]] and [[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128 s3curl]] provide convenient ways of accessing S3 from the command line, but don't have the capability of creating empty directories.
+ One useful thing to understand is how S3 is normally used as a file system. Each S3 bucket can be considered the root of a file system. Files within this file system become objects stored in S3: the path name of the file (path components joined with '/') becomes the S3 key within the bucket, and the file contents become the value. Tools like [[https://addons.mozilla.org/en-US/firefox/addon/3247 S3Fox]] and the native S3 !FileSystem in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in the keys. Not all tools are able to create an empty directory. In particular, S3Fox can (by creating an empty key representing the directory). Other popular tools like [[http://timkay.com/aws/ aws]], [[http://s3tools.org/s3cmd s3cmd]] and [[http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128 s3curl]] provide convenient ways of accessing S3 from the command line, but don't have the capability of creating empty directories.
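 As a small illustration of the implied directory structure (the bucket and key names are made up, and S3 credentials are assumed to be configured in the Hadoop configuration): if a bucket holds objects with keys {{{data/logs/2009-05-01.txt}}} and {{{data/logs/2009-05-02.txt}}}, the s3n filesystem presents the shared prefixes as directories:
 {{{
 hadoop fs -ls s3n://my-hive-bucket/data/
 hadoop fs -ls s3n://my-hive-bucket/data/logs/
 hadoop fs -cat s3n://my-hive-bucket/data/logs/2009-05-01.txt
 }}}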