Posted to commits@hive.apache.org by jv...@apache.org on 2010/10/18 19:45:42 UTC

svn commit: r1023923 - in /hadoop/hive/trunk: CHANGES.txt README.txt

Author: jvs
Date: Mon Oct 18 17:45:42 2010
New Revision: 1023923

URL: http://svn.apache.org/viewvc?rev=1023923&view=rev
Log:
HIVE-1726. Update README file for 0.6.0 release
(Carl Steinbach via jvs)


Modified:
    hadoop/hive/trunk/CHANGES.txt
    hadoop/hive/trunk/README.txt

Modified: hadoop/hive/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/CHANGES.txt?rev=1023923&r1=1023922&r2=1023923&view=diff
==============================================================================
--- hadoop/hive/trunk/CHANGES.txt (original)
+++ hadoop/hive/trunk/CHANGES.txt Mon Oct 18 17:45:42 2010
@@ -649,6 +649,9 @@ Release 0.6.0 -  Unreleased
     HIVE-1725. Include metastore upgrade scripts in release tarball
     (Carl Steinbach via jvs)
 
+    HIVE-1726. Update README file for 0.6.0 release
+    (Carl Steinbach via jvs)
+
   OPTIMIZATIONS
 
     HIVE-1348. Move inputFileChanged() from ExecMapper to where it is needed

Modified: hadoop/hive/trunk/README.txt
URL: http://svn.apache.org/viewvc/hadoop/hive/trunk/README.txt?rev=1023923&r1=1023922&r2=1023923&view=diff
==============================================================================
--- hadoop/hive/trunk/README.txt (original)
+++ hadoop/hive/trunk/README.txt Mon Oct 18 17:45:42 2010
@@ -1,379 +1,105 @@
-How to Hive
------------
+Apache Hive 0.6.0
+=================
 
-DISCLAIMER: This is a prototype version of Hive and is NOT production
-quality. This is provided mainly as a way of illustrating the
-capabilities of Hive and is provided as-is. However - we are working
-hard to make Hive a production quality system. Hive has only been
-tested on Unix (Linux) and Mac OS X using Java 1.6 for now - although
-it may very well work on other similar platforms. It does not work on
-Cygwin. Most of our testing has been on Hadoop 0.17 - so we would
-advise running it against this version of hadoop - even though it may
-compile/work against other versions.
-
-Useful mailing lists
---------------------
-
-1. hive-user@hadoop.apache.org - To discuss and ask usage
-   questions. Send an empty email to
-   hive-user-subscribe@hadoop.apache.org in order to subscribe to this
-   mailing list.
-
-2. hive-core@hadoop.apache.org - For discussions around code, design
-   and features. Send an empty email to
-   hive-core-subscribe@hadoop.apache.org in order to subscribe to this
-   mailing list.
-
-3. hive-commits@hadoop.apache.org - In order to monitor the commits to
-   the source repository. Send an empty email to
-   hive-commits-subscribe@hadoop.apache.org in order to subscribe to
-   this mailing list.
-
-Downloading and building
-------------------------
-
-- svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive_trunk
-- cd hive_trunk
-- hive_trunk> ant package -Dtarget.dir=<install-dir> -Dhadoop.version=0.17.2.1
-
-If you do not specify a value for target.dir it will use the default
-value build/dist. Then you can run the unit tests if you want to make
-sure it is correctly built:
-
-- hive_trunk> ant test -Dtarget.dir=<install-dir> -Dhadoop.version=0.17.2.1 \
-              -logfile test.log
-
-You can replace 0.17.2.1 with 0.18.3, 0.19.0, etc. to match the
-version of hadoop that you are using. If you do not specify
-hadoop.version it will use the default value specified in the
-build.properties file.
-
-
-In the rest of the README, we use dist and <install-dir> interchangeably.
-
-Running Hive
-------------
-
-Hive uses Hadoop, which means:
-- you must have hadoop in your path OR
-- export HADOOP=<hadoop-install-dir>/bin/hadoop
-
-To use the Hive command line interface (cli) from the shell:
-$ cd <install-dir>
-$ bin/hive
-
-Using Hive
-----------
-
-Configuration management overview
----------------------------------
-
-- hive configuration is stored in <install-dir>/conf/hive-default.xml
-  and the log4j configuration in hive-log4j.properties
-- hive configuration is an overlay on top of hadoop - meaning the
-  hadoop configuration variables are inherited by default.
-- hive configuration can be manipulated by:
-  o editing hive-default.xml and defining any desired variables
-    (including hadoop variables) in it
-  o from the cli using the set command (see below)
-  o by invoking hive using the syntax:
-     * bin/hive -hiveconf x1=y1 -hiveconf x2=y2
-    this sets the variables x1 and x2 to y1 and y2
-
-
-Error Logs
-----------
-Hive uses log4j for logging. By default logs are not emitted to the
-console by the cli. They are stored in the file:
-- /tmp/{user.name}/hive.log
-
-If the user wishes - the logs can be emitted to the console by adding
-the arguments shown below:
-- bin/hive -hiveconf hive.root.logger=INFO,console
-
-Note that setting hive.root.logger via the 'set' command does not
-change logging properties since they are determined at initialization time.
-
-Error logs are very useful to debug problems. Please send them with
-any bugs (of which there are many!) to the JIRA at
-https://issues.apache.org/jira/browse/HIVE.
-
-
-DDL Operations
---------------
-
-Creating Hive tables and browsing through them
-
-hive> CREATE TABLE pokes (foo INT, bar STRING);
-
-Creates a table called pokes with two columns, the first being an
-integer and the other a string.
-
-hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
-
-Creates a table called invites with two columns and a partition column
-called ds. The partition column is a virtual column.  It is not part
-of the data itself, but is derived from the partition that a
-particular dataset is loaded into.
-
-
-By default tables are assumed to be of text input format and the
-delimiters are assumed to be ^A (ctrl-a). We will soon publish
-additional commands/recipes for adding binary (sequencefile) data,
-configurable delimiters, etc.
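-
-Delimiters can already be overridden when creating a table. A minimal
-sketch (assuming your build supports the ROW FORMAT clause; the table
-name is made up for illustration):
-
-hive> CREATE TABLE tab_separated (foo INT, bar STRING)
-    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
-    > STORED AS TEXTFILE;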
-
-hive> SHOW TABLES;
-
-lists all the tables
-
-hive> SHOW TABLES '.*s';
-
-lists all the tables that end with 's'. The pattern matching follows Java
-regular expressions. Check out this link for documentation
-http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
-
-hive> DESCRIBE invites;
-
-shows the list of columns
-
-hive> DESCRIBE EXTENDED invites;
-
-shows the list of columns plus any other meta information about the table
-
-Altering tables. Table names can be changed and columns can be added
-or replaced:
-
-hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
-hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
-hive> ALTER TABLE pokes REPLACE COLUMNS (c1 INT, c2 STRING);
-hive> ALTER TABLE events RENAME TO 3koobecaf;
-
-Dropping tables
-hive> DROP TABLE pokes;
+Hive is a data warehouse system for Hadoop that facilitates
+easy data summarization, ad-hoc querying and analysis of large
+datasets stored in Hadoop compatible file systems. Hive provides a
+mechanism to put structure on this data and query the data using a
+SQL-like language called HiveQL. At the same time this language also
+allows traditional map/reduce programmers to plug in their custom
+mappers and reducers when it is inconvenient or inefficient to express
+this logic in HiveQL.
 
+Please note that Hadoop is a batch processing system and Hadoop jobs
+tend to have high latency and incur substantial overheads in job
+submission and scheduling. Consequently the average latency for Hive
+queries is generally very high (minutes) even when data sets involved
+are very small (say a few hundred megabytes). As a result it cannot be
+compared with systems such as Oracle where analyses are conducted on a
+significantly smaller amount of data but the analyses proceed much
+more iteratively with the response times between iterations being less
+than a few minutes. Hive aims to provide acceptable (but not optimal)
+latency for interactive data browsing, queries over small data sets or
+test queries.
 
-Metadata Store
---------------
+Hive is not designed for online transaction processing and does not
+support real-time queries or row-level inserts and updates. It is best used
+for batch jobs over large sets of immutable data (like web logs). What
+Hive values most are scalability (scale out with more machines added
+dynamically to the Hadoop cluster), extensibility (with MapReduce
+framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with
+its input formats.
 
-Metadata is in an embedded Derby database whose location is determined by the
-hive configuration variable named javax.jdo.option.ConnectionURL. By default
-(see conf/hive-default.xml) - this location is ./metastore_db
 
-Right now, in the default configuration, this metadata can only be seen by
-one user at a time.
+General Info
+============
 
-The metastore can be stored in any database that is supported by JPOX. The
-location and the type of the RDBMS can be controlled by the two variables
-'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'.
-Refer to JDO (or JPOX) documentation for more details on supported databases.
-The database schema is defined in the JDO metadata annotations file
-package.jdo at src/contrib/hive/metastore/src/model.
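-
-For example, to point the metastore at a shared MySQL database instead
-of the embedded Derby one, something like the following could be set
-in hive-default.xml (the host and database names are placeholders):
-
-  <property>
-    <name>javax.jdo.option.ConnectionURL</name>
-    <value>jdbc:mysql://myhost/metastore_db?createDatabaseIfNotExist=true</value>
-  </property>
-  <property>
-    <name>javax.jdo.option.ConnectionDriverName</name>
-    <value>com.mysql.jdbc.Driver</value>
-  </property>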
+For the latest information about Hive, please visit our website at:
 
-In the future, the metastore itself can be a standalone server.
+  http://hive.apache.org/
 
+and our wiki at:
 
-DML Operations
---------------
+  http://wiki.apache.org/hadoop/Hive/
 
-Loading data from flat files into Hive
 
-hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
+Getting Started
+===============
 
-Loads a file that contains two columns separated by ctrl-a into the
-pokes table. 'local' signifies that the input file is on the local
-file system. If 'local' is omitted then it looks for the file in HDFS.
+- Installation Instructions and a quick tutorial:
+  http://wiki.apache.org/hadoop/Hive/GettingStarted
 
-The keyword 'overwrite' signifies that existing data in the table is
-deleted.  If the 'overwrite' keyword is omitted - then data files are
-appended to existing data sets.
+- A longer tutorial that covers more features of HiveQL:
+  http://wiki.apache.org/hadoop/Hive/Tutorial
 
-NOTES:
+- The HiveQL Language Manual:
+  http://wiki.apache.org/hadoop/Hive/LanguageManual
 
-- NO verification of data against the schema is performed
 
-- If the file is in hdfs it is moved into the hive-controlled file
-  system namespace. The root of the hive directory is specified by
-  the option hive.metastore.warehouse.dir in hive-default.xml. We
-  advise that this directory exist before you try to create tables
-  via Hive.
+Requirements
+============
 
-hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' \
-      OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
+- Java 1.6
 
-hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' \
-      OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
+- Hadoop 0.17, 0.18, 0.19, or 0.20.
 
-The two LOAD statements above load data into two different partitions
-of the table invites. Table invites must be created as partitioned by
-the key ds for this to succeed.
+  *NOTE*: We strongly recommend that you use Hadoop 0.20
+  since the majority of our testing is done against this
+  version and because support for pre-0.20 versions of
+  Hadoop will be dropped in Hive 0.7.
 
-Loading/Extracting data using Queries
--------------------------------------
 
-Runtime configuration
----------------------
+Upgrading from older versions of Hive
+=====================================
 
-- Hive queries are executed as map-reduce jobs and as such their
-  behavior can be controlled by the hadoop configuration variables.
+- Hive 0.6.0 includes changes to the MetaStore schema. If
+  you are upgrading from an earlier version of Hive it is
+  imperative that you upgrade the MetaStore schema by
+  running the appropriate schema upgrade script located in
+  the scripts/metastore/upgrade directory.
 
-- The cli can be used to set any hadoop (or hive) configuration
-  variable. For example:
+  We have provided upgrade scripts for Derby, MySQL, and PostgreSQL
+  databases. If you are using a different database for your MetaStore
+  you will need to provide your own upgrade script.
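+
+  For example, upgrading a MySQL MetaStore might look like the
+  following (the script name, database name, and credentials shown
+  are illustrative; check the scripts/metastore/upgrade directory for
+  the script matching your current version):
+
+    $ mysql -u hive -p metastore_db \
+        < scripts/metastore/upgrade/mysql/upgrade-0.5.0-to-0.6.0.mysql.sql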
 
-   hive> SET mapred.job.tracker=myhost.mycompany.com:50030
-   hive> SET -v
+- Hive 0.6.0 includes new configuration properties. If you
+  are upgrading from an earlier version of Hive it is imperative
+  that you replace all of the old copies of the hive-default.xml
+  configuration file with the new version located in the conf/
+  directory.
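+
+  For example (the paths shown are illustrative):
+
+    $ cp /path/to/hive-0.6.0/conf/hive-default.xml $HIVE_CONF_DIR/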
 
-  The latter shows all the current settings. Without the -v option only
-  the variables that differ from the base hadoop configuration are
-  displayed.
 
-- In particular, the number of reducers should be set to a reasonable
-  number to get good performance (the default is 1!); an example follows.
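-
-  To set the reducer count for subsequent queries (32 is arbitrary;
-  pick a value suited to your cluster and data size):
-
-   hive> SET mapred.reduce.tasks=32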
-
-
-EXAMPLE QUERIES
----------------
-
-Some example queries are shown below. More are available in the hive
-code: ql/src/test/queries/{positive,clientpositive}.
-
-SELECTS and FILTERS
--------------------
-
-hive> SELECT a.foo FROM invites a;
-
-Selects column 'foo' from all rows of the invites table. The results are
-not stored anywhere, but are displayed on the console.
-
-Note that in all the examples that follow, INSERT (into a hive table,
-local directory or HDFS directory) is optional.
-
-hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
-
-Selects all rows from the invites table into an HDFS directory. The
-result data is in files (the number depending on the number of
-mappers) in that directory.
-
-NOTE: partition columns, if any, are selected by the use of *. They can
-also be specified in the projection clauses.
-
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
-
-Selects all rows from the pokes table into a local directory.
-
-hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
-hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
-hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
-
-Sum of a column. AVG, MIN and MAX can also be used, as shown below.
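-
-A quick illustration reusing the pc1 table from above (the output
-directory name is arbitrary):
-
-hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/minmax'
-    > SELECT MIN(a.pc), MAX(a.pc), AVG(a.pc) FROM pc1 a;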
-
-NOTE: there are some flaws with the type system that cause doubles to
-be returned where integer types would be expected. We expect to fix
-these in the coming weeks.
-
-GROUP BY
---------
-
-hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
-hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
-
-NOTE: Currently Hive always uses a two-stage map-reduce for the
-group-by operation. This is to handle skews in the input data. We
-will be optimizing this in the coming weeks.
-
-JOIN
-----
-
-hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
-
-MULTITABLE INSERT
------------------
-FROM src
-INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
-INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
-INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
-INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300
-
-STREAMING
----------
-hive> FROM invites a INSERT OVERWRITE TABLE events
-    > MAP a.foo, a.bar USING '/bin/cat'
-    > AS oof, rab WHERE a.ds > '2008-08-09';
-
-This streams the data in the map phase through the script /bin/cat
-(like hadoop streaming). Similarly, streaming can be used on the
-reduce side, as sketched below. Please look at the mapreduce*.q files
-for complete examples.
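-
-A minimal sketch of reduce-side streaming in the same spirit (CLUSTER
-BY routes rows to reducers; /bin/cat again stands in for a real
-reduce script):
-
-hive> FROM (FROM invites a
-    >       MAP a.foo, a.bar USING '/bin/cat' AS oof, rab
-    >       CLUSTER BY oof) mo
-    > INSERT OVERWRITE TABLE events
-    > REDUCE mo.oof, mo.rab USING '/bin/cat' AS foo, bar;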
-
-KNOWN BUGS/ISSUES
------------------
-
-* The Hive cli may hang for a couple of minutes because of a bug in
-  getting metadata from the derby database. Let it run and you'll be
-  fine!
-
-* Hive cli creates derby.log in the directory from which it has been
-  invoked.
-
-* COUNT(*) does not work for now. Use COUNT(1) instead.
-
-* ORDER BY not supported yet.
-
-* CASE not supported yet.
-
-* Only string and Thrift types (http://developers.facebook.com/thrift)
-  have been tested.
-
-* When doing a join, please put the table with the largest number of
-  rows per join key rightmost in the JOIN clause. Otherwise you may
-  see OutOfMemory errors.
-
-FUTURE FEATURES
----------------
-* EXPLODE function to generate multiple rows from a column of list type.
-* Table statistics for query optimization.
-
-Developing Hive using Eclipse
------------------------------
-1. Follow the 3 steps in the "Downloading and building" section above.
-
-2. Generate test files for eclipse:
-
-   hive_trunk> cd metastore; ant model-jar; cd ..
-   hive_trunk> cd ql; ant gen-test; cd ..
-   hive_trunk> ant eclipse-files
-
-   This will generate .project, .settings, .externalToolBuilders,
-   .classpath, and a set of *.launch files for Eclipse-based
-   debugging.
-
-3. Launch Eclipse and import the Hive project by navigating to
-   File->Import->General->Existing Projects menu and selecting
-   the hive_trunk directory.
-
-4. Run the CLI or a set of test cases by right-clicking on one of the
-   *.launch configurations and selecting Run or Debug.
-
-Enabling Checkstyle Plugin in Eclipse
--------------------------------------
-1. Follow the steps in "Developing Hive using Eclipse" to import the
-   Hive project into your Eclipse workbench.
-
-2. If you do not have the Checkstyle plugin for Eclipse, install it
-   before proceeding to the next step. For more information refer to
-   http://eclipse-cs.sourceforge.net/.
-
-3. In the package navigator, select the hive project, right-click
-   and select Checkstyle > Activate Checkstyle. This will cause
-   the checkstyle plugin to activate and analyze the project sources.
- 
+Useful mailing lists
+====================
 
-Development Tips
-------------------------
-* You may use the following lines to test a specific testcase with a specific query file:
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestParse -Dqfile=udf4.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestParseNegative -Dqfile=invalid_dot.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestCliDriver -Dqfile=udf1.q test
-ant -Dhadoop.version='0.17.0' -Dtestcase=TestNegativeCliDriver -Dqfile=invalid_tbl_name.q test
+1. user@hive.apache.org - To discuss and ask usage questions. Send an
+   empty email to user-subscribe@hive.apache.org in order to subscribe
+   to this mailing list.
+
+2. dev@hive.apache.org - For discussions about code, design and features.
+   Send an empty email to dev-subscribe@hive.apache.org in order to subscribe
+   to this mailing list.
+
+3. commits@hive.apache.org - In order to monitor commits to the source
+   repository. Send an empty email to commits-subscribe@hive.apache.org
+   in order to subscribe to this mailing list.