Posted to hcatalog-commits@incubator.apache.org by ga...@apache.org on 2012/03/15 21:00:34 UTC

svn commit: r1301196 [1/2] - in /incubator/hcatalog/trunk: ./ src/docs/src/documentation/content/xdocs/ src/docs/src/documentation/content/xdocs/images/

Author: gates
Date: Thu Mar 15 21:00:34 2012
New Revision: 1301196

URL: http://svn.apache.org/viewvc?rev=1301196&view=rev
Log:
HCATALOG-130 Documentation improvements

Removed:
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/importexport.xml
Modified:
    incubator/hcatalog/trunk/CHANGES.txt
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/cli.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/dynpartition.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/images/hcat-product.jpg
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/index.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/inputoutput.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/install.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/loadstore.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/notification.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/rpminstall.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/site.xml
    incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/supportedformats.xml

Modified: incubator/hcatalog/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/CHANGES.txt?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/CHANGES.txt (original)
+++ incubator/hcatalog/trunk/CHANGES.txt Thu Mar 15 21:00:34 2012
@@ -66,6 +66,8 @@ Release 0.4.0 - Unreleased
   HCAT-2 Support nested schema conversion between Hive an Pig (julienledem via hashutosh)
 
   IMPROVEMENTS
+  HCAT-130 Documentation improvements (gates and lefty via gates)
+
   HCAT-266 Upgrade HBase dependency to 0.92 (thw via toffer)
 
   HCAT-243 HCat e2e tests need to change to not use StorageDrivers (gaets)

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/cli.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/cli.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/cli.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/cli.xml Thu Mar 15 21:00:34 2012
@@ -29,13 +29,6 @@
 <p>The HCatalog command line interface (CLI) can be invoked as <code>hcat</code>. </p>
 
 
-<p><strong>Authentication</strong></p>
-<table>
-	<tr>
-	<td><p>If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
-	</tr>
-</table>
-<p>If other errors occur while using the HCatalog CLI, more detailed messages (if any) are written to /tmp/&lt;username&gt;/hive.log. </p>
 </section>
 
 <section>
@@ -43,11 +36,11 @@
 
 <p>The HCatalog CLI supports these command line options:</p>
 <ul>
-<li><strong>-g</strong>: Usage is -g mygroup .... This indicates to HCatalog that table that needs to be created must have group as "mygroup" </li>
-<li><strong>-p</strong>: Usage is -p rwxr-xr-x .... This indicates to HCatalog that table that needs to be created must have permissions as "rwxr-xr-x" </li>
-<li><strong>-f</strong>: Usage is -f myscript.hcatalog .... This indicates to hcatalog that myscript.hcatalog is a file which contains DDL commands it needs to execute. </li>
-<li><strong>-e</strong>: Usage is -e 'create table mytable(a int);' .... This indicates to HCatalog to treat the following string as DDL command and execute it. </li>
-<li><strong>-D</strong>: Usage is -Dname=value .... This sets the hadoop value for given property</li>
+<li><strong>-g</strong>: Usage is -g mygroup .... This tells HCatalog that the table to be created must have the group "mygroup". </li>
+<li><strong>-p</strong>: Usage is -p rwxr-xr-x .... This tells HCatalog that the table to be created must have the permissions "rwxr-xr-x". </li>
+<li><strong>-f</strong>: Usage is -f myscript.hcatalog .... This tells HCatalog that myscript.hcatalog is a file containing DDL commands to execute. </li>
+<li><strong>-e</strong>: Usage is -e 'create table mytable(a int);' .... This tells HCatalog to treat the following string as a DDL command and execute it. </li>
+<li><strong>-D</strong>: Usage is -Dkey=value .... The key-value pair is passed to HCatalog as a Java system property. Example invocations combining these options are shown just after this list.</li>
 </ul>
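For illustration, a few example invocations that combine these options (the table, group, and property values are placeholders):
<source>
# create a table owned by group "datausers" with explicit permissions
hcat -g datausers -p rwxr-xr-x -e "create table web_logs (userid string, viewtime int) partitioned by (datestamp string) stored as rcfile;"

# run a file of DDL commands
hcat -f myscript.hcatalog

# pass a key=value pair through as a Java system property and run a single command
hcat -Dhive.metastore.uris=thrift://hcatsvr.acme.com:3306 -e "show tables;"
</source>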
 <p></p>	
 <p>Note the following:</p>
@@ -67,8 +60,6 @@ Usage: hcat  { -e "&lt;query&gt;" | -f "
 <p></p>
 <p><strong>Assumptions</strong></p>
 <p>When using the HCatalog CLI, you cannot specify a permission string without read permissions for owner, such as -wxrwxr-x. If such a permission setting is desired, you can use the octal version instead, which in this case would be 375. Also, any other kind of permission string where the owner has read permissions (for example r-x------ or r--r--r--) will work fine.</p>
-
-
 	
 </section>
 
@@ -76,117 +67,56 @@ Usage: hcat  { -e "&lt;query&gt;" | -f "
 <section>
 	<title>HCatalog DDL</title>
 	
-<p>HCatalog supports a subset of the <a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL">Hive Data Definition Language</a>. For those commands that are supported, any variances are noted below.</p>	
+<p>HCatalog supports all of the <a href="http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL">Hive Data Definition Language</a> except those operations that require running a MapReduce job. For the commands that are supported, any variances are noted below.</p>
+<p>HCatalog does not support the following Hive DDL commands:</p> 
+   <ul>
+     <li>IMPORT FROM ...</li>
+     <li>EXPORT TABLE</li>
+     <li>CREATE TABLE ... AS SELECT</li> 
+     <li>ALTER TABLE ... REBUILD</li> 
+     <li>ALTER TABLE ... CONCATENATE</li>
+     <li>ANALYZE TABLE ... COMPUTE STATISTICS</li>
+   </ul>
 
 <section>
 	<title>Create/Drop/Alter Table</title>
-<p><strong>CREATE TABLE</strong></p>	
 
-<p>The STORED AS clause in Hive is:</p>
-<source>
-[STORED AS file_format]
-file_format:
-  : SEQUENCEFILE
-  | TEXTFILE
-  | RCFILE     
-  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
-</source>
 
-<p>The STORED AS clause in HCatalog is:</p>	
-<source>
-[STORED AS file_format]
-file_format:
-  : RCFILE     
-  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname 
-                   INPUTDRIVER input_driver_classname OUTPUTDRIVER output_driver_classname
-</source>
+<p><strong>CREATE TABLE</strong></p>	
+
+<p>If you create a table with a CLUSTERED BY clause you will not be able to write to it with Pig or MapReduce. This is because they do not understand how to partition the table, so attempting to write to it would cause data corruption.</p>
 
-<p>Note the following:</p>	
-<ul>
-<li>CREATE TABLE command must contain a "STORED AS" clause; if it doesn't it will result in an exception containing message "STORED AS specification is either incomplete or incorrect."  <br></br> <br></br>
-<table>
-	<tr>
-	<td><p>In this release, HCatalog supports only reading PigStorage formated text files and only writing RCFile formatted files. Therefore, for this release, the command must contain a "STORED AS" clause and either use RCFILE as the file format or specify org.apache.hadoop.hive.ql.io.RCFileInputFormat and org.apache.hadoop.hive.ql.io.RCFileOutputFormat as INPUTFORMAT and OUTPUTFORMAT respectively. </p></td>
-	</tr>
-</table>
-<br></br>
-</li>
-<li>For partitioned tables, partition columns can only be of type String. 
-</li>
-<li>CLUSTERED BY clause is not supported. If provided error message will contain "Operation not supported. HCatalog doesn't allow Clustered By in create table." 
-</li>
-</ul>
 <p></p>
 <p><strong>CREATE TABLE AS SELECT</strong></p>
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>	
-<p><strong>CREATE TABLE LIKE</strong></p>
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>	
+<p>Not supported. Throws an exception with the message "Operation Not Supported". </p>	
+	
 <p><strong>DROP TABLE</strong></p>
 <p>Supported. Behavior the same as Hive.</p>	
 
 
 <!-- ==================================================================== -->
 <p><strong>ALTER TABLE</strong></p>
-<source>
-ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...
- partition_spec:
-  : PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
-</source>
-<p>Note the following:</p>	
-<ul>
-<li>Allowed only if TABLE table_name was created using HCatalog. Else, throws an exception containing error message "Operation not supported. Partitions can be added only in a table created through HCatalog. It seems table tablename was not created through HCatalog" 
-</li>
-</ul>
-<p></p>
-<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->	
-<p><strong>ALTER TABLE FILE FORMAT</strong></p>
-<source>
-ALTER TABLE table_name SET FILEFORMAT file_format 
-</source>
-<p>Note the following:</p>	
-<ul>
-<li>Here file_format must be same as the one described above in CREATE TABLE. Else, throw an exception "Operation not supported. Not a valid file format." </li>
-<li>CLUSTERED BY clause is not supported. If provided will result in an exception "Operation not supported." </li>
-</ul>
-
-<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->	
-<p><strong>ALTER TABLE Change Column Name/Type/Position/Comment </strong></p>
-<source>
-ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
-</source>
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>
 
-<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->	
-<p><strong>ALTER TABLE Add/Replace Columns</strong></p>
-<source>
-ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
-</source>
-<p>Note the following:</p>	
-<ul>
-<li>ADD Columns is allowed. Behavior same as of Hive. </li>
-<li>Replace column is not supported. Throws an exception with message "Operation Not Supported". </li>
-</ul>
+<p>Supported except for the REBUILD and CONCATENATE options. Behavior the same as Hive.</p>
 
-<!-- ++++++++++++++++++++++++++++++++++++++++++++++++++++++ -->	
-<p><strong>ALTER TABLE TOUCH</strong></p>
-<source>
-ALTER TABLE table_name TOUCH;
-ALTER TABLE table_name TOUCH PARTITION partition_spec;
-</source>
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>	
+<p></p>
+	
 </section>
 
 <!-- ==================================================================== -->
 <section>
 	<title>Create/Drop/Alter View</title>
+<p>Note: Pig and MapReduce cannot read from or write to views.</p>
+
 <p><strong>CREATE VIEW</strong></p>	
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>		
+<p>Supported. Behavior same as Hive.</p>		
 	
 <p><strong>DROP VIEW</strong></p>	
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>			
+<p>Supported. Behavior same as Hive.</p>		
 
 <p><strong>ALTER VIEW</strong></p>	
-<p>Not supported. Throws an exception with message "Operation Not Supported". </p>			
+<p>Supported. Behavior same as Hive.</p>		
+
 </section>
 
 <!-- ==================================================================== -->
@@ -204,16 +134,41 @@ ALTER TABLE table_name TOUCH PARTITION p
 
 <p><strong>DESCRIBE</strong></p>
 <p>Supported. Behavior same as Hive.</p>
+
+</section>
+
+	<!-- ==================================================================== -->
+<section>
+	<title>Create/Drop Index</title>
+
+<p>CREATE and DROP INDEX operations are supported.</p>
+<p>Note: Pig and MapReduce cannot write to a table that has auto rebuild on, because Pig and MapReduce do not know how to rebuild the index.</p>
+</section>
+
+	<!-- ==================================================================== -->
+<section>
+	<title>Create/Drop Function</title>
+
+<p>CREATE and DROP FUNCTION operations are supported, but created functions must still be registered in Pig and placed in CLASSPATH for MapReduce.</p>
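As a rough sketch of what this means in practice for Pig (the jar path, class, and alias below are hypothetical):
<source>
-- register the jar that contains the function, then define an alias for it
register /path/to/myudfs.jar;
define myfunc com.acme.udfs.MyFunc();

B = foreach A generate myfunc(col1);
</source>
For MapReduce, the same jar would be shipped with the job, for example via the -libjars argument shown in the HCatalog Input and Output examples.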
+
 </section>
 	
 	<!-- ==================================================================== -->
 <section>
 	<title>Other Commands</title>
-	<p>Any command not listed above is NOT supported and throws an exception with message "Operation Not Supported". </p>
+	<p>Any command not listed above is NOT supported and throws an exception with the message "Operation Not Supported". </p>
 </section>
 
 </section>
 
+<p><strong>Authentication</strong></p>
+<table>
+	<tr>
+	<td><p>If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
+	</tr>
+</table>
+<p>If other errors occur while using the HCatalog CLI, more detailed messages are written to /tmp/&lt;username&gt;/hive.log. </p>
+
 
 
   </body>

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/dynpartition.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/dynpartition.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/dynpartition.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/dynpartition.xml Thu Mar 15 21:00:34 2012
@@ -27,7 +27,7 @@
 <section>
     <title>Overview</title>
     
- <p>In earlier versions of HCatalog, to read data users could specify that they were interested in reading from the table and specify various partition key/value combinations to prune, as if specifying a SQL-like where clause. However, to write data the abstraction was not as seamless. We still required users to write out data to the table, partition-by-partition, but these partitions required fine-grained knowledge of which key/value pairs they needed. We required this knowledge in advance, and we required the user to have already grouped the requisite data accordingly before attempting to store. </p>
+ <p>When writing data in HCatalog it is possible to write all records to a single partition. In this case the partition column(s) need not be in the output data.</p>
     
     <p>The following Pig script illustrates this: </p>
 <source>
@@ -40,17 +40,16 @@ store for_asia into 'processed' using HC
 </source>    
 <p></p>   
     
-    <p>This approach had a major issue. MapReduce programs and Pig scripts needed to be aware of all the possible values of a key, and these values needed to be maintained and/or modified when new values were introduced. With more partitions, scripts began to look cumbersome. And if each partition being written launched a separate HCatalog store, we were increasing the load on the HCatalog server and launching more jobs for the store by a factor of the number of partitions.</p>
-    
-    <p>A better approach is to have HCatalog determine all the partitions required from the data being written. This would allow us to simplify the above script into the following: </p>  
+    <p>If you want to write data to multiple partitions simultaneously, you can do so by placing the partition columns in the data and not specifying partition values when storing the data.</p>  
     
 <source>
 A = load 'raw' using HCatLoader(); 
 ... 
-store Z into 'processed' using HCatStorer("ds=20110110"); 
+store Z into 'processed' using HCatStorer(); 
 </source> 
 
-<p>The way dynamic partitioning works is that HCatalog locates partition columns in the data passed to it and uses the data in these columns to split the rows across multiple partitions. (The data passed to HCatalog <strong>must</strong> have a schema that matches the schema of the destination table and hence should always contain partition columns.)  It is important to note that partition columns can’t contain null values or the whole process will fail. It is also important note that all partitions created during a single run are part of a transaction and if any part of the process fails none of the partitions will be added to the table.</p>
+<p>The way dynamic partitioning works is that HCatalog locates partition columns in the data passed to it and uses the data in these columns to split the rows across multiple partitions. (The data passed to HCatalog <strong>must</strong> have a schema that matches the schema of the destination table and hence should always contain partition columns.)  It is important to note that partition columns can’t contain null values or the whole process will fail.</p>
+<p>It is also important to note that all partitions created during a single run are part of a transaction and if any part of the process fails none of the partitions will be added to the table.</p>
 </section>
   
 <!-- ==================================================================== -->  
@@ -80,7 +79,7 @@ store A into 'mytable' using HCatStorer(
 
 <p>On the other hand, if there is data that spans more than one partition, then HCatOutputFormat will automatically figure out how to spray the data appropriately. </p>
 
-<p>For example, let's say a=1 for all values across our dataset and b takes the value 1 and 2. Then the following statement... </p>
+<p>For example, let's say a=1 for all values across our dataset and b takes the values 1 and 2. Then the following statement... </p>
 <source>
 store A into 'mytable' using HCatStorer();
 </source>
@@ -108,28 +107,14 @@ store A2 into 'mytable' using HCatStorer
 Map&lt;String, String&gt; partitionValues = new HashMap&lt;String, String&gt;();
 partitionValues.put("a", "1");
 partitionValues.put("b", "1");
-HCatTableInfo info = HCatTableInfo.getOutputTableInfo(
-    serverUri, serverKerberosPrincipal, dbName, tblName, partitionValues);
+HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
 HCatOutputFormat.setOutput(job, info);
 </source> 
 
 <p>And to write to multiple partitions, separate jobs will have to be kicked off with each of the above.</p>   
 
-<p>With dynamic partition, we simply specify only as many keys as we know about, or as required. It will figure out the rest of the keys by itself and spray out necessary partitions, being able to create multiple partitions with a single job.</p>   
-
-</section>
-
-<!-- ==================================================================== -->  
-<section>
-      <title>Compaction</title> 
-<p>Dynamic partitioning potentially results in a large number of files and more namenode load. To address this issue, we utilize HAR to archive partitions after writing out as part of the HCatOutputCommitter action. Compaction is disabled by default. To enable compaction, use the Hive parameter hive.archive.enabled, specified in the client side hive-site.xml. The current behavior of compaction is to fail the entire job if compaction fails. </p>
-</section>
+<p>With dynamic partitioning, we simply specify only as many keys as we know about, or as required. HCatalog figures out the rest of the keys from the data itself and sprays the rows into the necessary partitions, creating multiple partitions with a single job.</p>
 
-<!-- ==================================================================== -->  
-<section>
-      <title>References</title>
-   <p>See <a href="https://cwiki.apache.org/HCATALOG/hcatalog02design.html">HCatalog 0.2 Architecture</a>  </p>      
-      
 </section>
   
   </body>

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/images/hcat-product.jpg
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/images/hcat-product.jpg?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
Binary files - no diff available.

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/index.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/index.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/index.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/index.xml Thu Mar 15 21:00:34 2012
@@ -25,8 +25,8 @@
    <section>
       <title>HCatalog </title>
       
-       <p>HCatalog is a table management and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, Hive, Streaming – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, sequence files. </p>
-<p>(Note: In this release, Streaming is not supported. Also, HCatalog supports only writing RCFile formatted files and only reading PigStorage formated text files.)</p>
+       <p>HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files. </p>
+<p>HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.</p>
 <p></p>
 <figure src="images/hcat-product.jpg" align="left" alt="HCatalog Product"/>
 
@@ -36,46 +36,43 @@
       
       <section>
       <title>HCatalog Architecture</title>
-      <p>HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and a command line interface for data definitions.</p>
-<p>(Note: HCatalog notification is not available in this release.)</p>
-
-<figure src="images/hcat-archt.jpg" align="left" alt="HCatalog Architecture"/>
+      <p>HCatalog is built on top of the Hive metastore and incorporates components from the Hive DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses
+      Hive's command line interface for issuing data definition and metadata exploration commands.</p>
 
 <p></p>
 
 <section>
 <title>Interfaces</title>   
-<p>The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatStorer only supports writing to one partition. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat respectively (see <a href="loadstore.html">HCatalog Load and Store</a>).</p>
+<p>The HCatalog interface for Pig – HCatLoader and HCatStorer – is an implementation of the Pig load and store interfaces. HCatLoader accepts a table to read data from; you can indicate which partitions to scan by immediately following the load statement with a partition filter statement. HCatStorer accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) in the STORE clause; and you can write to multiple partitions if the partition key(s) are columns in the data being stored. HCatLoader and HCatStorer are implemented on top of HCatInputFormat and HCatOutputFormat, respectively (see <a href="loadstore.html">HCatalog Load and Store</a>).</p>
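For illustration, a minimal Pig sketch of both cases, assuming hypothetical web_logs and processed_logs tables partitioned by a datestamp column:
<source>
A = load 'web_logs' using HCatLoader();
B = filter A by datestamp == '20110924';  -- partition filter immediately after the load

-- write to a single partition by naming the partition key and value:
store B into 'processed_logs' using HCatStorer('datestamp=20110924');

-- or write to multiple partitions by leaving datestamp in the data and giving no key/value:
store B into 'processed_logs' using HCatStorer();
</source>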
 
-<p>The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and a specification of partition keys to create a new partition. Currently HCatOutputFormat only supports writing to one partition (see <a href="inputoutput.html">HCatalog Input and Output</a>).</p>
+<p>The HCatalog interface for MapReduce – HCatInputFormat and HCatOutputFormat – is an implementation of Hadoop InputFormat and OutputFormat. HCatInputFormat accepts a table to read data from and optionally a selection predicate to indicate which partitions to scan. HCatOutputFormat accepts a table to write to and optionally a specification of partition keys to create a new partition. You can write to a single partition by specifying the partition key(s) and value(s) when setting up the job output (via OutputJobInfo.create), and you can write to multiple partitions if the partition key(s) are columns in the data being stored. (See <a href="inputoutput.html">HCatalog Input and Output</a>.)</p>
 
-<p><strong>Note:</strong> Currently there is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly as long as a SerDe for that data already exists. In the future we plan to write a HCatalogSerDe so that users won't need storage-specific SerDes and so that Hive users can write data to HCatalog. Currently, this is supported - if a Hive user writes data in the RCFile format, it is possible to read the data through HCatalog. Also, see <a href="supportedformats.html">Supported data formats</a>.</p>
+<p>Note: There is no Hive-specific interface. Since HCatalog uses Hive's metastore, Hive can read data in HCatalog directly.</p>
 
-<p>Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports most of the DDL portion of Hive's query language, allowing users to create, alter, drop tables, etc. The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the <a href="cli.html">HCatalog Command Line Interface</a>).</p> 
+<p>Data is defined using HCatalog's command line interface (CLI). The HCatalog CLI supports all Hive DDL that does not require MapReduce to execute, allowing users to create, alter, drop tables, etc. (Unsupported Hive DDL includes import/export, CREATE TABLE AS SELECT, ALTER TABLE options REBUILD and CONCATENATE, and ANALYZE TABLE ... COMPUTE STATISTICS.) The CLI also supports the data exploration part of the Hive command line, such as SHOW TABLES, DESCRIBE TABLE, etc. (see the <a href="cli.html">HCatalog Command Line Interface</a>).</p> 
 </section>
 
 <section>
 <title>Data Model</title>
-<p>HCatalog presents a relational view of data in HDFS. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.</p>
+<p>HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed in databases. Tables can also be hash partitioned on one or more keys; that is, for a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values). For example, if a table is partitioned on date and there are three days of data in the table, there will be three partitions in the table. New partitions can be added to a table, and partitions can be dropped from a table. Partitioned tables have no partitions at create time. Unpartitioned tables effectively have one default partition that must be created at table creation time. There is no guaranteed read consistency when a partition is dropped.</p>
 
-<p>Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. (In the future some ability to integrate changes to a partition will be added.) Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see <a href="loadstore.html">HCatalog Load and Store</a>). </p>
+<p>Partitions contain records. Once a partition is created records cannot be added to it, removed from it, or updated in it. Partitions are multi-dimensional and not hierarchical. Records are divided into columns. Columns have a name and a datatype. HCatalog supports the same datatypes as Hive (see <a href="loadstore.html">HCatalog Load and Store</a>). </p>
 </section>
      </section>
      
   <section>
   <title>Data Flow Example</title>
-  <p>This simple data flow example shows how HCatalog is used to move data from the grid into a database. 
-  From the database, the data can then be analyzed using Hive.</p>
+  <p>This simple data flow example shows how HCatalog can help grid users share and access data.</p>
   
  <p><strong>First</strong> Joe in data acquisition uses distcp to get data onto the grid.</p>
  <source>
 hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
 
-hcat "alter table rawevents add partition 20100819 hdfs://data/rawevents/20100819/data"
+hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
 </source>
   
 <p><strong>Second</strong> Sally in data processing uses Pig to cleanse and prepare the data.</p>  
-<p>Without HCatalog, Sally must be manually informed by Joe that data is available, or use Oozie and poll on HDFS.</p>
+<p>Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.</p>
 <source>
 A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
 B = filter A by bot_finder(zeta) = 0;
@@ -83,7 +80,7 @@ B = filter A by bot_finder(zeta) = 0;
 store Z into 'data/processedevents/20100819/data';
 </source>
 
-<p>With HCatalog, Oozie will be notified by HCatalog data is available and can then start the Pig job</p>
+<p>With HCatalog, a JMS notification is sent when the data is available, and the Pig job can then be started.</p>
 <source>
 A = load 'rawevents' using HCatLoader;
 B = filter A by date = '20100819' and by bot_finder(zeta) = 0;
@@ -99,14 +96,14 @@ alter table processedevents add partitio
 select advertiser_id, count(clicks)
 from processedevents
 where date = '20100819' 
-group by adverstiser_id;
+group by advertiser_id;
 </source> 
 <p>With HCatalog, Robert does not need to modify the table structure.</p>
  <source>
 select advertiser_id, count(clicks)
 from processedevents
 where date = ‘20100819’ 
-group by adverstiser_id;
+group by advertiser_id;
 </source>
 
 </section>

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/inputoutput.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/inputoutput.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/inputoutput.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/inputoutput.xml Thu Mar 15 21:00:34 2012
@@ -28,60 +28,60 @@
   <title>Set Up</title>
   <p>No HCatalog-specific setup is required for the HCatInputFormat and HCatOutputFormat interfaces.</p>
   <p></p>
-<p><strong>Authentication</strong></p>
-<table>
-	<tr>
-	<td><p>If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
-	</tr>
-</table>
   </section>
 
 <!-- ==================================================================== -->
 <section>
 	<title>HCatInputFormat</title>
 	<p>The HCatInputFormat is used with MapReduce jobs to read data from HCatalog managed tables.</p> 
-	<p>HCatInputFormat exposes a new Hadoop 20 MapReduce API for reading data as if it had been published to a table. If a MapReduce job uses this InputFormat to write output, the default InputFormat configured for the table is used as the underlying InputFormat and the new partition is published to the table after the job completes. Also, the maximum number of partitions that a job can work on is limited to 100K.</p> 
+	<p>HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.</p> 
 	
 <section>
 	<title>API</title>
 	<p>The API exposed by HCatInputFormat is shown below.</p>
 	
-	<p>To use HCatInputFormat to read data, first instantiate a <code>HCatTableInfo</code> with the necessary information from the table being read 
-	and then call setInput on the <code>HCatInputFormat</code>.</p>
+	<p>To use HCatInputFormat to read data, first instantiate an <code>InputJobInfo</code> with the necessary information from the table being read 
+	and then call setInput with the <code>InputJobInfo</code>.</p>
 
-<p>You can use the <code>setOutputSchema</code> method to include a projection schema, to specify specific output fields. If a schema is not specified, this default to the table level schema.</p>
+<p>You can use the <code>setOutputSchema</code> method to specify a projection schema that
+limits the output to specific fields. If a schema is not specified, all the columns in the table
+will be returned. (A short usage sketch follows the method signatures below.)</p>
 
 <p>You can use the <code>getTableSchema</code> methods to determine the table schema for a specified input table.</p>
 	
-	
 <source>
-    /**
-     * Set the input to use for the Job. This queries the metadata server with
-     * the specified partition predicates, gets the matching partitions, puts
-     * the information in the conf object. The inputInfo object is updated with
-     * information needed in the client context
-     * @param job the job object
-     * @param inputInfo the table input info
-     * @throws IOException the exception in communicating with the metadata server
-     */
-    public static void setInput(Job job, HCatTableInfo inputInfo) throws IOException;
+  /**
+   * Set the input to use for the Job. This queries the metadata server with
+   * the specified partition predicates, gets the matching partitions, puts
+   * the information in the conf object. The inputInfo object is updated with
+   * information needed in the client context
+   * @param job the job object
+   * @param inputJobInfo the input info for table to read
+   * @throws IOException the exception in communicating with the metadata server
+   */
+  public static void setInput(Job job,
+      InputJobInfo inputJobInfo) throws IOException;
 
-    /**
-     * Set the schema for the HCatRecord data returned by HCatInputFormat.
-     * @param job the job object
-     * @param hcatSchema the schema to use as the consolidated schema
-     */
-    public static void setOutputSchema(Job job,HCatSchema hcatSchema) throws Exception;
+  /**
+   * Set the schema for the HCatRecord data returned by HCatInputFormat.
+   * @param job the job object
+   * @param hcatSchema the schema to use as the consolidated schema
+   */
+  public static void setOutputSchema(Job job,HCatSchema hcatSchema) 
+    throws IOException;
+
+  /**
+   * Gets the HCatTable schema for the table specified in the HCatInputFormat.setInput call
+   * on the specified job context. This information is available only after HCatInputFormat.setInput
+   * has been called for a JobContext.
+   * @param context the context
+   * @return the table schema
+   * @throws IOException if HCatInputFormat.setInput has not been called 
+   *                     for the current context
+   */
+  public static HCatSchema getTableSchema(JobContext context) 
+    throws IOException;	
 
-    /**
-     * Gets the HCatalog schema for the table specified in the HCatInputFormat.setInput call
-     * on the specified job context. This information is available only after HCatInputFormat.setInput
-     * has been called for a JobContext.
-     * @param context the context
-     * @return the table schema
-     * @throws Exception if HCatInputFormat.setInput has not been called for the current context
-     */
-    public static HCatSchema getTableSchema(JobContext context) throws Exception	
 </source>
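A minimal usage sketch of these calls (the database, table, and column names are hypothetical, and the HCatSchema list constructor and get-by-name accessor are assumed from the 0.4 API):
<source>
// assumes java.util.Arrays, org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job,
// and the org.apache.hcatalog.mapreduce and org.apache.hcatalog.data.schema imports
Configuration conf = new Configuration();
Job job = new Job(conf, "scan-web-logs");

// a null database name selects the default database; a null filter reads all partitions
HCatInputFormat.setInput(job, InputJobInfo.create(null, "web_logs", null));
job.setInputFormatClass(HCatInputFormat.class);

// project down to two columns; without setOutputSchema every column in the table is returned
HCatSchema tableSchema = HCatInputFormat.getTableSchema(job);
HCatSchema projection = new HCatSchema(Arrays.asList(
    tableSchema.get("userid"), tableSchema.get("viewtime")));
HCatInputFormat.setOutputSchema(job, projection);
</source>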
 	
 </section>
@@ -93,7 +93,8 @@
 	<title>HCatOutputFormat</title>
 	<p>HCatOutputFormat is used with MapReduce jobs to write data to HCatalog managed tables.</p> 
 	
-	<p>HCatOutputFormat exposes a new Hadoop 20 MapReduce API for writing data to a table. If a MapReduce job uses this OutputFormat to write output, the default OutputFormat configured for the table is used as the underlying OutputFormat and the new partition is published to the table after the job completes. </p>
+	<p>HCatOutputFormat exposes a Hadoop 0.20 MapReduce API for writing data to a table.
+    When a MapReduce job uses HCatOutputFormat to write output, the default OutputFormat configured for the table is used and the new partition is published to the table after the job completes. </p>
 	
 <section>
 	<title>API</title>
@@ -101,52 +102,204 @@
 	<p>The first call on the HCatOutputFormat must be <code>setOutput</code>; any other call will throw an exception saying the output format is not initialized. The schema for the data being written out is specified by the <code>setSchema </code> method. You must call this method, providing the schema of data you are writing. If your data has same schema as table schema, you can use HCatOutputFormat.getTableSchema() to get the table schema and then pass that along to setSchema(). </p>
 	
 <source>
-/**
+    /**
      * Set the info about the output to write for the Job. This queries the metadata server
      * to find the StorageDriver to use for the table.  Throws error if partition is already published.
      * @param job the job object
-     * @param outputInfo the table output info
+     * @param outputJobInfo the table output info
      * @throws IOException the exception in communicating with the metadata server
      */
-    public static void setOutput(Job job, HCatTableInfo outputInfo) throws IOException;
+    @SuppressWarnings("unchecked")
+    public static void setOutput(Job job, OutputJobInfo outputJobInfo) throws IOException;
 
     /**
      * Set the schema for the data being written out to the partition. The
      * table schema is used by default for the partition if this is not called.
      * @param job the job object
      * @param schema the schema for the data
-     * @throws IOException the exception
      */
-    public static void setSchema(Job job, HCatSchema schema) throws IOException;
+    public static void setSchema(final Job job, final HCatSchema schema) throws IOException;
+
+  /**
+   * Gets the table schema for the table specified in the HCatOutputFormat.setOutput call
+   * on the specified job context.
+   * @param context the context
+   * @return the table schema
+   * @throws IOException if HCatOutputFormat.setOutput has not been called for the passed context
+   */
+  public static HCatSchema getTableSchema(JobContext context) throws IOException;
 
-    /**
-     * Gets the table schema for the table specified in the HCatOutputFormat.setOutput call
-     * on the specified job context.
-     * @param context the context
-     * @return the table schema
-     * @throws IOException if HCatOutputFormat.setOutput has not been called for the passed context
-     */
-    public static HCatSchema getTableSchema(JobContext context) throws IOException
 </source>
 </section>   
 
+</section>
+
 <section>
-	<title>Partition Schema Semantics</title>
-	
-	<p>The partition schema specified can be different from the current table level schema. The rules about what kinds of schema are allowed are:</p>
-	
-	<ul>
-	<li>If a column is present in both the table schema and the partition schema, the type for the column should match. 
-</li>
-	<li>If the partition schema has lesser columns that the table level schema, then only the columns at the end of the table schema are allowed to be absent. Columns in the middle cannot be absent. So if table schema is "c1,c2,c3", partition schema can be "c1" or "c1,c2" but not "c1,c3" or "c2,c3"</li>
-	<li>If the partition schema has extra columns, then the extra columns should appear after the table schema. So if table schema is "c1,c2", the partition schema can be "c1,c2,c3" but not "c1,c3,c4". The table schema is automatically updated to have the extra column. In the previous example, the table schema will become "c1,c2,c3" after the completion of the job. 
-</li>
-	<li>The partition keys are not allowed to be present in the schema being written out. 
-</li>
-	</ul>
-	
-</section>   
+<title>Examples</title>
+
+
+<p><strong>Running MapReduce with HCatalog</strong></p>
+<p>
+Your MapReduce program needs to know which Thrift server to connect to.  The
+easiest way to do this is to pass it as an argument to your Java program.  You will also need to
+pass the Hive and HCatalog jars to MapReduce, via the -libjars argument.</p>
+
+
+<source>
+export HADOOP_HOME=&lt;path_to_hadoop_install&gt;
+export HCAT_HOME=&lt;path_to_hcat_install&gt;
+export LIB_JARS=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar,$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar,$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar,$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar,$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar,$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar,$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar,$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar,$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar
+export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar:$HCAT_HOME/share/hcatalog/lib/antlr-runtime-3.0.1.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-connectionpool-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-core-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-enhancer-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/datanucleus-rdbms-2.0.3.jar:$HCAT_HOME/share/hcatalog/lib/commons-dbcp-1.4.jar:$HCAT_HOME/share/hcatalog/lib/commons-pool-1.5.4.jar:$HCAT_HOME/etc/hcatalog
+
+$HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/conf jar &lt;path_to_jar&gt;
+&lt;main_class&gt; -libjars $LIB_JARS &lt;program_arguments&gt;
+</source>
+
+<p><strong>Authentication</strong></p>
+<table>
+	<tr>
+	<td><p>If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
+	</tr>
+</table>
+
+<p><strong>Examples</strong></p>
+
+<p>
+The following very simple MapReduce program reads data from one table, which it assumes to have an integer in the
+second column, and counts how many rows it sees for each distinct value.   That is, it does the
+equivalent of <code>select col1, count(*) from $table group by col1;</code>.
+</p>
+
+<source>
+public class GroupByAge extends Configured implements Tool {
+
+    public static class Map extends
+            Mapper&lt;WritableComparable, HCatRecord, IntWritable, IntWritable&gt; {
+
+        int age;
+
+        @Override
+        protected void map(
+                WritableComparable key,
+                HCatRecord value,
+                org.apache.hadoop.mapreduce.Mapper&lt;WritableComparable, HCatRecord, IntWritable, IntWritable&gt;.Context context)
+                throws IOException, InterruptedException {
+            age = (Integer) value.get(1);
+            context.write(new IntWritable(age), new IntWritable(1));
+        }
+    }
+
+    public static class Reduce extends Reducer&lt;IntWritable, IntWritable,
+    WritableComparable, HCatRecord&gt; {
+
+
+      @Override 
+      protected void reduce(IntWritable key, java.lang.Iterable&lt;IntWritable&gt;
+        values, org.apache.hadoop.mapreduce.Reducer&lt;IntWritable,IntWritable,WritableComparable,HCatRecord&gt;.Context context)
+        throws IOException ,InterruptedException {
+          int sum = 0;
+          Iterator&lt;IntWritable&gt; iter = values.iterator();
+          while (iter.hasNext()) {
+              sum++;
+              iter.next();
+          }
+          HCatRecord record = new DefaultHCatRecord(2);
+          record.set(0, key.get());
+          record.set(1, sum);
+
+          context.write(null, record);
+        }
+    }
+
+    public int run(String[] args) throws Exception {
+        Configuration conf = getConf();
+        args = new GenericOptionsParser(conf, args).getRemainingArgs();
+
+        String inputTableName = args[0];
+        String outputTableName = args[1];
+        String dbName = null;
+
+        Job job = new Job(conf, "GroupByAge");
+        HCatInputFormat.setInput(job, InputJobInfo.create(dbName,
+                inputTableName, null));
+        // initialize HCatOutputFormat
+
+        job.setInputFormatClass(HCatInputFormat.class);
+        job.setJarByClass(GroupByAge.class);
+        job.setMapperClass(Map.class);
+        job.setReducerClass(Reduce.class);
+        job.setMapOutputKeyClass(IntWritable.class);
+        job.setMapOutputValueClass(IntWritable.class);
+        job.setOutputKeyClass(WritableComparable.class);
+        job.setOutputValueClass(DefaultHCatRecord.class);
+        HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName,
+                outputTableName, null));
+        HCatSchema s = HCatOutputFormat.getTableSchema(job);
+        System.err.println("INFO: output schema explicitly set for writing:"
+                + s);
+        HCatOutputFormat.setSchema(job, s);
+        job.setOutputFormatClass(HCatOutputFormat.class);
+        return (job.waitForCompletion(true) ? 0 : 1);
+    }
+
+    public static void main(String[] args) throws Exception {
+        int exitCode = ToolRunner.run(new GroupByAge(), args);
+        System.exit(exitCode);
+    }
+}
+</source>
+
+<p>Notice a number of important points about this program:
+<br></br><br></br>
+1) The implementation of Map takes HCatRecord as an input and the implementation of Reduce produces it as an output.
+<br></br>
+2) This example program assumes the schema of the input, but it could also retrieve the schema via
+HCatInputFormat.getTableSchema() and retrieve fields based on the results of that call.
+<br></br>
+3) The input descriptor for the table to be read is created by calling InputJobInfo.create.  It requires the database name,
+table name, and partition filter.  In this example the partition filter is null, so all partitions of the table
+will be read.
+<br></br>
+4) The output descriptor for the table to be written is created by calling OutputJobInfo.create.  It requires the
+database name, the table name, and a Map of partition keys and values that describe the partition being written.
+In this example it is assumed the table is unpartitioned, so this Map is null.
+</p>
+
+<p>To scan just selected partitions of a table, a filter describing the desired partitions can be passed to
+InputJobInfo.create.  This filter can contain the operators '=', '&lt;', '&gt;', '&lt;=',
+'&gt;=', '&lt;&gt;', 'and', 'or', and 'like'.  Assume for example you have a web_logs
+table that is partitioned by the column datestamp.  You could select one partition of the table by changing</p>
+<source>
+HCatInputFormat.setInput(job, InputJobInfo.create(dbName, inputTableName, null));
+</source>
+<p>
+to
+</p>
+<source>
+HCatInputFormat.setInput(job,
+    InputJobInfo.create(dbName, inputTableName, "datestamp=\"20110924\""));
+  </source>
+<p>
+This filter must reference only partition columns.  Values from other columns will cause the job to fail.</p>
+<p>
+To write to a single partition you can change the above example to have a Map of key value pairs that describe all
+of the partition keys and values for that partition.  In our example web_logs table, there is only one partition
+column (datestamp), so our Map will have only one entry.  Change </p>
+<source>
+HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
+</source>
+<p>to </p>
+<source>
+Map&lt;String, String&gt; partitions = new HashMap&lt;String, String&gt;(1);
+partitions.put("datestamp", "20110924");
+HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, partitions));
+</source>
+
+<p>To write multiple partitions simultaneously you can leave the Map null, but all of the partitioning columns must be present in the data you are writing.
+</p>
+
 </section>
 
+
   </body>
 </document>

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/install.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/install.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/install.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/install.xml Thu Mar 15 21:00:34 2012
@@ -24,23 +24,62 @@
   <body>
 
   <section>
-    <title>Server Installation</title>
+    <title>Server Installation from Source</title>
 
     <p><strong>Prerequisites</strong></p>
     <ul>
+        <li>Machine to build the installation tar on</li>
         <li>Machine on which the server can be installed - this should have
-        access to the hadoop cluster in question, and be accessible from
+        access to the Hadoop cluster in question, and be accessible from
         the machines you launch jobs from</li>
-        <li>MySQL db</li>
+        <li>An RDBMS - we recommend MySQL and provide instructions for it</li>
         <li>Hadoop cluster</li>
-        <li>Unix user that the server will run as, and an associated kerberos
-        service principal and keytabs.</li>
+        <li>Unix user that the server will run as, and, if you are running your
+        cluster in secure mode, an associated Kerberos service principal and keytabs.</li>
     </ul>
 
     <p>Throughout these instructions when you see a word in <em>italics</em> it
     indicates a place where you should replace the word with a locally 
     appropriate value such as a hostname or password.</p>
 
+    <p><strong>Building a tarball </strong></p>
+
+    <p>If you downloaded HCatalog from Apache or another site as a source release,
+    you will need to first build a tarball to install.  You can tell if you have
+    a source release by looking at the name of the object you downloaded.  If
+    it is named hcatalog-src-0.4.0-incubating.tar.gz (notice the
+    <strong>src</strong> in the name) then you have a source release.</p>
+    
+    <p>If you do not already have Apache Ant installed on your machine, you 
+    will need to obtain it.  You can get it from the <a href="http://ant.apache.org/">
+    Apache Ant website</a>.  Once you download it, you will need to unpack it
+    somewhere on your machine.  The directory where you unpack it will be referred
+    to as <em>ant_home</em> in this document.</p>
+
+    <p>If you do not already have Apache Forrest installed on your machine, you 
+    will need to obtain it.  You can get it from the <a href="http://forrest.apache.org/">
+    Apache Forrest website</a>.  Once you download it, you will need to unpack 
+    it somewhere on your machine.  The directory where you unpack it will be referred
+    to as <em>forrest_home</em> in this document.</p>
+    
+    <p>To produce a tarball from this source release, do the following:</p>
+
+    <p>Create a directory to expand the source release in.  Copy the source
+    release to that directory and unpack it.</p>
+    <p><code>mkdir /tmp/hcat_source_release</code></p>
+    <p><code>cp hcatalog-src-0.4.0-incubating.tar.gz /tmp/hcat_source_release</code></p>
+    <p><code>cd /tmp/hcat_source_release</code></p>
+    <p><code>tar xzf hcatalog-src-0.4.0-incubating.tar.gz</code></p>
+
+    <p>Change directories into the unpacked source release and build the
+    installation tarball.</p>
+    <p><code>cd hcatalog-src-0.4.0-incubating</code></p>
+    <p><em>ant_home</em><code>/bin/ant -Dhcatalog.version=0.4.0
+    -Dforrest.home=</code><em>forrest_home</em><code> tar </code></p>
+
+    <p>The tarball for installation should now be at
+    <code>build/hcatalog-0.4.0.tar.gz</code></p>
+
     <p><strong>Database Setup</strong></p>
 
     <p>Select a machine to install the database on.  This need not be the same
@@ -65,13 +104,13 @@
     <p><code>mysql> flush privileges;</code></p>
     <p><code>mysql> quit;</code></p>
 
-    <p>In a temporary directory, untar the HCatalog artifact</p>
+    <p>In a temporary directory, untar the HCatalog installation tarball.</p>
 
-    <p><code>tar xzf hcatalog-</code><em>version</em><code>.tar.gz</code></p>
+    <p><code>tar xzf hcatalog-0.4.0.tar.gz</code></p>
 
     <p>Use the database installation script found in the package to create the
-    database</p>
-    <p><code>mysql -u hive -D hivemetastoredb -h</code><em>hcatdb.acme.com</em><code> -p &lt; share/hcatalog/hive/external/metastore/scripts/upgrade/mysql/hive-schema-0.7.0.mysql.sql</code></p>
+    database.</p>
+    <p><code>mysql -u hive -D hivemetastoredb -h</code><em>hcatdb.acme.com</em><code> -p &lt; share/hcatalog/hive/external/metastore/scripts/upgrade/mysql/hive-schema-0.8.0.mysql.sql</code></p>
 
     <p><strong>Thrift Server Setup</strong></p>
 
@@ -88,14 +127,16 @@
     <p>Select a user to run the Thrift server as.  This user should not be a
     human user, and must be able to act as a proxy for other users.  We suggest
     the name "hcat" for the user.  Throughout the rest of this documentation 
-    we will refer to this user as "hcat".  If necessary, add the user to 
+    we will refer to this user as <em>hcat</em>.  If necessary, add the user to 
     <em>hcatsvr.acme.com</em>.</p>
 
     <p>Select a <em>root</em> directory for your installation of HCatalog.  This 
-    directory must be owned by the hcat user.  We recommend
-    <code>/usr/local/hcat</code>.  If necessary, create the directory.</p>
+    directory must be owned by the <em>hcat</em> user.  We recommend
+    <code>/usr/local/hcat</code>.  If necessary, create the directory.  You will
+    need to be the <em>hcat</em> user for the operations described in the remainder
+    of this Thrift Server Setup section.</p>
 
-    <p>Download the HCatalog release into a temporary directory, and untar
+    <p>Copy the HCatalog installation tarball into a temporary directory, and untar
     it.  Then change directories into the new distribution and run the HCatalog
     server installation script.  You will need to know the directory you chose
     as <em>root</em> and the
@@ -105,8 +146,8 @@
     the port number you wish HCatalog to operate on which you will use to set
     <em>portnum</em>.</p>
 
-    <p><code>tar zxf hcatalog-</code><em>version</em><code>.tar.gz
-    cd hcatalog-</code><em>version</em></p>
+    <p><code>tar zxf hcatalog-0.4.0.tar.gz</code></p>
+    <p><code>cd hcatalog-0.4.0</code></p>
     <p><code>share/hcatalog/scripts/hcat_server_install.sh -r </code><em>root</em><code> -d </code><em>dbroot</em><code> -h </code><em>hadoop_home</em><code> -p </code><em>portnum</em></p>
 
     <p>Now you need to edit your <em>root</em><code>/etc/hcatalog/hive-site.xml</code> file.
@@ -126,40 +167,41 @@
         <tr>
             <td>javax.jdo.option.ConnectionPassword</td>
             <td><em>dbpassword</em> value you used in setting up the MySQL server
-            above</td>
+            above.</td>
         </tr>
         <tr>
             <td>hive.metastore.warehouse.dir</td>
             <td>The directory can be a URI or an absolute file path. If it is an absolute file path, it will be resolved to a URI by the metastore:
             <p>-- If default hdfs was specified in core-site.xml, path resolves to HDFS location. </p>
             <p>-- Otherwise, path is resolved as local file: URI.</p>
-            <p>This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).</p>
+            <p>This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).</p>
             </td>
         </tr>
         <tr>
             <td>hive.metastore.uris</td>
-            <td>You need to set the hostname to your Thrift
-            server.  Replace <em>SVRHOST</em> with the name of the
+            <td>Set the hostname of your Thrift
+            server by replacing <em>SVRHOST</em> with the name of the
             machine you are installing the Thrift server on.  You can also
             change the port the Thrift server runs on by changing the default
             value of 3306.</td>
         </tr>
         <tr>
             <td>hive.metastore.sasl.enabled</td>
-            <td>Set to true by default.  Set to false if you do not wish to
-            secure the thrift interface.  This can be convenient for testing.
-            We do not recommend turning this off in production.</td>
+            <td>Set to true if you are using Kerberos security with your Hadoop
+            cluster, false otherwise.</td>
         </tr>
         <tr>
             <td>hive.metastore.kerberos.keytab.file</td>
-            <td>The path to the Kerberos keytab file containg the metastore
-            thrift server's service principal.</td>
+            <td>The path to the Kerberos keytab file containing the metastore
+            Thrift server's service principal.  Only required if you set
+            hive.metastore.sasl.enabled above to true.</td>
         </tr>
         <tr>
             <td>hive.metastore.kerberos.principal</td>
-            <td>The service principal for the metastore thrift server.  You can
+            <td>The service principal for the metastore Thrift server.  You can
             reference your host as _HOST and it will be replaced with your
-            actual hostname</td>
+            actual hostname.  Only required if you set
+            hive.metastore.sasl.enabled above to true.</td>
         </tr>
     </table>
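+
+    <p>For example (this is only a sketch; the password, warehouse directory, hostname,
+    and port shown here are placeholders, and your generated file will contain other
+    properties as well), the edited entries in
+    <em>root</em><code>/etc/hcatalog/hive-site.xml</code> might look like:</p>
+
+<source>
+&lt;property&gt;
+  &lt;name&gt;javax.jdo.option.ConnectionPassword&lt;/name&gt;
+  &lt;value&gt;dbpassword&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;hive.metastore.warehouse.dir&lt;/name&gt;
+  &lt;value&gt;/user/hive/warehouse&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;hive.metastore.uris&lt;/name&gt;
+  &lt;value&gt;thrift://hcatsvr.acme.com:portnum&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;hive.metastore.sasl.enabled&lt;/name&gt;
+  &lt;value&gt;false&lt;/value&gt;
+&lt;/property&gt;
+</source>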
 
@@ -171,7 +213,6 @@
       without parameters to see a full description of the usage of the script.
       Users looking to automate the hcat installation should look to leverage
       this script.
-      This script is highly recommended for automation.
     </p>
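+      <p>For example, to print the usage description you can run the script with no
+      arguments:</p>
+      <p><code>share/hcatalog/scripts/hcat_server_install.sh</code></p>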
     <p>You can now proceed to starting the server.</p>
   </section>
@@ -180,15 +221,14 @@
     <title>Starting the Server</title>
             
     <p>Start the HCatalog server by switching directories to
-    <em>root</em> and invoking the start script
-    <code>share/hcatalog/scripts/hcat_server_start.sh</code></p>
+    <em>root</em> and invoking <code>sbin/hcat_server.sh start</code>.</p>
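+
+    <p>For example, if you chose <code>/usr/local/hcat</code> as <em>root</em>:</p>
+    <p><code>cd /usr/local/hcat</code></p>
+    <p><code>sbin/hcat_server.sh start</code></p>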
 
   </section>
 
   <section>
     <title>Logging</title>
 
-    <p>Server activity logs and gc logs are located in
+    <p>Server activity logs are located in
     <em>root</em><code>/var/log/hcat_server</code>.  Logging configuration is located at
     <em>root</em><code>/conf/log4j.properties</code>.  Server logging uses
     <code>DailyRollingFileAppender</code> by default. It will generate a new
@@ -200,8 +240,7 @@
     <title>Stopping the Server</title>
 
     <p>To stop the HCatalog server, change directories to the <em>root</em>
-    directory and invoke the stop script
-    <code>share/hcatalog/scripts/hcat_server_stop.sh</code></p>
+    directory and invoke <code>sbin/hcat_server.sh stop</code>.</p>
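+
+    <p>For example, if <em>root</em> is <code>/usr/local/hcat</code>:</p>
+    <p><code>cd /usr/local/hcat</code></p>
+    <p><code>sbin/hcat_server.sh stop</code></p>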
 
   </section>
 
@@ -211,7 +250,7 @@
     <p>Select a <em>root</em> directory for your installation of HCatalog client.
     We recommend <code>/usr/local/hcat</code>.  If necessary, create the directory.</p>
 
-    <p>Download the HCatalog release into a temporary directory, and untar
+    <p>Copy the HCatalog installation tarball into a temporary directory, and untar
     it.</p>
 
     <p><code>tar zxf hcatalog-</code><em>version</em><code>.tar.gz</code></p>
@@ -233,13 +272,13 @@
             <td>The directory can be a URI or an absolute file path. If it is an absolute file path, it will be resolved to a URI by the metastore:
             <p>-- If default hdfs was specified in core-site.xml, path resolves to HDFS location. </p>
             <p>-- Otherwise, path is resolved as local file: URI.</p>
-            <p>This setting becomes effective when creating new tables (takes precedence over default DBS.DB_LOCATION_URI at time of table creation).</p>
+            <p>This setting becomes effective when creating new tables (it takes precedence over default DBS.DB_LOCATION_URI at the time of table creation).</p>
             </td>
         </tr>
         <tr>
             <td>hive.metastore.uris</td>
-            <td>You need to set the hostname wish your Thrift
-            server to use by replacing <em>SVRHOST</em> with the name of the
+            <td>Set the hostname of your Thrift
+            server by replacing <em>SVRHOST</em> with the name of the
             machine you are installing the Thrift server on.  You can also
             change the port the Thrift server runs on by changing the default
             value of 3306.</td>

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/loadstore.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/loadstore.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/loadstore.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/loadstore.xml Thu Mar 15 21:00:34 2012
@@ -27,30 +27,8 @@
   <section>
   <title>Set Up</title>
   
-<p>The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog managed tables. If you run your Pig script using the "pig" command (the bin/pig Perl script) no set up is required. </p>
-<source>
-$ pig mypig.script
-</source>    
-    
-   <p> If you run your Pig script using the "java" command (java -cp pig.jar...), then the hcat jar needs to be included in the classpath of the java command line (using the -cp option). Additionally, the following properties are required in the command line: </p>
-    <ul>
-		<li>-Dhive.metastore.uris=thrift://&lt;hcatalog server hostname&gt;:9080 </li>
-		<li>-Dhive.metastore.kerberos.principal=&lt;hcatalog server kerberos principal&gt; </li>
-	</ul>
-	
-<source>
-$ java -cp pig.jar hcatalog.jar
-     -Dhive.metastore.uris=thrift://&lt;hcatalog server hostname&gt;:9080 
-     -Dhive.metastore.kerberos.principal=&lt;hcatalog server kerberos principal&gt; myscript.pig
-</source>
-<p></p>
+<p>The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog managed tables.</p>
 <p><strong>Authentication</strong></p>
-<table>
-	<tr>
-	<td><p>If a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
-	</tr>
-</table>
-
 </section>
   
       
@@ -62,20 +40,24 @@ $ java -cp pig.jar hcatalog.jar
 <title>Usage</title>
 <p>HCatLoader is accessed via a Pig load statement.</p>	
 <source>
-A = LOAD 'dbname.tablename' USING org.apache.hcatalog.pig.HCatLoader(); 
+A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader(); 
 </source>
 
     <p><strong>Assumptions</strong></p>	  
-    <p>You must specify the database name and table name using this format: 'dbname.tablename'. Both the database and table must be created prior to running your Pig script. The Hive metastore lets you create tables without specifying a database; if you created tables this way, then the database name is 'default' and the string becomes 'default.tablename'. </p>
+    <p>You must specify the table name in single quotes: LOAD 'tablename'. If you are using a non-default database you must specify your input as 'dbname.tablename'. If you are using Pig 0.9.2 or earlier, you must create your database and table prior to running the Pig script. Beginning with Pig 0.10 you can issue these create commands in Pig using the SQL command.</p>
+    <p>The Hive metastore lets you create tables without specifying a database; if you
+    created tables this way, then the database name is 'default' and is not required when
+    specifying the table for HCatLoader. </p>
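+    <p>For example (the database and table names here are only illustrations), a load
+    from a table in a non-default database would look like:</p>
+<source>
+A = LOAD 'mydb.mytable' USING org.apache.hcatalog.pig.HCatLoader();
+</source>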
     <p>If the table is partitioned, you can indicate which partitions to scan by immediately following the load statement with a partition filter statement 
-    (see <a href="#Examples">Examples</a>). </p>
- </section>   
+    (see <strong>Load Examples</strong>). </p>
+ </section> 
+
+ <!-- ==================================================================== --> 
 <section> 
 <title>HCatalog Data Types</title>
 <p>Restrictions apply to the types of columns HCatLoader can read.</p>
 <p>HCatLoader  can read <strong>only</strong> the data types listed in the table. 
 The table shows how Pig will interpret the HCatalog data type.</p>
-<p>(Note: HCatalog does not support type Boolean.)</p>
    <table>
         <tr>
             <td>
@@ -90,12 +72,12 @@ The table shows how Pig will interpret t
                <p>primitives (int, long, float, double, string) </p>
             </td>
             <td>
-               <p>int, long, float, double <br></br> string to chararray</p>
+               <p>int, long, float, double; string to chararray </p>
             </td>
     </tr>
     <tr>
             <td>
-               <p>map (key type should be string, valuetype can be a primitive listed above)</p>
+               <p>map (key type should be string, value type must be string)</p>
             </td>
             <td>
                <p>map </p>
@@ -103,72 +85,115 @@ The table shows how Pig will interpret t
     </tr>
     <tr>
             <td>
-               <p>List&lt;primitive&gt; or List&lt;map&gt; where map is of the type noted above </p>
+               <p>List&lt;any type&gt; </p>
             </td>
             <td>
-               <p>bag, with the primitive or map type as the field in each tuple of the bag </p>
+               <p>bag </p>
             </td>
     </tr>
     <tr>
             <td>
-               <p>struct&lt;primitive fields&gt; </p>
+               <p>struct&lt;any type fields&gt; </p>
             </td>
             <td>
                <p>tuple </p>
             </td>
     </tr>
-    <tr>
-            <td>
-               <p>List&lt;struct&lt;primitive fields&gt;&gt; </p>
-            </td>
-            <td>
-               <p>bag, where each tuple in the bag maps to struct &lt;primitive fields&gt; </p>
-            </td>
-    </tr>
  </table>
 </section> 
 
+ <!-- ==================================================================== --> 
 <section> 
-<title>Examples</title>
+<title>Running Pig with HCatalog</title>
+
+<p>Pig does not automatically pick up the HCatalog jars, so you will need to tell Pig where they are.
+These include the Hive jars used by the HCatalog client. To do this, define the environment
+variable PIG_CLASSPATH with the appropriate jars. HCatalog can tell you the jars it needs, but to do so it
+needs to know where Hadoop is installed. You also need to tell Pig the URI for your metastore, in the PIG_OPTS
+variable. In the case where you have installed Hadoop and HCatalog via tar, you can do:</p>
+
+<source>
+export HADOOP_HOME=&lt;path_to_hadoop_install&gt;
+export HCAT_HOME=&lt;path_to_hcat_install&gt;
+export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HADOOP_HOME/conf:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar
+export PIG_OPTS=-Dhive.metastore.uris=thrift://&lt;hostname&gt;:&lt;port&gt;
+
+&lt;path_to_pig_install&gt;/bin/pig -Dpig.additional.jars=$HCAT_HOME/share/hcatalog/hcatalog-0.4.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-metastore-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libthrift-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/hive-exec-0.8.1.jar:$HCAT_HOME/share/hcatalog/lib/libfb303-0.7.0.jar:$HCAT_HOME/share/hcatalog/lib/jdo2-api-2.3-ec.jar:$HCAT_HOME/etc/hcatalog:$HCAT_HOME/share/hcatalog/lib/slf4j-api-1.6.1.jar &lt;script.pig&gt;
+</source>
+
+<table>
+	<tr>
+	<td><p>If you are using a secure cluster and a failure results in a message like "2010-11-03 16:17:28,225 WARN hive.metastore ... - Unable to connect metastore with URI thrift://..." in /tmp/&lt;username&gt;/hive.log, then make sure you have run "kinit &lt;username&gt;@FOO.COM" to get a Kerberos ticket and to be able to authenticate to the HCatalog server. </p></td>
+	</tr>
+</table>
+
+
+</section>
+
+ <!-- ==================================================================== --> 
+<section> 
+<title>Load Examples</title>
+
+
 <p>This load statement will load all partitions of the specified table.</p>
 <source>
 /* myscript.pig */
-A = LOAD 'dbname.tablename' USING org.apache.hcatalog.pig.HCatLoader(); 
+A = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader(); 
 ...
 ...
 </source>
-<p>If only some partitions of the specified table are needed, include a partition filter statement <strong>immediately</strong> following the load statement. 
-The filter statement can include conditions on partition as well as non-partition columns.</p>
+<p>If only some partitions of the specified table are needed, include a partition filter statement <strong>immediately</strong> following the load statement in the data flow. (In the script, however, a filter statement might not immediately follow its load statement.) The filter statement can include conditions on partition as well as non-partition columns.</p>
 <source>
 /* myscript.pig */
-A = LOAD 'dbname.tablename' USING  org.apache.hcatalog.pig.HCatLoader();
- 
-B = filter A by date == '20100819' and by age &lt; 30; -- datestamp is a partition column; age is not
- 
-C = filter A by date == '20100819' and by country == 'US'; -- datestamp and country are partition columns
+A = LOAD 'tablename' USING  org.apache.hcatalog.pig.HCatLoader();
+
+-- date is a partition column; age is not
+
+B = filter A by date == '20100819' and age &lt; 30; 
+
+-- both date and country are partition columns
+
+C = filter A by date == '20100819' and country == 'US'; 
 ...
 ...
 </source>
 
-<p>Certain combinations of conditions on partition and non-partition columns are not allowed in filter statements.
-For example, the following script results in this error message:  <br></br> <br></br>
-<code>ERROR 1112: Unsupported query: You have an partition column (datestamp ) in a construction like: (pcond and ...) or ( pcond and ...) where pcond is a condition on a partition column.</code> <br></br> <br></br>
-A workaround is to restructure the filter condition by splitting it into multiple filter conditions, with the first condition immediately following the load statement.
-</p>
+<p>To scan a whole table:</p>
 
 <source>
-/* This script produces an ERROR */
+a = load 'student_data' using org.apache.hcatalog.pig.HCatLoader();
+b = foreach a generate name, age;
+</source>
 
-A = LOAD 'default.search_austria' USING org.apache.hcatalog.pig.HCatLoader();
-B = FILTER A BY
-    (   (datestamp &lt; '20091103' AND browser &lt; 50)
-     OR (action == 'click' and browser &gt; 100)
-    );
-...
-...
+
+<p>Notice that the schema is automatically provided to Pig; there is no need to declare name and age as fields, as
+you would if you were loading from a file.</p>
+
+<p>Example of scanning a single partition. Assume the table web_logs is partitioned by the column datestamp:</p>
+
+<source>
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp == '20110924';
+</source>
+
+<p>Pig will push the datestamp filter shown here to HCatalog, so that HCat knows to just scan the partition where
+datestamp = '20110924'. You can combine this filter with others via 'and':</p>
+
+<source>
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp == '20110924' and user is not null;
+</source>
+
+<p>Pig will split the above filter, pushing the datestamp portion to HCatalog and retaining the user is not null part
+to apply itself. You can also give a more complex filter to retrieve a set of partitions:</p>
+
+<source>
+a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
+b = filter a by datestamp &gt;= '20110924' and datestamp &lt;= '20110925';
 </source>
 
 </section> 
+ 
 </section> 
 	
 <!-- ==================================================================== -->	
@@ -189,32 +214,51 @@ B = FOREACH A ...
 ...
 my_processed_data = ...
 
-STORE my_processed_data INTO 'dbname.tablename' 
-    USING org.apache.hcatalog.pig.HCatStorer('month=12,date=25,hour=0300','a:int,b:chararray,c:map[]');
+STORE my_processed_data INTO 'tablename' USING
+ org.apache.hcatalog.pig.HCatStorer();
 </source>
 
 <p><strong>Assumptions</strong></p>
 
-<p>You must specify the database name and table name using this format: 'dbname.tablename'. Both the database and table must be created prior to running your Pig script. The Hive metastore lets you create tables without specifying a database; if you created tables this way, then the database name is 'default' and string becomes 'default.tablename'. </p>
-
-<p>For the USING clause, you can have two string arguments: </p>	
-<ul>
-<li>The first string argument represents key/value pairs for partition. This is a mandatory argument. In the above example, month, date and hour are columns on which table is partitioned. 
-The values for partition keys should NOT be quoted, even if the partition key is defined to be of string type. 
-</li>
-<li>The second string argument is the Pig schema for the data that will be written. This argument is optional, and if no schema is specified, a schema will be computed by Pig. If a schema is provided, it must match with the schema computed by Pig. (See also: <a href="inputoutput.html#Partition+Schema+Semantics">Partition Schema Semantics</a>.)</li>
-</ul>
+<p>You must specify the table name in single quotes: STORE ... INTO 'tablename'. If you are using a non-default database you must specify your output as 'dbname.tablename'. If you are using Pig 0.9.2 or earlier, you must create the database and table prior to running the Pig script. Beginning with Pig 0.10 you can issue these create commands in Pig using the SQL command. </p>
+<p>The Hive metastore lets you create tables without specifying a database; if you created
+tables this way, then the database name is 'default' and you do not need to specify the
+database name in the store statement. </p>
+<p>For the USING clause, you can have a string argument that represents key/value pairs
+for the partition. This is a mandatory argument when you are writing to a partitioned table
+and the partition column is not in the output data.  The values for partition keys
+should NOT be quoted.</p>
+<p>If partition columns are present in the data, they need not be specified as a STORE argument. Instead, HCatalog will use these values to place records in the appropriate partition(s). It is valid to specify some partition keys in the STORE statement and to have other partition keys in the data.</p>
 <p></p>
 <p></p>
 
+</section> 
+<section> 
+<title>Store Examples</title>
+<p>You can write to a non-partitioned table simply by using HCatStorer.  The contents of the table will be overwritten:</p>
+
+<source>store z into 'student_data' using org.apache.hcatalog.pig.HCatStorer();</source>
+
+<p>To add one new partition to a partitioned table, specify the partition value in the store function.  Pay careful
+attention to the quoting: the whole string must be single quoted, with the key and value separated by an equals sign:</p>
+
+<source>store z into 'web_data' using org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');</source>
+
+<p>To write into multiple partitions at once, make sure that the partition column is present in your data, then call
+HCatStorer with no argument:</p>
+
+<source>store z into 'web_data' using org.apache.hcatalog.pig.HCatStorer();
+-- datestamp must be a field in the relation z</source>
+
+
 	</section>
-	
+
+ <!-- ==================================================================== --> 
     <section>
 	<title>HCatalog Data Types</title>
 	<p>Restrictions apply to the types of columns HCatStorer can write.</p>
 <p>HCatStorer can write <strong>only</strong> the data types listed in the table. 
 The table shows how Pig will interpret the HCatalog data type.</p>
-<p>(Note: HCatalog does not support type Boolean.)</p>
    <table>
         <tr>
             <td>
@@ -229,15 +273,12 @@ The table shows how Pig will interpret t
                <p>primitives (int, long, float, double, string) </p>
             </td>
             <td>
-               <p>int, long, float, double, string <br></br><br></br>
-               <strong>Note:</strong> HCatStorer does NOT support writing table columns of type smallint or tinyint. 
-               To be able to write form Pig using the HCatalog storer, table columns must by of type int or bigint.
-               </p>
+               <p>int, long, float, double; string to chararray </p>
             </td>
     </tr>
     <tr>
             <td>
-               <p>map (key type should be string, valuetype can be a primitive listed above)</p>
+               <p>map (key type should be string, value type must be string)</p>
             </td>
             <td>
                <p>map </p>
@@ -245,28 +286,20 @@ The table shows how Pig will interpret t
     </tr>
     <tr>
             <td>
-               <p>List&lt;primitive&gt; or List&lt;map&gt; where map is of the type noted above </p>
+               <p>List&lt;any type&gt; </p>
             </td>
             <td>
-               <p>bag, with the primitive or map type as the field in each tuple of the bag </p>
+               <p>bag </p>
             </td>
     </tr>
     <tr>
             <td>
-               <p>struct&lt;primitive fields&gt; </p>
+               <p>struct&lt;any type fields&gt; </p>
             </td>
             <td>
                <p>tuple </p>
             </td>
     </tr>
-    <tr>
-            <td>
-               <p>List&lt;struct&lt;primitive fields&gt;&gt; </p>
-            </td>
-            <td>
-               <p>bag, where each tuple in the bag maps to struct &lt;primitive fields&gt; </p>
-            </td>
-    </tr>
  </table>
 	</section>
 	

Modified: incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/notification.xml
URL: http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/notification.xml?rev=1301196&r1=1301195&r2=1301196&view=diff
==============================================================================
--- incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/notification.xml (original)
+++ incubator/hcatalog/trunk/src/docs/src/documentation/content/xdocs/notification.xml Thu Mar 15 21:00:34 2012
@@ -23,13 +23,13 @@
   </header>
   <body>
   
- <p> In HCatalog 2.0 we introduce notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events: </p>
+ <p>Beginning with version 0.2, HCatalog provides notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events: </p>
 <ul>
 <li>Notification when a new partition is added</li>
 <li>Notification when a set of partitions is added</li>
 </ul>
 
-<p>No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message. This means that your existing code, when running with 0.2, will automatically send the notifications. </p>
+<p>No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message.</p>
 
 <section>
 <title>Notification for a New Partition</title>
@@ -46,7 +46,7 @@ conn.start();
   <p>2. Subscribe to a topic you are interested in. When subscribing on a message bus, you need to subscribe to a particular topic to receive the messages that are being delivered on that topic. </p>
   <ul>
   <li>  
-  <p>The topic name corresponding to a particular table is stored in table properties and can be retrieved using following piece of code: </p>
+  <p>The topic name corresponding to a particular table is stored in table properties and can be retrieved using the following piece of code: </p>
  <source>
 HiveMetaStoreClient msc = new HiveMetaStoreClient(hiveConf);
 String topicName = msc.getTable("mydb", "myTbl").getParameters().get(HCatConstants.HCAT_MSGBUS_TOPIC_NAME);
@@ -76,14 +76,18 @@ consumer.setMessageListener(this);
    }
  </source>
  
-  <p>You need to have a JMS jar in your classpath to make this work. Additionally, you need to have a JMS provider’s jar in your classpath. HCatalog uses ActiveMQ as a JMS provider. In principle, any JMS provider can be used in client side; however, ActiveMQ is recommended. ActiveMQ can be obtained from: http://activemq.apache.org/activemq-550-release.html </p>
+  <p>You need to have a JMS jar and a JMS provider’s jar in your classpath to make this work. HCatalog is tested with ActiveMQ as the JMS provider, although any JMS provider can be used. ActiveMQ can be obtained from http://activemq.apache.org/activemq-550-release.html.</p>
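+  <p>For example (the application jar and listener class below are hypothetical, the
+  ActiveMQ jar name depends on the version you download, and the HCatalog and Hive client
+  jars your application already uses are omitted for brevity), you might run your listener
+  like this:</p>
+<source>
+java -cp myapp.jar:activemq-all-5.5.0.jar com.example.MyNotificationListener
+</source>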
 </section>
 
 <section>
 <title>Notification for a Set of Partitions</title>
 
+<p>Sometimes a user wants to wait until a collection of partitions is finished. For example, you may want to start processing after all partitions for a day are done. However, HCatalog has no notion of collections or hierarchies of partitions. To support this, HCatalog allows data writers to signal when they are finished writing a collection of partitions. Data readers may wait for this signal before beginning to read.</p>
+
 <p>The example code below illustrates how to send a notification when a set of partitions has been added.</p>
 
+<p>To signal, a data writer does this:</p>
+
 <source>
 HiveMetaStoreClient msc = new HiveMetaStoreClient(conf);
 
@@ -138,8 +142,8 @@ public void onMessage(Message msg) {
   MapMessage mapMsg = (MapMessage)msg;
   Enumeration&lt;String&gt; keys = mapMsg.getMapNames();
   
-  // Enumerate over all keys. This will print key value pairs specifying the particular partition 
-  // which was marked done. In this case, it will print:
+  // Enumerate over all keys. This will print key-value pairs specifying the  
+  // particular partition which was marked done. In this case, it will print:
   // date : 20110711
   // country: *