Posted to commits@sqoop.apache.org by ja...@apache.org on 2015/04/09 03:07:29 UTC

sqoop git commit: SQOOP-1758: Sqoop2: HDFS connector documentation

Repository: sqoop
Updated Branches:
  refs/heads/sqoop2 33e8a49ef -> 06f79093f


SQOOP-1758: Sqoop2: HDFS connector documentation

(Abraham Elmahrek via Jarek Jarcec Cecho)


Project: http://git-wip-us.apache.org/repos/asf/sqoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/sqoop/commit/06f79093
Tree: http://git-wip-us.apache.org/repos/asf/sqoop/tree/06f79093
Diff: http://git-wip-us.apache.org/repos/asf/sqoop/diff/06f79093

Branch: refs/heads/sqoop2
Commit: 06f79093f53b604747a8802aa241208bf93019ad
Parents: 33e8a49
Author: Jarek Jarcec Cecho <ja...@apache.org>
Authored: Wed Apr 8 18:07:13 2015 -0700
Committer: Jarek Jarcec Cecho <ja...@apache.org>
Committed: Wed Apr 8 18:07:13 2015 -0700

----------------------------------------------------------------------
 docs/src/site/sphinx/Connectors.rst | 148 ++++++++++++++++++++++++++++++-
 1 file changed, 145 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/sqoop/blob/06f79093/docs/src/site/sphinx/Connectors.rst
----------------------------------------------------------------------
diff --git a/docs/src/site/sphinx/Connectors.rst b/docs/src/site/sphinx/Connectors.rst
index 7b12016..cc71ff5 100644
--- a/docs/src/site/sphinx/Connectors.rst
+++ b/docs/src/site/sphinx/Connectors.rst
@@ -57,7 +57,7 @@ Inputs associated with the link configuration include:
 +-----------------------------+---------+-----------------------------------------------------------------------+------------------------------------------+
 | JDBC Connection Properties  | Map     | A map of JDBC connection properties to pass to the JDBC driver        | profileSQL=true&useFastDateParsing=false |
 |                             |         | *Optional*.                                                           |                                          |
-+-----------------------------+---------------------------------------------------------------------------------+------------------------------------------+
++-----------------------------+---------+-----------------------------------------------------------------------+------------------------------------------+
 
 **FROM Job Configuration**
 ++++++++++++++++++++++++++
@@ -87,7 +87,7 @@ Inputs associated with the Job configuration for the FROM direction include:
 +-----------------------------+---------+-------------------------------------------------------------------------+---------------------------------------------+
 | Boundary query              | String  | The query used to define an upper and lower boundary when partitioning. |                                             |
 |                             |         | *Optional*.                                                             |                                             |
-+-----------------------------+-----------------------------------------------------------------------------------+---------------------------------------------+
++-----------------------------+---------+-------------------------------------------------------------------------+---------------------------------------------+
 
 **Notes**
 =========
@@ -121,7 +121,7 @@ Inputs associated with the Job configuration for the TO direction include:
 +-----------------------------+---------+-------------------------------------------------------------------------+-------------------------------------------------+
 | Should clear stage table    | Boolean | True or false depending on whether the staging table should be cleared  | true                                            |
 |                             |         | after the data transfer has finished. *Optional*.                       |                                                 |
-+-----------------------------+-----------------------------------------------------------------------------------+-------------------------------------------------+
++-----------------------------+---------+-------------------------------------------------------------------------+-------------------------------------------------+
 
 **Notes**
 =========
@@ -198,3 +198,145 @@ The Generic JDBC Connector performs two operations in the destroyer in the TO di
 2. Clear the staging table.
 
 No operations are performed in the FROM direction.
+
+
+++++++++++++++
+HDFS Connector
+++++++++++++++
+
+-----
+Usage
+-----
+
+To use the HDFS Connector, create a link for the connector and a job that uses the link.
+
+**Link Configuration**
+++++++++++++++++++++++
+
+Inputs associated with the link configuration include:
+
++-----------------------------+---------+-----------------------------------------------------------------------+----------------------------+
+| Input                       | Type    | Description                                                           | Example                    |
++=============================+=========+=======================================================================+============================+
+| URI                         | String  | The URI of the HDFS File System.                                      | hdfs://example.com:8020/   |
+|                             |         | *Optional*. See note below.                                           |                            |
++-----------------------------+---------+-----------------------------------------------------------------------+----------------------------+
+| Configuration directory     | String  | Path to the cluster's configuration directory.                        | /etc/conf/hadoop           |
+|                             |         | *Optional*.                                                           |                            |
++-----------------------------+---------+-----------------------------------------------------------------------+----------------------------+
+
+**Notes**
+=========
+
+1. The specified URI will override the URI declared in the Hadoop configuration.
+
+**FROM Job Configuration**
+++++++++++++++++++++++++++
+
+Inputs associated with the Job configuration for the FROM direction include:
+
++-----------------------------+---------+-------------------------------------------------------------------------+------------------+
+| Input                       | Type    | Description                                                             | Example          |
++=============================+=========+=========================================================================+==================+
+| Input directory             | String  | The location in HDFS that the connector should look for files in.       | /tmp/sqoop2/hdfs |
+|                             |         | *Required*. See note below.                                             |                  |
++-----------------------------+---------+-------------------------------------------------------------------------+------------------+
+| Null value                  | String  | The value of NULL in the contents of each file extracted.               | \N               |
+|                             |         | *Optional*. See note below.                                             |                  |
++-----------------------------+---------+-------------------------------------------------------------------------+------------------+
+| Override null value         | Boolean | Tells the connector to replace the specified NULL value.                | true             |
+|                             |         | *Optional*. See note below.                                             |                  |
++-----------------------------+---------+-------------------------------------------------------------------------+------------------+
+
+**Notes**
+=========
+
+1. All files in *Input directory* will be extracted.
+2. *Null value* and *override null value* should be used in conjunction. If *override null value* is not set to true, then *null value* will not be used when extracting data.
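The interaction described in note 2 can be sketched as follows (a minimal illustration, not connector code; ``parse_field`` is a hypothetical helper):

```python
# Sketch of how "null value" and "override null value" interact during
# extraction. parse_field is an illustrative helper, not connector code.
def parse_field(field, null_value=r"\N", override_null_value=False):
    # The configured null value takes effect only when the override flag is true.
    if override_null_value and field == null_value:
        return None
    return field
```

With the override flag off, a field equal to the configured null value passes through unchanged rather than being treated as NULL.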
+
+**TO Job Configuration**
+++++++++++++++++++++++++
+
+Inputs associated with the Job configuration for the TO direction include:
+
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Input                       | Type    | Description                                                             | Example                           |
++=============================+=========+=========================================================================+===================================+
+| Output directory            | String  | The location in HDFS that the connector will load files to.             | /tmp/sqoop2/hdfs                  |
+|                             |         | *Optional*.                                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Output format               | Enum    | The format to output data to.                                           | CSV                               |
+|                             |         | *Optional*. See note below.                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Compression                 | Enum    | Compression class.                                                      | GZIP                              |
+|                             |         | *Optional*. See note below.                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Custom compression          | String  | Custom compression class.                                               | org.apache.sqoop.SqoopCompression |
+|                             |         | *Optional*.                                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Null value                  | String  | The value of NULL in the contents of each file loaded.                  | \N                                |
+|                             |         | *Optional*. See note below.                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Override null value         | Boolean | Tells the connector to replace the specified NULL value.                | true                              |
+|                             |         | *Optional*. See note below.                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+| Append mode                 | Boolean | Append to an existing output directory.                                 | true                              |
+|                             |         | *Optional*.                                                             |                                   |
++-----------------------------+---------+-------------------------------------------------------------------------+-----------------------------------+
+
+**Notes**
+=========
+
+1. *Output format* only supports CSV at the moment.
+2. *Compression* supports all Hadoop compression classes.
+3. *Null value* and *override null value* should be used in conjunction. If *override null value* is not set to true, then *null value* will not be used when loading data.
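As a rough illustration of the GZIP option's effect (using Python's ``gzip`` module rather than a Hadoop codec class; names here are illustrative only), compressed CSV output with the configured null value substituted might be produced like this:

```python
import csv
import gzip
import io

def write_csv_gzip(rows, null_value=r"\N"):
    """Write rows as gzip-compressed CSV bytes, substituting null_value for None."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([null_value if v is None else v for v in row])
    return buf.getvalue()

data = write_csv_gzip([(1, None), (2, "value")])
```

Decompressing ``data`` yields ordinary CSV text with ``\N`` in place of each NULL.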
+
+-----------
+Partitioner
+-----------
+
+The HDFS Connector partitioner creates partitions based on the total number of blocks across all files in the specified input directory.
+The connector attempts to place blocks into splits based on the *node* and *rack* on which they reside, so that extraction can take advantage of data locality.
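The locality-aware grouping described above can be sketched like this (hypothetical data structures, not the connector's internals):

```python
from collections import defaultdict

def group_blocks_by_host(block_locations):
    """Group (path, block_offset, host) tuples into per-host splits.

    Illustrative only: the real partitioner works from HDFS block metadata
    and also considers rack placement.
    """
    splits = defaultdict(list)
    for path, offset, host in block_locations:
        splits[host].append((path, offset))
    return dict(splits)
```

Each resulting split can then be handed to an extractor running close to the data it reads.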
+
+---------
+Extractor
+---------
+
+During the *extraction* phase, the FileSystem API is used to query files from HDFS. The HDFS cluster used is the one defined by:
+
+1. The HDFS URI in the link configuration
+2. The Hadoop configuration in the link configuration
+3. The Hadoop configuration used by the execution framework
+
+The format of the data must be CSV. The NULL value in the CSV can be chosen via *null value*. For example::
+
+    1,\N
+    2,null
+    3,NULL
+
+In the above example, if *null value* is set to \N, then only the value in the first row will be interpreted as NULL.
+
+------
+Loader
+------
+
+During the *loading* phase, data is written to HDFS via the FileSystem API. The number of files created is equal to the number of loads that run. Currently, the data can only be written in CSV format. The NULL value in the CSV can be chosen via *null value*. For example:
+
++--------------+-------+
+| Id           | Value |
++==============+=======+
+| 1            | NULL  |
++--------------+-------+
+| 2            | value |
++--------------+-------+
+
+If *null value* is set to \N, then here's how the data will look in HDFS::
+
+    1,\N
+    2,value
+
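The substitution shown above can be sketched as follows (illustrative only; ``to_csv_line`` is a hypothetical helper, not connector code):

```python
# Sketch of null-value substitution when loading rows to HDFS as CSV lines.
def to_csv_line(row, null_value=r"\N"):
    # Replace None (the NULL marker here) with the configured null value.
    return ",".join(null_value if v is None else str(v) for v in row)

lines = [to_csv_line(r) for r in [(1, None), (2, "value")]]
```

This reproduces the two output lines shown above for the example table.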
+----------
+Destroyers
+----------
+
+The HDFS TO destroyer moves all created files to the proper output directory.