Posted to mapreduce-commits@hadoop.apache.org by cd...@apache.org on 2009/11/03 10:35:13 UTC

svn commit: r832362 - in /hadoop/mapreduce/trunk: CHANGES.txt src/contrib/sqoop/doc/SqoopUserGuide.txt src/contrib/sqoop/doc/api-reference.txt

Author: cdouglas
Date: Tue Nov  3 09:35:12 2009
New Revision: 832362

URL: http://svn.apache.org/viewvc?rev=832362&view=rev
Log:
MAPREDUCE-1036. Document Sqoop API. Contributed by Aaron Kimball

Added:
    hadoop/mapreduce/trunk/src/contrib/sqoop/doc/api-reference.txt
Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/contrib/sqoop/doc/SqoopUserGuide.txt

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=832362&r1=832361&r2=832362&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Tue Nov  3 09:35:12 2009
@@ -31,6 +31,8 @@
     MAPREDUCE-1069. Implement Sqoop API refactoring. (Aaron Kimball via
     tomwhite)
 
+    MAPREDUCE-1036. Document Sqoop API. (Aaron Kimball via cdouglas)
+
   OPTIMIZATIONS
 
     MAPREDUCE-270. Fix the tasktracker to optionally send an out-of-band

Modified: hadoop/mapreduce/trunk/src/contrib/sqoop/doc/SqoopUserGuide.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/sqoop/doc/SqoopUserGuide.txt?rev=832362&r1=832361&r2=832362&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/contrib/sqoop/doc/SqoopUserGuide.txt (original)
+++ hadoop/mapreduce/trunk/src/contrib/sqoop/doc/SqoopUserGuide.txt Tue Nov  3 09:35:12 2009
@@ -61,3 +61,5 @@
 
 include::supported-dbs.txt[]
 
+include::api-reference.txt[]
+

Added: hadoop/mapreduce/trunk/src/contrib/sqoop/doc/api-reference.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/sqoop/doc/api-reference.txt?rev=832362&view=auto
==============================================================================
--- hadoop/mapreduce/trunk/src/contrib/sqoop/doc/api-reference.txt (added)
+++ hadoop/mapreduce/trunk/src/contrib/sqoop/doc/api-reference.txt Tue Nov  3 09:35:12 2009
@@ -0,0 +1,243 @@
+
+////
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+////
+
+Developer API Reference
+-----------------------
+
+This section specifies the APIs available both to application writers
+integrating with Sqoop and to those modifying Sqoop. The next three subsections
+are written from three perspectives: those using the classes generated by
+Sqoop and its public library; those writing Sqoop extensions (i.e., additional
++ConnManager+ implementations that interact with more databases); and those
+modifying Sqoop's internals. Each subsection describes the system in
+successively greater depth.
+
+
+The External API
+~~~~~~~~~~~~~~~~
+
+Sqoop auto-generates classes that represent the tables imported into HDFS. The
+class contains member fields for each column of the imported table; an instance
+of the class holds one row of the table. The generated classes implement the
+serialization APIs used in Hadoop, namely the _Writable_ and _DBWritable_
+interfaces.  They also contain other convenience methods: a +parse()+ method
+that interprets delimited text fields, and a +toString()+ method that preserves
+the user's chosen delimiters. The full set of methods guaranteed to exist in an
+auto-generated class is specified in the interface
++org.apache.hadoop.sqoop.lib.SqoopRecord+.
+
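+The shape of such a generated class can be sketched as follows. This is an
+illustration only: the table name +employees+, its columns, and the
++parse(CharSequence)+ signature are assumptions, and the authoritative method
+set is the one defined by +org.apache.hadoop.sqoop.lib.SqoopRecord+ (which real
+generated classes implement, but which is omitted from this sketch).
+
+----
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.sql.PreparedStatement;
+import java.sql.ResultSet;
+import java.sql.SQLException;
+
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapreduce.lib.db.DBWritable;
+
+// Hypothetical auto-generated class for a table named "employees" with two
+// columns, ID (INTEGER) and NAME (VARCHAR).
+public class Employees implements Writable, DBWritable {
+  private int id;
+  private String name;
+
+  // Writable: Hadoop serialization of one row.
+  public void readFields(DataInput in) throws IOException {
+    this.id = in.readInt();
+    this.name = Text.readString(in);
+  }
+
+  public void write(DataOutput out) throws IOException {
+    out.writeInt(this.id);
+    Text.writeString(out, this.name);
+  }
+
+  // DBWritable: marshaling to and from JDBC objects.
+  public void readFields(ResultSet rs) throws SQLException {
+    this.id = rs.getInt(1);
+    this.name = rs.getString(2);
+  }
+
+  public void write(PreparedStatement ps) throws SQLException {
+    ps.setInt(1, this.id);
+    ps.setString(2, this.name);
+  }
+
+  // Interprets one line of delimited text (assumed signature).
+  public void parse(CharSequence line) {
+    // split on the user's field delimiter, strip quoting, assign each field
+  }
+
+  // Re-emits the row using the user's chosen delimiters.
+  public String toString() {
+    return this.id + "," + this.name;
+  }
+}
+----
+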
+Instances of _SqoopRecord_ may depend on Sqoop's public API, which comprises
+all classes in the +org.apache.hadoop.sqoop.lib+ package. These are briefly
+described below.
+Clients of Sqoop should not need to directly interact with any of these classes,
+although classes generated by Sqoop will depend on them. Therefore, these APIs
+are considered public and care will be taken when forward-evolving them.
+
+* The +RecordParser+ class will parse a line of text into a list of fields,
+  using controllable delimiters and quote characters (see the sketch after
+  this list).
+* The static +FieldFormatter+ class provides a method that handles quoting and
+  escaping of characters in a field; it is used in +SqoopRecord.toString()+
+  implementations.
+* Marshaling data between _ResultSet_ and _PreparedStatement_ objects and
+  _SqoopRecords_ is done via +JdbcWritableBridge+.
+* +BigDecimalSerializer+ contains a pair of methods that facilitate
+  serialization of +BigDecimal+ objects over the _Writable_ interface.
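+
+As an illustration of how the first of these classes might be used (the
+constructor arguments and the +parseRecord()+ method name shown here are
+assumptions; consult the class itself for the authoritative signatures):
+
+----
+import java.util.List;
+
+import org.apache.hadoop.sqoop.lib.RecordParser;
+
+public class RecordParserExample {
+  public static void main(String[] args) throws Exception {
+    // Assumed constructor order: field delimiter, record delimiter,
+    // enclosing ("quote") character, escape character, and whether every
+    // field must be enclosed.
+    RecordParser parser = new RecordParser(',', '\n', '"', '\\', false);
+
+    // Assumed method; splits one delimited line into its fields, honoring
+    // the quote and escape characters configured above.
+    List<String> fields = parser.parseRecord("1,\"Smith, Jane\",Engineering\n");
+
+    for (String field : fields) {
+      System.out.println(field);
+    }
+  }
+}
+----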
+
+The Extension API
+~~~~~~~~~~~~~~~~~
+
+This section covers the API and primary classes used by Sqoop extensions that
+allow Sqoop to interface with more database vendors.
+
+While Sqoop uses JDBC and +DBInputFormat+ (and +DataDrivenDBInputFormat+) to
+read from databases, differences in the SQL supported by different vendors, as
+well as in JDBC metadata, necessitate vendor-specific codepaths for most
+databases. Sqoop's solution to this problem is the ConnManager API
+(+org.apache.hadoop.sqoop.manager.ConnManager+).
+
++ConnManager+ is an abstract class defining all methods that interact with the
+database itself. Most implementations of +ConnManager+ will extend the
++org.apache.hadoop.sqoop.manager.SqlManager+ abstract class, which uses standard
+SQL to perform most actions. Subclasses are required to implement the
++getConnection()+ method which returns the actual JDBC connection to the
+database. Subclasses are free to override all other methods as well. The
++SqlManager+ class itself exposes a protected API that allows developers to
+selectively override behavior. For example, the +getColNamesQuery()+ method
+allows the SQL query used by +getColNames()+ to be modified without needing to
+rewrite the majority of +getColNames()+.
+
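+A minimal sketch of a vendor-specific manager, under the assumptions spelled
+out in the comments (the superclass constructor, the +getConnectString()+
+accessor, the exact +getColNamesQuery()+ parameter list, and the exception
+declarations are not confirmed here), might look like this:
+
+----
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.SQLException;
+
+import org.apache.hadoop.sqoop.ImportOptions;
+import org.apache.hadoop.sqoop.manager.SqlManager;
+
+// Hypothetical manager for a vendor called "ExampleDB".
+public class ExampleDbManager extends SqlManager {
+  private final ImportOptions options;
+
+  public ExampleDbManager(final ImportOptions opts) {
+    super(opts);  // assumed superclass constructor
+    this.options = opts;
+  }
+
+  // Required: hand back a live JDBC connection to the database.
+  @Override
+  public Connection getConnection() throws SQLException {
+    // getConnectString() is an assumed accessor on ImportOptions.
+    return DriverManager.getConnection(options.getConnectString());
+  }
+
+  // Optional: adjust the query used by getColNames() without rewriting it.
+  @Override
+  protected String getColNamesQuery(String tableName) {
+    return "SELECT * FROM " + tableName + " WHERE 1 = 0";
+  }
+}
+----
+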
++ConnManager+ implementations receive much of their configuration data from a
+Sqoop-specific class, +ImportOptions+. While +ImportOptions+ does not currently
+contain many setter methods, clients should not assume +ImportOptions+ is
+immutable. More setter methods may be added in the future.  +ImportOptions+ does
+not directly store specific per-manager options. Instead, it contains a
+reference to the +Configuration+ returned by +Tool.getConf()+ after parsing
+command-line arguments with the +GenericOptionsParser+. This allows extension
+arguments via "+-D any.specific.param=any.value+" without requiring any layering
+of options parsing or modification of +ImportOptions+.
+
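+For example, a hypothetical extension parameter passed as
++-D example.manager.fetch.size=5000+ could be read back out of the parsed
+configuration roughly as follows (the +getConf()+ accessor on +ImportOptions+
+is an assumption):
+
+----
+import org.apache.hadoop.conf.Configuration;
+
+import org.apache.hadoop.sqoop.ImportOptions;
+
+public class ExtensionConfigExample {
+  // "example.manager.fetch.size" is a made-up extension parameter.
+  public static int getFetchSize(ImportOptions options) {
+    // The Configuration already reflects any -D arguments handled by the
+    // GenericOptionsParser before Sqoop's own option parsing ran.
+    Configuration conf = options.getConf();  // assumed accessor name
+    return conf.getInt("example.manager.fetch.size", 1000);
+  }
+}
+----
+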
+All existing +ConnManager+ implementations are stateless. Thus, the system that
+instantiates +ConnManagers+ may create multiple instances of the same
++ConnManager+ class over Sqoop's lifetime. If a caching layer is required, one
+can be added later, but it is not currently available.
+
++ConnManagers+ are currently created by instances of the abstract class
++ManagerFactory+ (see MAPREDUCE-750). One +ManagerFactory+ implementation
+currently serves all of Sqoop:
++org.apache.hadoop.sqoop.manager.DefaultManagerFactory+. Extensions should not
+modify +DefaultManagerFactory+. Instead, an extension-specific +ManagerFactory+
+implementation should be provided with the new +ConnManager+.
++ManagerFactory+ has a single method of note, named +accept()+. This method will
+determine whether it can instantiate a +ConnManager+ for the user's
++ImportOptions+. If so, it returns the +ConnManager+ instance. Otherwise, it
+returns +null+.
+
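+A skeletal extension factory, paired with the hypothetical +ExampleDbManager+
+sketched earlier, might look like the following (the +accept()+ parameter list
+and the +getConnectString()+ accessor are assumptions based on the description
+above):
+
+----
+import org.apache.hadoop.sqoop.ImportOptions;
+import org.apache.hadoop.sqoop.manager.ConnManager;
+import org.apache.hadoop.sqoop.manager.ManagerFactory;
+
+// Hypothetical factory shipped alongside ExampleDbManager.
+public class ExampleDbManagerFactory extends ManagerFactory {
+
+  // Return a ConnManager if this factory recognizes the connect string;
+  // return null so that another registered factory can be tried instead.
+  @Override
+  public ConnManager accept(ImportOptions options) {
+    String connectStr = options.getConnectString();  // assumed accessor
+    if (connectStr != null && connectStr.startsWith("jdbc:exampledb:")) {
+      return new ExampleDbManager(options);
+    }
+    return null;
+  }
+}
+----
+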
+The +ManagerFactory+ implementations used are governed by the
++sqoop.connection.factories+ setting in sqoop-site.xml. Users of extension
+libraries can install the third-party library containing a new +ManagerFactory+
+and +ConnManager+(s), and configure sqoop-site.xml to use the new
++ManagerFactory+.  The +DefaultManagerFactory+ principally discriminates between
+databases by parsing the connect string stored in +ImportOptions+.
+
+Extension authors may make use of classes in the +org.apache.hadoop.sqoop.io+,
++mapred+, +mapreduce+, and +util+ packages to facilitate their implementations.
+These packages and classes are described in more detail in the following
+section.
+
+
+Sqoop Internals
+~~~~~~~~~~~~~~~
+
+This section describes the internal architecture of Sqoop.
+
+The Sqoop program is driven by the +org.apache.hadoop.sqoop.Sqoop+ main class.
+A limited number of additional classes are in the same package: +ImportOptions+
+(described earlier) and +ConnFactory+ (which manipulates +ManagerFactory+
+instances).
+
+General program flow
+^^^^^^^^^^^^^^^^^^^^
+
+The general program flow is as follows:
+
++org.apache.hadoop.sqoop.Sqoop+ is the main class and implements _Tool_. A new
+instance is launched with +ToolRunner+. It parses its arguments using the
++ImportOptions+ class.  Within the +ImportOptions+, an +ImportAction+ will be
+chosen by the user. This may be to import all tables, to import one specific
+table, to execute a SQL statement, or another action.
+
+A +ConnManager+ is then instantiated based on the data in the +ImportOptions+.
+The +ConnFactory+ is used to get a +ConnManager+ from a +ManagerFactory+; the
+mechanics of this were described in an earlier section.
+
+Then, in the +run()+ method, Sqoop uses a case statement over the
++ImportAction+ enum to determine which actions the user needs performed.
+Usually this involves determining a list of tables to import, generating user
+code for them, and running a MapReduce job per table to read the data.  The
+import itself does not specifically need to be run via a MapReduce job; the
++ConnManager.importTable()+ method is left to determine how best to run the
+import. Each of these actions is controlled by the +ConnManager+, except for
+the generation of code, which is done by the +CompilationManager+ and
++ClassWriter+ (both in the +org.apache.hadoop.sqoop.orm+ package). Importing
+into Hive is also taken care of via the +org.apache.hadoop.sqoop.hive.HiveImport+
+class after +importTable()+ has completed; this is done without concern for the
++ConnManager+ implementation used.
+
+A +ConnManager+'s +importTable()+ method receives a single argument of type
++ImportJobContext+ which contains parameters to the method. This class may be
+extended with additional parameters in the future, which optionally further
+direct the import operation. Similarly, the +exportTable()+ method receives an
+argument of type +ExportJobContext+. These classes contain the name of the table
+to import/export, a reference to the +ImportOptions+ object, and other related
+data.
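+
+Continuing the hypothetical +ExampleDbManager+ from the previous section, an
+override might look roughly like the following (the package of
++ImportJobContext+, the exception declaration, and the context accessor names
+are assumptions):
+
+----
+import java.io.IOException;
+
+import org.apache.hadoop.sqoop.ImportOptions;
+import org.apache.hadoop.sqoop.manager.ImportJobContext;
+
+// Hypothetical manager that performs a vendor-specific direct import.
+public class ExampleDbDirectManager extends ExampleDbManager {
+
+  public ExampleDbDirectManager(final ImportOptions opts) {
+    super(opts);
+  }
+
+  // The context carries the table name, the ImportOptions, and other related
+  // data; the accessor names used here are assumptions.
+  @Override
+  public void importTable(ImportJobContext context) throws IOException {
+    String table = context.getTableName();       // assumed accessor
+    ImportOptions opts = context.getOptions();   // assumed accessor
+    // importTable() is free to choose how to move the data; it does not have
+    // to launch a MapReduce job. A vendor-specific bulk tool could be invoked
+    // here instead (details omitted).
+  }
+}
+----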
+
+Subpackages
+^^^^^^^^^^^
+
+The following subpackages under +org.apache.hadoop.sqoop+ exist:
+
+* +hive+ - Facilitates importing data to Hive.
+* +io+ - Implementations of +java.io.*+ interfaces (namely, _OutputStream_ and
+  _Writer_).
+* +lib+ - The external public API (described earlier).
+* +manager+ - The +ConnManager+ and +ManagerFactory+ base classes and their
+  implementations.
+* +mapred+ - Classes interfacing with the old (pre-0.20) MapReduce API.
+* +mapreduce+ - Classes interfacing with the new (0.20+) MapReduce API.
+* +orm+ - Code auto-generation.
+* +util+ - Miscellaneous utility classes.
+
+The +io+ package contains _OutputStream_ and _BufferedWriter_ implementations
+used by direct writers to HDFS. The +SplittableBufferedWriter+ presents a
+single _BufferedWriter_ to its client but, under the hood, writes to multiple
+files in series as each reaches a target threshold size. This allows
+unsplittable compression libraries (e.g., gzip) to be used in conjunction with
+Sqoop import while still allowing subsequent MapReduce jobs to use multiple
+input splits per dataset.
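+
+The idea can be illustrated independently of Sqoop's own classes by the toy
+writer below, which rolls over to a new gzip-compressed file once a byte
+threshold is crossed. This is a conceptual sketch of the technique, not the
+actual +SplittableBufferedWriter+.
+
+----
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.OutputStreamWriter;
+import java.io.Writer;
+import java.util.zip.GZIPOutputStream;
+
+// Writes records to part-0.gz, part-1.gz, ... in a local directory, starting
+// a new file at the first record boundary after the threshold is reached, so
+// each compressed file can be processed as an independent input split.
+public class RollingGzipWriter {
+  private final File dir;
+  private final long threshold;
+  private long written = 0;
+  private int fileIndex = 0;
+  private Writer current;
+
+  public RollingGzipWriter(File dir, long thresholdBytes) throws IOException {
+    this.dir = dir;
+    this.threshold = thresholdBytes;
+    openNextFile();
+  }
+
+  private void openNextFile() throws IOException {
+    if (current != null) {
+      current.close();  // finishes the gzip stream for the previous file
+    }
+    File f = new File(dir, "part-" + (fileIndex++) + ".gz");
+    current = new OutputStreamWriter(
+        new GZIPOutputStream(new FileOutputStream(f)), "UTF-8");
+    written = 0;
+  }
+
+  public void write(String record) throws IOException {
+    current.write(record);
+    written += record.length();
+    if (written >= threshold) {
+      openNextFile();
+    }
+  }
+
+  public void close() throws IOException {
+    current.close();
+  }
+}
+----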
+
+Code in the +mapred+ package should be considered deprecated. The +mapreduce+
+package contains +DataDrivenImportJob+, which uses the +DataDrivenDBInputFormat+
+introduced in 0.21. The +mapred+ package contains +ImportJob+, which uses the
+older +DBInputFormat+. Most +ConnManager+ implementations use
++DataDrivenImportJob+; +DataDrivenDBInputFormat+ does not currently work with
+Oracle in all circumstances, so the Oracle manager remains on the old code path.
+
+The +orm+ package contains code used for class generation. It depends on the
+JDK's tools.jar, which provides the +com.sun.tools.javac+ package.
+
+The +util+ package contains various utilities used throughout Sqoop:
+
+* +ClassLoaderStack+ manages a stack of +ClassLoader+ instances used by the
+  current thread. This is principally used to load auto-generated code into the
+  current thread when running MapReduce in local (standalone) mode.
+* +DirectImportUtils+ contains convenience methods used by direct HDFS
+  importers.
+* +Executor+ launches external processes and connects these to stream handlers
+  generated by an +AsyncSink+ (see more detail below).
+* +ExportError+ is thrown by +ConnManagers+ when exports fail.
+* +ImportError+ is thrown by +ConnManagers+ when imports fail.
+* +JdbcUrl+ handles parsing of connect strings, which are URL-like but not
+  specification-conforming. (In particular, JDBC connect strings may have
+  +multi:part:scheme://+ components.)
+* +PerfCounters+ are used to estimate transfer rates for display to the user.
+* +ResultSetPrinter+ will pretty-print a _ResultSet_.
+
+In several places, Sqoop reads the stdout from external processes. The most
+straightforward cases are direct-mode imports as performed by the
++LocalMySQLManager+ and +DirectPostgresqlManager+. After a process is spawned by
++Runtime.exec()+, its stdout (+Process.getInputStream()+) and potentially stderr
+(+Process.getErrorStream()+) must be handled. Failure to read enough data from
+both of these streams will cause the external process to block before writing
+more. Consequently, these must both be handled, and preferably asynchronously.
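+
+The hazard and the remedy can be shown with plain Java, independent of Sqoop's
+own classes: each stream gets its own consumer thread so that neither pipe
+buffer fills up and stalls the child process. (The command run here is only a
+placeholder.)
+
+----
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+
+public class StreamDrainExample {
+
+  // Consume an InputStream to completion on a dedicated thread.
+  static Thread drain(final InputStream in, final String label) {
+    Thread t = new Thread(new Runnable() {
+      public void run() {
+        try {
+          BufferedReader r = new BufferedReader(new InputStreamReader(in));
+          String line;
+          while ((line = r.readLine()) != null) {
+            System.out.println(label + ": " + line);
+          }
+        } catch (IOException ioe) {
+          // the child process has gone away; nothing more to read
+        }
+      }
+    });
+    t.start();
+    return t;
+  }
+
+  public static void main(String[] args) throws Exception {
+    Process p = Runtime.getRuntime().exec(new String[] { "ls", "-l" });
+    // Drain stdout and stderr concurrently so the child never blocks on a
+    // full pipe buffer.
+    Thread out = drain(p.getInputStream(), "stdout");
+    Thread err = drain(p.getErrorStream(), "stderr");
+    p.waitFor();
+    out.join();
+    err.join();
+  }
+}
+----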
+
+In Sqoop parlance, an "async sink" is a thread that takes an +InputStream+ and
+reads it to completion. These are realized by +AsyncSink+ implementations. The
++org.apache.hadoop.sqoop.util.AsyncSink+ abstract class defines the operations
+an implementation must perform. +processStream()+ will spawn another thread to
+immediately begin handling the data read from the +InputStream+ argument; it
+must read this stream to completion. The +join()+ method allows external threads
+to wait until this processing is complete.
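+
+The contract can be sketched as a stand-alone class that mirrors
++processStream()+ and +join()+. This is an illustration of the pattern, not the
+real +AsyncSink+ abstract class, whose exact method signatures may differ.
+
+----
+import java.io.IOException;
+import java.io.InputStream;
+
+// Hypothetical sink that discards its input, in the spirit of NullAsyncSink.
+public class DiscardingSink {
+  private Thread worker;
+
+  // Spawn a thread that reads the stream to completion.
+  public void processStream(final InputStream in) {
+    worker = new Thread(new Runnable() {
+      public void run() {
+        byte[] buf = new byte[4096];
+        try {
+          while (in.read(buf) != -1) {
+            // Swallow the data. A LoggingAsyncSink-style implementation
+            // would log each line instead.
+          }
+        } catch (IOException ioe) {
+          // stream closed; nothing more to consume
+        }
+      }
+    });
+    worker.start();
+  }
+
+  // Block the caller until the stream has been fully consumed.
+  public void join() throws InterruptedException {
+    worker.join();
+  }
+}
+----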
+
+Some "stock" +AsyncSink+ implementations are provided: the +LoggingAsyncSink+ will
+repeat everything on the +InputStream+ as log4j INFO statements. The
++NullAsyncSink+ consumes all its input and does nothing.
+
+The various +ConnManagers+ that make use of external processes have their own
++AsyncSink+ implementations as inner classes, which read from the database tools
+and forward the data along to HDFS, possibly performing formatting conversions
+along the way.
+
+