Posted to commits@sqoop.apache.org by bo...@apache.org on 2018/07/16 14:25:19 UTC
sqoop git commit: SQOOP-3338: Document Parquet support
Repository: sqoop
Updated Branches:
refs/heads/trunk e63905325 -> 17461e91d
SQOOP-3338: Document Parquet support
(Szabolcs Vasas via Boglarka Egyed)
Project: http://git-wip-us.apache.org/repos/asf/sqoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/sqoop/commit/17461e91
Tree: http://git-wip-us.apache.org/repos/asf/sqoop/tree/17461e91
Diff: http://git-wip-us.apache.org/repos/asf/sqoop/diff/17461e91
Branch: refs/heads/trunk
Commit: 17461e91db01bf67663caf0fb35e8920128c1aba
Parents: e639053
Author: Boglarka Egyed <bo...@apache.org>
Authored: Mon Jul 16 16:23:56 2018 +0200
Committer: Boglarka Egyed <bo...@apache.org>
Committed: Mon Jul 16 16:23:56 2018 +0200
----------------------------------------------------------------------
src/docs/user/hive-args.txt | 3 +-
src/docs/user/hive-notes.txt | 4 +-
src/docs/user/import.txt | 125 +++++++++++++++++++++++++-------------
3 files changed, 88 insertions(+), 44 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/sqoop/blob/17461e91/src/docs/user/hive-args.txt
----------------------------------------------------------------------
diff --git a/src/docs/user/hive-args.txt b/src/docs/user/hive-args.txt
index 8af9a1c..4edf338 100644
--- a/src/docs/user/hive-args.txt
+++ b/src/docs/user/hive-args.txt
@@ -42,7 +42,8 @@ Argument Description
+\--map-column-hive <map>+ Override default mapping from SQL type to\
Hive type for configured columns. If specify commas in\
this argument, use URL encoded keys and values, for example,\
- use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
+ use DECIMAL(1%2C%201) instead of DECIMAL(1, 1). Note that when using the Parquet file format, users have\
+ to use +\--map-column-java+ instead of this option.
+\--hs2-url+ The JDBC connection string to HiveServer2 as you would specify in Beeline. If you use this option with \
--hive-import then Sqoop will try to connect to HiveServer2 instead of using Hive CLI.
+\--hs2-user+ The user for creating the JDBC connection to HiveServer2. The default is the current OS user.
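For example, a Parquet import that overrides the generated Java type for a single column might look like the following sketch (the SALARY column and the String target type are illustrative; only the +\--map-column-java+ option itself is documented above):

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile \
    --map-column-java SALARY=String
----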
http://git-wip-us.apache.org/repos/asf/sqoop/blob/17461e91/src/docs/user/hive-notes.txt
----------------------------------------------------------------------
diff --git a/src/docs/user/hive-notes.txt b/src/docs/user/hive-notes.txt
index deee270..af97d94 100644
--- a/src/docs/user/hive-notes.txt
+++ b/src/docs/user/hive-notes.txt
@@ -32,7 +32,7 @@ informing you of the loss of precision.
Parquet Support in Hive
~~~~~~~~~~~~~~~~~~~~~~~
-In order to contact the Hive MetaStore from a MapReduce job, a delegation token will
-be fetched and passed. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add
+When using the Kite Dataset API based Parquet implementation, a delegation token will be fetched and passed
+in order to contact the Hive MetaStore from a MapReduce job. HIVE_CONF_DIR and HIVE_HOME must be set appropriately to add
Hive to the runtime classpath. Otherwise, importing/exporting into Hive in Parquet
format may not work.
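A minimal sketch of preparing the environment for a Kite based Parquet import into Hive (the installation paths are illustrative and depend on your distribution):

----
$ export HIVE_HOME=/usr/lib/hive
$ export HIVE_CONF_DIR=$HIVE_HOME/conf
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
    --hive-import --as-parquetfile
----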
http://git-wip-us.apache.org/repos/asf/sqoop/blob/17461e91/src/docs/user/import.txt
----------------------------------------------------------------------
diff --git a/src/docs/user/import.txt b/src/docs/user/import.txt
index 2d074f4..a2c16d9 100644
--- a/src/docs/user/import.txt
+++ b/src/docs/user/import.txt
@@ -51,46 +51,47 @@ include::validation-args.txt[]
.Import control arguments:
[grid="all"]
-`---------------------------------`--------------------------------------
-Argument Description
+`-------------------------------------------`----------------------------
+Argument Description
-------------------------------------------------------------------------
-+\--append+ Append data to an existing dataset\
- in HDFS
-+\--as-avrodatafile+ Imports data to Avro Data Files
-+\--as-sequencefile+ Imports data to SequenceFiles
-+\--as-textfile+ Imports data as plain text (default)
-+\--as-parquetfile+ Imports data to Parquet Files
-+\--boundary-query <statement>+ Boundary query to use for creating splits
-+\--columns <col,col,col...>+ Columns to import from table
-+\--delete-target-dir+ Delete the import target directory\
- if it exists
-+\--direct+ Use direct connector if exists for the database
-+\--fetch-size <n>+ Number of entries to read from database\
- at once.
-+\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
-+-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
-+-e,\--query <statement>+ Import the results of '+statement+'.
-+\--split-by <column-name>+ Column of the table used to split work\
- units. Cannot be used with\
- +--autoreset-to-one-mapper+ option.
-+\--split-limit <n>+ Upper Limit for each split size.\
- This only applies to Integer and Date columns.\
- For date or timestamp fields it is calculated in seconds.
-+\--autoreset-to-one-mapper+ Import should use one mapper if a table\
- has no primary key and no split-by column\
- is provided. Cannot be used with\
- +--split-by <col>+ option.
-+\--table <table-name>+ Table to read
-+\--target-dir <dir>+ HDFS destination dir
-+\--temporary-rootdir <dir>+ HDFS directory for temporary files created during import (overrides default "_sqoop")
-+\--warehouse-dir <dir>+ HDFS parent for table destination
-+\--where <where clause>+ WHERE clause to use during import
-+-z,\--compress+ Enable compression
-+\--compression-codec <c>+ Use Hadoop codec (default gzip)
-+--null-string <null-string>+ The string to be written for a null\
- value for string columns
-+--null-non-string <null-string>+ The string to be written for a null\
- value for non-string columns
++\--append+ Append data to an existing dataset\
+ in HDFS
++\--as-avrodatafile+ Imports data to Avro Data Files
++\--as-sequencefile+ Imports data to SequenceFiles
++\--as-textfile+ Imports data as plain text (default)
++\--as-parquetfile+ Imports data to Parquet Files
++\--parquet-configurator-implementation+ Sets the implementation used during Parquet import. Supported values: kite, hadoop.
++\--boundary-query <statement>+ Boundary query to use for creating splits
++\--columns <col,col,col...>+ Columns to import from table
++\--delete-target-dir+ Delete the import target directory\
+ if it exists
++\--direct+ Use direct connector if exists for the database
++\--fetch-size <n>+ Number of entries to read from database\
+ at once.
++\--inline-lob-limit <n>+ Set the maximum size for an inline LOB
++-m,\--num-mappers <n>+ Use 'n' map tasks to import in parallel
++-e,\--query <statement>+ Import the results of '+statement+'.
++\--split-by <column-name>+ Column of the table used to split work\
+ units. Cannot be used with\
+ +--autoreset-to-one-mapper+ option.
++\--split-limit <n>+ Upper Limit for each split size.\
+ This only applies to Integer and Date columns.\
+ For date or timestamp fields it is calculated in seconds.
++\--autoreset-to-one-mapper+ Import should use one mapper if a table\
+ has no primary key and no split-by column\
+ is provided. Cannot be used with\
+ +--split-by <col>+ option.
++\--table <table-name>+ Table to read
++\--target-dir <dir>+ HDFS destination dir
++\--temporary-rootdir <dir>+ HDFS directory for temporary files created during import (overrides default "_sqoop")
++\--warehouse-dir <dir>+ HDFS parent for table destination
++\--where <where clause>+ WHERE clause to use during import
++-z,\--compress+ Enable compression
++\--compression-codec <c>+ Use Hadoop codec (default gzip)
++--null-string <null-string>+ The string to be written for a null\
+ value for string columns
++--null-non-string <null-string>+ The string to be written for a null\
+ value for non-string columns
-------------------------------------------------------------------------
The +\--null-string+ and +\--null-non-string+ arguments are optional.\
@@ -402,8 +403,8 @@ saved jobs later in this document for more information.
File Formats
^^^^^^^^^^^^
-You can import data in one of two file formats: delimited text or
-SequenceFiles.
+You can import data in one of the following file formats: delimited text,
+SequenceFiles, Avro, and Parquet.
Delimited text is the default import format. You can also specify it
explicitly by using the +\--as-textfile+ argument. This argument will write
@@ -444,6 +445,48 @@ argument, or specify any Hadoop compression codec using the
+\--compression-codec+ argument. This applies to SequenceFile, text,
and Avro files.
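For example, a compressed Avro import might look like the following (a sketch; the fully qualified codec class shown is the standard Hadoop snappy codec):

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-avrodatafile \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
----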
+Parquet support
++++++++++++++++
+
+Sqoop has two different implementations for importing data in Parquet format:
+
+- Kite Dataset API based implementation (default, legacy)
+- Parquet Hadoop API based implementation (recommended)
+
+Users can specify the desired implementation with the +\--parquet-configurator-implementation+ option:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation kite
+----
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --parquet-configurator-implementation hadoop
+----
+
+If the +\--parquet-configurator-implementation+ option is not present, Sqoop will check the value of the +parquetjob.configurator.implementation+
+property (which can be specified using -D in the Sqoop command or in the site.xml). If that value is also absent, Sqoop will
+default to the Kite Dataset API based implementation.
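For example, the implementation can be selected via the Hadoop property instead of the dedicated option (a sketch; note that the -D generic argument has to precede the tool-specific arguments):

----
$ sqoop import -D parquetjob.configurator.implementation=hadoop \
    --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile
----

The same property can be set once for all jobs in the site.xml:

----
<property>
  <name>parquetjob.configurator.implementation</name>
  <value>hadoop</value>
</property>
----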
+
+The Kite Dataset API based implementation executes the import command on a different code
+path than the text import: it connects to the Hive metastore and creates the Hive table based on the generated Avro schema.
+This can be a disadvantage, since moving from the text file format to the Parquet file format can lead to
+unexpected behavioral changes. Kite checks the Hive table schema before importing the data, so if the data
+to be imported has a schema incompatible with the Hive table's schema, Sqoop will throw an error. This implementation
+uses the snappy codec for compression by default and, apart from this, supports the bzip codec too.
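A Kite based Hive import could therefore look like the following sketch (no codec option is given, since snappy is the Kite default):

----
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile \
    --hive-import --parquet-configurator-implementation kite
----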
+
+The Parquet Hadoop API based implementation builds the Hive CREATE TABLE statement and executes the
+LOAD DATA INPATH command, just like the text import does. Unlike Kite, it also supports connecting to HiveServer2 (using the +\--hs2-url+ option),
+so it provides better security features. This implementation does not check the Hive table's schema before importing, so
+it is possible that a user successfully imports data into Hive but then gets an error during a later Hive read operation.
+It does not use any compression by default but supports the snappy and bzip codecs.
+
+The example below demonstrates how to use Sqoop to import into Hive in Parquet format using HiveServer2 and the snappy codec:
+
+----
+$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --as-parquetfile --compression-codec snappy \
+--parquet-configurator-implementation hadoop --hs2-url "jdbc:hive2://hs2.foo.com:10000" --hs2-keytab "/path/to/keytab"
+----
+
Enabling Logical Types in Avro and Parquet import for numbers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^