Posted to commits@tajo.apache.org by hy...@apache.org on 2014/04/09 13:05:04 UTC

git commit: TAJO-736: Add table management documentation. (hyunsik)

Repository: tajo
Updated Branches:
  refs/heads/branch-0.8.0 8de707ef6 -> 50e8c2301


TAJO-736: Add table management documentation. (hyunsik)


Project: http://git-wip-us.apache.org/repos/asf/tajo/repo
Commit: http://git-wip-us.apache.org/repos/asf/tajo/commit/50e8c230
Tree: http://git-wip-us.apache.org/repos/asf/tajo/tree/50e8c230
Diff: http://git-wip-us.apache.org/repos/asf/tajo/diff/50e8c230

Branch: refs/heads/branch-0.8.0
Commit: 50e8c230194ccd2e9e3bb0065770823064dd243c
Parents: 8de707e
Author: Hyunsik Choi <hy...@apache.org>
Authored: Wed Apr 9 20:04:34 2014 +0900
Committer: Hyunsik Choi <hy...@apache.org>
Committed: Wed Apr 9 20:04:34 2014 +0900

----------------------------------------------------------------------
 CHANGES.txt                                     |   2 +
 .../sphinx/partitioning/column_partitioning.rst |  49 ++++++-
 .../partitioning/intro_to_partitioning.rst      |   5 +-
 .../src/main/sphinx/table_management/csv.rst    | 108 +++++++++++++-
 .../main/sphinx/table_management/parquet.rst    |  44 +++++-
 .../src/main/sphinx/table_management/rcfile.rst | 147 ++++++++++++++++++-
 6 files changed, 346 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index 358e145..657d622 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -679,6 +679,8 @@ Release 0.8.0 - unreleased
 
   SUB TASKS
 
+    TAJO-736: Add table management documentation. (hyunsik)
+
     TAJO-602: WorkerResourceManager should be broke down into 3 parts.
     (hyunsik)
 

http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst b/tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
index e88d23f..4b8d6bf 100644
--- a/tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
+++ b/tajo-docs/src/main/sphinx/partitioning/column_partitioning.rst
@@ -2,4 +2,51 @@
 Column Partitioning
 *********************************
 
-.. todo::
\ No newline at end of file
+Column partitioning in Tajo is designed to be compatible with the table partitioning of Apache Hive™.
+
+================================================
+How to Create a Column Partitioned Table
+================================================
+
+You can create a partitioned table by using the ``PARTITION BY`` clause. For a column partitioned table, you should use
+the ``PARTITION BY COLUMN`` clause with partition keys.
+
+For example, assume there is a table ``orders`` composed of the following schema. ::
+
+  id          INT,
+  item_name   TEXT,
+  price       FLOAT
+
+Also, assume that you want to use ``order_date TEXT`` and ``ship_date TEXT`` as the partition keys.
+Then, you should create a table as follows:
+
+.. code-block:: sql
+
+  CREATE TABLE orders (
+    id INT,
+    item_name TEXT,
+    price FLOAT
+  ) PARTITION BY COLUMN (order_date TEXT, ship_date TEXT);
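+
+Since Tajo follows Hive's partition layout, the data of the table above would be laid out in nested
+subdirectories named by key-value pairs, along the lines of the following sketch (the dates are hypothetical)::
+
+  orders/order_date=2014-04-01/ship_date=2014-04-03/
+  orders/order_date=2014-04-01/ship_date=2014-04-04/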
+
+==================================================
+Partition Pruning on Column Partitioned Tables
+==================================================
+
+The following predicates in the ``WHERE`` clause can be used to prune unqualified column partitions during the
+query planning phase, so that the pruned partitions are never scanned (see the example after this list):
+
+* ``=``
+* ``<>``
+* ``>``
+* ``<``
+* ``>=``
+* ``<=``
+* ``LIKE`` predicates with a leading wild-card character
+* ``IN`` list predicates
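+
+For example, given the ``orders`` table above, the following query would scan only the partitions whose
+``order_date`` is ``'2014-04-01'``, skipping all others (the date value is illustrative):
+
+.. code-block:: sql
+
+  SELECT id, item_name, price
+  FROM orders
+  WHERE order_date = '2014-04-01';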
+
+==================================================
+Compatibility Issues with Apache Hive™
+==================================================
+
+If Hive's partitioned tables are created as external tables in Tajo, Tajo can process them directly.
+So far, no compatibility issues have been found.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst b/tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst
index bfb555f..ff3eb98 100644
--- a/tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst
+++ b/tajo-docs/src/main/sphinx/partitioning/intro_to_partitioning.rst
@@ -2,9 +2,8 @@
 Introduction to Partitioning
 **************************************
 
-======================
-Partition Key
-======================
+Table partitioning provides two benefits: easy table management and data pruning by partition keys.
+Currently, Apache Tajo only provides Apache Hive-compatible column partitioning.
 
 =========================
 Partitioning Methods

http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/tajo-docs/src/main/sphinx/table_management/csv.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/csv.rst b/tajo-docs/src/main/sphinx/table_management/csv.rst
index c11a343..2d3ee9e 100644
--- a/tajo-docs/src/main/sphinx/table_management/csv.rst
+++ b/tajo-docs/src/main/sphinx/table_management/csv.rst
@@ -1,6 +1,110 @@
 *************************************
-CSV
+CSV (TextFile)
 *************************************
 
+A character-separated values (CSV) file represents a tabular data set consisting of rows and columns.
+Each row is a plain-text line. A line is usually terminated by a line feed ``\n`` or a carriage return ``\r``;
+the line feed ``\n`` is the default line delimiter in Tajo. Each record consists of multiple fields separated by
+some other character or string, most commonly a literal vertical bar ``|``, comma ``,``, or tab ``\t``.
+The vertical bar is the default field delimiter in Tajo.
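+
+For example, with the default delimiters, a single row of the ``table1`` table created below might be stored as
+the following plain-text line (the values are hypothetical)::
+
+  1|john|90.5|student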
 
-(TODO)
\ No newline at end of file
+=========================================
+How to Create a CSV Table?
+=========================================
+
+If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.
+
+In order to specify a certain file format for your table, you need to use the ``USING`` clause in your ``CREATE TABLE``
+statement. Below is an example statement for creating a table using CSV files.
+
+.. code-block:: sql
+
+ CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+ ) USING CSV;
+
+=========================================
+Physical Properties
+=========================================
+
+Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
+The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
+
+Currently, the CSV storage format provides the following physical properties.
+
+* ``csvfile.delimiter``: delimiter character. ``|`` or ``\u0001`` is usually used, and the default field delimiter is ``|``.
+* ``csvfile.null``: NULL character. The default NULL character is an empty string ``''``. Hive's default NULL character is ``'\\N'``.
+* ``compression.codec``: Compression codec. You can enable compression and specify the algorithm used to compress files. The codec name should be the fully qualified class name of a class implementing `org.apache.hadoop.io.compress.CompressionCodec <https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html>`_. By default, compression is disabled.
+* ``csvfile.serde``: custom (De)serializer class. ``org.apache.tajo.storage.TextSerializerDeserializer`` is the default (De)serializer class.
+
+The following example sets a custom field delimiter, NULL character, and compression codec:
+
+.. code-block:: sql
+
+ CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+ ) USING CSV WITH ('csvfile.delimiter'='\u0001',
+                   'csvfile.null'='\\N',
+                   'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');
+
+.. warning::
+
+  Be careful when using ``\n`` as the field delimiter because CSV uses ``\n`` as the line delimiter.
+  At the moment, Tajo does not provide a way to specify the line delimiter.
+
+=========================================
+Custom (De)serializer
+=========================================
+
+The CSV storage format not only provides reading and writing interfaces for CSV data but also allows users to process
+custom plain-text file formats with user-defined (de)serializer classes.
+For example, with custom (de)serializers, Tajo can process JSON files or any other specialized plain-text file format.
+
+In order to specify a custom (De)serializer, set the physical property ``csvfile.serde``.
+The property value should be a fully qualified class name.
+
+For example:
+
+.. code-block:: sql
+
+ CREATE TABLE table1 (
+  id int,
+  name text,
+  score float,
+  type text
+ ) USING CSV WITH ('csvfile.serde'='org.my.storage.CustomSerializerDeserializer');
+
+
+=========================================
+Null Value Handling Issues
+=========================================
+
+By default, the NULL character in CSV files is an empty string ``''``.
+In other words, an empty field is recognized as a NULL value in Tajo.
+If a field's type is ``TEXT``, however, an empty field is recognized as the string value ``''`` instead of NULL.
+You can also use your own NULL character by specifying the physical property ``csvfile.null``.
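+
+For example, the following sketch makes Tajo treat ``\N`` as the NULL character, matching Hive's default
+(the table and column names are illustrative):
+
+.. code-block:: sql
+
+ CREATE TABLE nullable1 (
+  id int,
+  name text
+ ) USING CSV WITH ('csvfile.null'='\\N');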
+
+=========================================
+Compatibility Issues with Apache Hive™
+=========================================
+
+CSV files generated in Tajo can be processed directly by Apache Hive™ without further processing.
+In this section, we explain some compatibility issues for users who use both Hive and Tajo.
+
+If you set a custom field delimiter, the CSV tables cannot be directly used in Hive.
+In order to specify a custom field delimiter in Hive, you need to use the ``ROW FORMAT DELIMITED FIELDS TERMINATED BY``
+clause in Hive's ``CREATE TABLE`` statement as follows:
+
+.. code-block:: sql
+
+ CREATE TABLE table1 (id int, name string, score float, type string)
+ ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
+ STORED AS TEXTFILE;
+
+To the best of our knowledge, there is no way to specify a custom NULL character in Hive.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/tajo-docs/src/main/sphinx/table_management/parquet.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/parquet.rst b/tajo-docs/src/main/sphinx/table_management/parquet.rst
index a994b7e..2707e3f 100644
--- a/tajo-docs/src/main/sphinx/table_management/parquet.rst
+++ b/tajo-docs/src/main/sphinx/table_management/parquet.rst
@@ -2,5 +2,47 @@
 Parquet
 *************************************
 
+Parquet is a columnar storage format for Hadoop. Parquet is designed to make the advantages of compressed,
+efficient columnar data representation available to any project in the Hadoop ecosystem,
+regardless of the choice of data processing framework, data model, or programming language.
+For more details, please refer to `Parquet File Format <http://parquet.io/>`_.
 
-(TODO)
\ No newline at end of file
+=========================================
+How to Create a Parquet Table?
+=========================================
+
+If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.
+
+In order to specify a certain file format for your table, you need to use the ``USING`` clause in your ``CREATE TABLE``
+statement. Below is an example statement for creating a table using Parquet files.
+
+.. code-block:: sql
+
+  CREATE TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING PARQUET;
+
+=========================================
+Physical Properties
+=========================================
+
+Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
+The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
+
+Currently, the Parquet storage format provides the following physical properties (an example follows the list).
+
+* ``parquet.block.size``: The block size is the size of a row group being buffered in memory. This limits the memory usage when writing. Larger values will improve the I/O when reading but consume more memory when writing. Default size is 134217728 bytes (= 128 * 1024 * 1024).
+* ``parquet.page.size``: The page size is for compression. When reading, each page can be decompressed independently. A block is composed of pages. The page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. Default size is 1048576 bytes (= 1 * 1024 * 1024).
+* ``parquet.compression``: The compression algorithm used to compress pages. It should be one of ``uncompressed``, ``snappy``, ``gzip``, ``lzo``. Default is ``uncompressed``.
+* ``parquet.enable.dictionary``: Enables or disables dictionary encoding. It should be either ``true`` or ``false``. Default is ``true``.
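+
+The following is a sketch of how these properties might be set; the chosen values are illustrative, not recommendations:
+
+.. code-block:: sql
+
+  CREATE TABLE table2 (
+    id int,
+    name text
+  ) USING PARQUET WITH ('parquet.compression'='snappy',
+                        'parquet.page.size'='1048576');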
+
+=========================================
+Compatibility Issues with Apache Hive™
+=========================================
+
+At the moment, Tajo only supports flat relational tables.
+As a result, Tajo's Parquet storage type does not support nested schemas.
+However, we are currently working on adding support for nested schemas and non-scalar types (`TAJO-710 <https://issues.apache.org/jira/browse/TAJO-710>`_).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/tajo/blob/50e8c230/tajo-docs/src/main/sphinx/table_management/rcfile.rst
----------------------------------------------------------------------
diff --git a/tajo-docs/src/main/sphinx/table_management/rcfile.rst b/tajo-docs/src/main/sphinx/table_management/rcfile.rst
index 21f8253..903ac4d 100644
--- a/tajo-docs/src/main/sphinx/table_management/rcfile.rst
+++ b/tajo-docs/src/main/sphinx/table_management/rcfile.rst
@@ -1,6 +1,149 @@
 *************************************
-RCFIle
+RCFile
 *************************************
 
+RCFile, short for Record Columnar File, is a flat file format consisting of binary key/value pairs,
+which shares many similarities with SequenceFile.
 
-(TODO)
\ No newline at end of file
+=========================================
+How to Create an RCFile Table?
+=========================================
+
+If you are not familiar with the ``CREATE TABLE`` statement, please refer to the Data Definition Language :doc:`/sql_language/ddl`.
+
+In order to specify a certain file format for your table, you need to use the ``USING`` clause in your ``CREATE TABLE``
+statement. Below is an example statement for creating a table using RCFile.
+
+.. code-block:: sql
+
+  CREATE TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING RCFILE;
+
+=========================================
+Physical Properties
+=========================================
+
+Some table storage formats provide parameters for enabling or disabling features and adjusting physical parameters.
+The ``WITH`` clause in the CREATE TABLE statement allows users to set those parameters.
+
+Currently, the RCFile storage type provides the following physical properties.
+
+* ``rcfile.serde``: custom (De)serializer class. ``org.apache.tajo.storage.BinarySerializerDeserializer`` is the default (de)serializer class.
+* ``rcfile.null``: NULL character. It is only used when a table uses ``org.apache.tajo.storage.TextSerializerDeserializer``. The default NULL character is an empty string ``''``. Hive's default NULL character is ``'\\N'``.
+* ``compression.codec``: Compression codec. You can enable compression and specify the algorithm used to compress files. The codec name should be the fully qualified class name of a class implementing `org.apache.hadoop.io.compress.CompressionCodec <https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html>`_. By default, compression is disabled.
+
+The following is an example of creating a compressed table using RCFile.
+
+.. code-block:: sql
+
+  CREATE TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING RCFILE WITH ('compression.codec'='org.apache.hadoop.io.compress.SnappyCodec');
+
+=========================================
+RCFile (De)serializers
+=========================================
+
+Tajo provides two built-in (de)serializers for RCFile:
+
+* ``org.apache.tajo.storage.TextSerializerDeserializer``: stores column values in a plain-text form.
+* ``org.apache.tajo.storage.BinarySerializerDeserializer``: stores column values in a binary form.
+
+The RCFile format can store some metadata in the RCFile header. Tajo writes the (de)serializer class name into
+the metadata header of each RCFile when the RCFile is created in Tajo.
+
+.. note::
+
+  ``org.apache.tajo.storage.BinarySerializerDeserializer`` is the default (de)serializer for RCFile.
+
+
+=========================================
+Compatibility Issues with Apache Hive™
+=========================================
+
+Regardless of whether the RCFiles are written by Apache Hive™ or Apache Tajo™, the files are compatible in both systems.
+In other words, Tajo can process RCFiles written by Apache Hive and vice versa.
+
+Since there is no metadata in RCFiles written by Hive, you need to manually specify the (de)serializer class name
+by setting a physical property.
+
+Hive has two SerDe classes for RCFile, and they correspond to the following (de)serializers in Tajo:
+
+* ``org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe``: corresponds to ``TextSerializerDeserializer`` in Tajo.
+* ``org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe``: corresponds to ``BinarySerializerDeserializer`` in Tajo.
+
+The compatibility issue mostly occurs when a user creates an external table pointing to the data of an existing table.
+The following sections explain two cases: 1) Tajo reads an RCFile written by Hive, and
+2) Hive reads an RCFile written by Tajo.
+
+-----------------------------------------
+When Tajo reads RCFile generated in Hive
+-----------------------------------------
+
+To create an external RCFile table generated with ``ColumnarSerDe`` in Hive,
+you should set the physical property ``rcfile.serde`` in Tajo as follows:
+
+.. code-block:: sql
+
+  CREATE EXTERNAL TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING RCFILE WITH ('rcfile.serde'='org.apache.tajo.storage.TextSerializerDeserializer', 'rcfile.null'='\\N')
+  LOCATION '....';
+
+To create an external RCFile table generated with ``LazyBinaryColumnarSerDe`` in Hive,
+you should set the physical property ``rcfile.serde`` in Tajo as follows:
+
+.. code-block:: sql
+
+  CREATE EXTERNAL TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING RCFILE WITH ('rcfile.serde' = 'org.apache.tajo.storage.BinarySerializerDeserializer')
+  LOCATION '....';
+
+.. note::
+
+  As mentioned above, ``BinarySerializerDeserializer`` is the default (de)serializer for RCFile.
+  So, you can omit the ``rcfile.serde`` property when the table uses
+  ``org.apache.tajo.storage.BinarySerializerDeserializer``; see the sketch below.
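+
+For example, the ``LazyBinaryColumnarSerDe`` statement above can be written as the following sketch with
+``rcfile.serde`` omitted:
+
+.. code-block:: sql
+
+  CREATE EXTERNAL TABLE table1 (
+    id int,
+    name text,
+    score float,
+    type text
+  ) USING RCFILE
+  LOCATION '....';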
+
+-----------------------------------------
+When Hive reads RCFile generated in Tajo
+-----------------------------------------
+
+To read an RCFile table written by Tajo with ``TextSerializerDeserializer`` from Hive,
+you should create a Hive table with the ``SERDE`` set as follows:
+
+.. code-block:: sql
+
+  CREATE TABLE table1 (
+    id int,
+    name string,
+    score float,
+    type string
+  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE
+  LOCATION '<hdfs_location>';
+
+To read an RCFile table written by Tajo with ``BinarySerializerDeserializer`` from Hive,
+you should create a Hive table with the ``SERDE`` set as follows:
+
+.. code-block:: sql
+
+  CREATE TABLE table1 (
+    id int,
+    name string,
+    score float,
+    type string
+  ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' STORED AS RCFILE
+  LOCATION '<hdfs_location>';
\ No newline at end of file