Posted to commits@flink.apache.org by lz...@apache.org on 2020/06/15 02:08:49 UTC

[flink] branch master updated: [FLINK-18140][doc][orc] Add documentation for ORC format

This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git


The following commit(s) were added to refs/heads/master by this push:
     new 79e8882  [FLINK-18140][doc][orc] Add documentation for ORC format
79e8882 is described below

commit 79e88820365f6a1fdfc9cbde7f6b80a67036432b
Author: Jingsong Lee <ji...@gmail.com>
AuthorDate: Mon Jun 15 10:07:14 2020 +0800

    [FLINK-18140][doc][orc] Add documentation for ORC format
    
    This closes #12602
---
 docs/dev/table/connectors/formats/index.md         |  2 +-
 docs/dev/table/connectors/formats/index.zh.md      |  2 +-
 .../connectors/formats/{parquet.md => orc.md}      | 98 ++++++++++------------
 .../connectors/formats/{parquet.md => orc.zh.md}   | 98 ++++++++++------------
 docs/dev/table/connectors/formats/parquet.md       |  2 +-
 docs/dev/table/connectors/formats/parquet.zh.md    |  2 +-
 6 files changed, 96 insertions(+), 108 deletions(-)
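
For context, the page added by this patch documents creating a filesystem table in the ORC format. A minimal sketch of the DDL it describes (the column list here is illustrative, not taken verbatim from the patch):

```sql
-- Hypothetical example in the spirit of the new orc.md page:
-- a filesystem table partitioned by dt, stored as ORC.
CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
  'format' = 'orc'
);
```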

diff --git a/docs/dev/table/connectors/formats/index.md b/docs/dev/table/connectors/formats/index.md
index dcc48f9..e349465 100644
--- a/docs/dev/table/connectors/formats/index.md
+++ b/docs/dev/table/connectors/formats/index.md
@@ -65,7 +65,7 @@ Flink supports the following formats:
          <td><a href="{% link dev/table/connectors/filesystem.md %}">Filesystem</a></td>
         </tr>
         <tr>
-         <td>Apache ORC</td>
+         <td><a href="{% link dev/table/connectors/formats/orc.md %}">Apache ORC</a></td>
          <td><a href="{% link dev/table/connectors/filesystem.md %}">Filesystem</a></td>
         </tr>
     </tbody>
diff --git a/docs/dev/table/connectors/formats/index.zh.md b/docs/dev/table/connectors/formats/index.zh.md
index 0f43f29..92bc738 100644
--- a/docs/dev/table/connectors/formats/index.zh.md
+++ b/docs/dev/table/connectors/formats/index.zh.md
@@ -65,7 +65,7 @@ Flink supports the following formats:
          <td><a href="{% link dev/table/connectors/filesystem.zh.md %}">Filesystem</a></td>
         </tr>
         <tr>
-         <td>Apache ORC</td>
+         <td><a href="{% link dev/table/connectors/formats/orc.zh.md %}">Apache ORC</a></td>
          <td><a href="{% link dev/table/connectors/filesystem.zh.md %}">Filesystem</a></td>
         </tr>
     </tbody>
diff --git a/docs/dev/table/connectors/formats/parquet.md b/docs/dev/table/connectors/formats/orc.md
similarity index 58%
copy from docs/dev/table/connectors/formats/parquet.md
copy to docs/dev/table/connectors/formats/orc.md
index a5ec01d..4c878a4 100644
--- a/docs/dev/table/connectors/formats/parquet.md
+++ b/docs/dev/table/connectors/formats/orc.md
@@ -1,8 +1,8 @@
 ---
-title: "Parquet Format"
-nav-title: Parquet
+title: "Orc Format"
+nav-title: Orc
 nav-parent_id: sql-formats
-nav-pos: 5
+nav-pos: 6
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -29,22 +29,22 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-The [Apache Parquet](https://parquet.apache.org/) format allows to read and write Parquet data.
+The [Apache Orc](https://orc.apache.org/) format allows reading and writing Orc data.
 
 Dependencies
 ------------
 
-In order to setup the Parquet format, the following table provides dependency information for both
+To set up the Orc format, the following table provides dependency information for both
 projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
 
 | Maven dependency   | SQL Client JAR         |
 | :----------------- | :----------------------|
-| `flink-parquet`    |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+| `flink-orc{{site.scala_version_suffix}}` |{% if site.is_stable %}[Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-orc{{site.scala_version_suffix}}/{{site.version}}/flink-sql-orc{{site.scala_version_suffix}}-{{site.version}}.jar) {% else %} Only available for stable releases {% endif %}|
 
-How to create a table with Parquet format
+How to create a table with Orc format
 ----------------
 
-Here is an example to create a table using Filesystem connector and Parquet format.
+Here is an example of creating a table using the Filesystem connector and the Orc format.
 
 <div class="codetabs" markdown="1">
 <div data-lang="SQL" markdown="1">
@@ -59,7 +59,7 @@ CREATE TABLE user_behavior (
 ) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
- 'format' = 'parquet'
+ 'format' = 'orc'
 )
 {% endhighlight %}
 </div>
@@ -84,104 +84,98 @@ Format Options
       <td>required</td>
       <td style="word-wrap: break-word;">(none)</td>
       <td>String</td>
-      <td>Specify what format to use, here should be 'parquet'.</td>
-    </tr>
-    <tr>
-      <td><h5>parquet.utc-timezone</h5></td>
-      <td>optional</td>
-      <td style="word-wrap: break-word;">false</td>
-      <td>Boolean</td>
-      <td>Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone.</td>
+      <td>Specify what format to use; here it should be 'orc'.</td>
     </tr>
     </tbody>
 </table>
 
-Parquet format also supports configuration from [ParquetOutputFormat](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/org/apache/parquet/hadoop/ParquetOutputFormat.html).
-For example, you can configure `parquet.compression=GZIP` to enable gzip compression.
+The Orc format also supports [table properties](https://orc.apache.org/docs/hive-config.html#table-properties).
+For example, you can configure `orc.compress=SNAPPY` to enable Snappy compression.
 
 Data Type Mapping
 ----------------
 
-Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:
-
-- Timestamp: mapping timestamp type to int96 whatever the precision is.
-- Decimal: mapping decimal type to fixed length byte array according to the precision.
-
-The following table lists the type mapping from Flink type to Parquet type.
+The Orc format type mapping is compatible with Apache Hive.
+The following table lists the type mapping from Flink types to Orc types.
 
 <table class="table table-bordered">
     <thead>
       <tr>
         <th class="text-left">Flink Data Type</th>
-        <th class="text-center">Parquet type</th>
-        <th class="text-center">Parquet logical type</th>
+        <th class="text-center">Orc physical type</th>
+        <th class="text-center">Orc logical type</th>
       </tr>
     </thead>
     <tbody>
     <tr>
-      <td>CHAR / VARCHAR / STRING</td>
-      <td>BINARY</td>
-      <td>UTF8</td>
+      <td>CHAR</td>
+      <td>bytes</td>
+      <td>CHAR</td>
+    </tr>
+    <tr>
+      <td>VARCHAR</td>
+      <td>bytes</td>
+      <td>VARCHAR</td>
+    </tr>
+    <tr>
+      <td>STRING</td>
+      <td>bytes</td>
+      <td>STRING</td>
     </tr>
     <tr>
       <td>BOOLEAN</td>
+      <td>long</td>
       <td>BOOLEAN</td>
-      <td></td>
     </tr>
     <tr>
-      <td>BINARY / VARBINARY</td>
+      <td>BYTES</td>
+      <td>bytes</td>
       <td>BINARY</td>
-      <td></td>
     </tr>
     <tr>
       <td>DECIMAL</td>
-      <td>FIXED_LEN_BYTE_ARRAY</td>
+      <td>decimal</td>
       <td>DECIMAL</td>
     </tr>
     <tr>
       <td>TINYINT</td>
-      <td>INT32</td>
-      <td>INT_8</td>
+      <td>long</td>
+      <td>BYTE</td>
     </tr>
     <tr>
       <td>SMALLINT</td>
-      <td>INT32</td>
-      <td>INT_16</td>
+      <td>long</td>
+      <td>SHORT</td>
     </tr>
     <tr>
       <td>INT</td>
-      <td>INT32</td>
-      <td></td>
+      <td>long</td>
+      <td>INT</td>
     </tr>
     <tr>
       <td>BIGINT</td>
-      <td>INT64</td>
-      <td></td>
+      <td>long</td>
+      <td>LONG</td>
     </tr>
     <tr>
       <td>FLOAT</td>
+      <td>double</td>
       <td>FLOAT</td>
-      <td></td>
     </tr>
     <tr>
       <td>DOUBLE</td>
+      <td>double</td>
       <td>DOUBLE</td>
-      <td></td>
     </tr>
     <tr>
       <td>DATE</td>
-      <td>INT32</td>
+      <td>long</td>
       <td>DATE</td>
     </tr>
     <tr>
-      <td>TIME</td>
-      <td>INT32</td>
-      <td>TIME_MILLIS</td>
-    </tr>
-    <tr>
       <td>TIMESTAMP</td>
-      <td>INT96</td>
-      <td></td>
+      <td>timestamp</td>
+      <td>TIMESTAMP</td>
     </tr>
     </tbody>
 </table>
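
To make the `orc.compress` table property mentioned in the new page concrete, here is a hedged sketch: assuming ORC table properties can be supplied alongside the format key in the WITH clause, enabling Snappy compression would look roughly like this (table name and columns are illustrative):

```sql
-- Hypothetical: enabling Snappy compression via an ORC table property,
-- assuming format properties are passed through the WITH clause.
CREATE TABLE user_behavior_snappy (
  user_id BIGINT,
  behavior STRING
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior_snappy',
  'format' = 'orc',
  'orc.compress' = 'SNAPPY'
);
```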
diff --git a/docs/dev/table/connectors/formats/parquet.md b/docs/dev/table/connectors/formats/orc.zh.md
similarity index 58%
copy from docs/dev/table/connectors/formats/parquet.md
copy to docs/dev/table/connectors/formats/orc.zh.md
index a5ec01d..4c878a4 100644
--- a/docs/dev/table/connectors/formats/parquet.md
+++ b/docs/dev/table/connectors/formats/orc.zh.md
@@ -1,8 +1,8 @@
 ---
-title: "Parquet Format"
-nav-title: Parquet
+title: "Orc Format"
+nav-title: Orc
 nav-parent_id: sql-formats
-nav-pos: 5
+nav-pos: 6
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -29,22 +29,22 @@ under the License.
 * This will be replaced by the TOC
 {:toc}
 
-The [Apache Parquet](https://parquet.apache.org/) format allows to read and write Parquet data.
+The [Apache Orc](https://orc.apache.org/) format allows reading and writing Orc data.
 
 Dependencies
 ------------
 
-In order to setup the Parquet format, the following table provides dependency information for both
+To set up the Orc format, the following table provides dependency information for both
 projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.
 
 | Maven dependency   | SQL Client JAR         |
 | :----------------- | :----------------------|
-| `flink-parquet`    |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+| `flink-orc{{site.scala_version_suffix}}` |{% if site.is_stable %}[Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-orc{{site.scala_version_suffix}}/{{site.version}}/flink-sql-orc{{site.scala_version_suffix}}-{{site.version}}.jar) {% else %} Only available for stable releases {% endif %}|
 
-How to create a table with Parquet format
+How to create a table with Orc format
 ----------------
 
-Here is an example to create a table using Filesystem connector and Parquet format.
+Here is an example of creating a table using the Filesystem connector and the Orc format.
 
 <div class="codetabs" markdown="1">
 <div data-lang="SQL" markdown="1">
@@ -59,7 +59,7 @@ CREATE TABLE user_behavior (
 ) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
- 'format' = 'parquet'
+ 'format' = 'orc'
 )
 {% endhighlight %}
 </div>
@@ -84,104 +84,98 @@ Format Options
       <td>required</td>
       <td style="word-wrap: break-word;">(none)</td>
       <td>String</td>
-      <td>Specify what format to use, here should be 'parquet'.</td>
-    </tr>
-    <tr>
-      <td><h5>parquet.utc-timezone</h5></td>
-      <td>optional</td>
-      <td style="word-wrap: break-word;">false</td>
-      <td>Boolean</td>
-      <td>Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone.</td>
+      <td>Specify what format to use; here it should be 'orc'.</td>
     </tr>
     </tbody>
 </table>
 
-Parquet format also supports configuration from [ParquetOutputFormat](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/org/apache/parquet/hadoop/ParquetOutputFormat.html).
-For example, you can configure `parquet.compression=GZIP` to enable gzip compression.
+The Orc format also supports [table properties](https://orc.apache.org/docs/hive-config.html#table-properties).
+For example, you can configure `orc.compress=SNAPPY` to enable Snappy compression.
 
 Data Type Mapping
 ----------------
 
-Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:
-
-- Timestamp: mapping timestamp type to int96 whatever the precision is.
-- Decimal: mapping decimal type to fixed length byte array according to the precision.
-
-The following table lists the type mapping from Flink type to Parquet type.
+The Orc format type mapping is compatible with Apache Hive.
+The following table lists the type mapping from Flink types to Orc types.
 
 <table class="table table-bordered">
     <thead>
       <tr>
         <th class="text-left">Flink Data Type</th>
-        <th class="text-center">Parquet type</th>
-        <th class="text-center">Parquet logical type</th>
+        <th class="text-center">Orc physical type</th>
+        <th class="text-center">Orc logical type</th>
       </tr>
     </thead>
     <tbody>
     <tr>
-      <td>CHAR / VARCHAR / STRING</td>
-      <td>BINARY</td>
-      <td>UTF8</td>
+      <td>CHAR</td>
+      <td>bytes</td>
+      <td>CHAR</td>
+    </tr>
+    <tr>
+      <td>VARCHAR</td>
+      <td>bytes</td>
+      <td>VARCHAR</td>
+    </tr>
+    <tr>
+      <td>STRING</td>
+      <td>bytes</td>
+      <td>STRING</td>
     </tr>
     <tr>
       <td>BOOLEAN</td>
+      <td>long</td>
       <td>BOOLEAN</td>
-      <td></td>
     </tr>
     <tr>
-      <td>BINARY / VARBINARY</td>
+      <td>BYTES</td>
+      <td>bytes</td>
       <td>BINARY</td>
-      <td></td>
     </tr>
     <tr>
       <td>DECIMAL</td>
-      <td>FIXED_LEN_BYTE_ARRAY</td>
+      <td>decimal</td>
       <td>DECIMAL</td>
     </tr>
     <tr>
       <td>TINYINT</td>
-      <td>INT32</td>
-      <td>INT_8</td>
+      <td>long</td>
+      <td>BYTE</td>
     </tr>
     <tr>
       <td>SMALLINT</td>
-      <td>INT32</td>
-      <td>INT_16</td>
+      <td>long</td>
+      <td>SHORT</td>
     </tr>
     <tr>
       <td>INT</td>
-      <td>INT32</td>
-      <td></td>
+      <td>long</td>
+      <td>INT</td>
     </tr>
     <tr>
       <td>BIGINT</td>
-      <td>INT64</td>
-      <td></td>
+      <td>long</td>
+      <td>LONG</td>
     </tr>
     <tr>
       <td>FLOAT</td>
+      <td>double</td>
       <td>FLOAT</td>
-      <td></td>
     </tr>
     <tr>
       <td>DOUBLE</td>
+      <td>double</td>
       <td>DOUBLE</td>
-      <td></td>
     </tr>
     <tr>
       <td>DATE</td>
-      <td>INT32</td>
+      <td>long</td>
       <td>DATE</td>
     </tr>
     <tr>
-      <td>TIME</td>
-      <td>INT32</td>
-      <td>TIME_MILLIS</td>
-    </tr>
-    <tr>
       <td>TIMESTAMP</td>
-      <td>INT96</td>
-      <td></td>
+      <td>timestamp</td>
+      <td>TIMESTAMP</td>
     </tr>
     </tbody>
 </table>
diff --git a/docs/dev/table/connectors/formats/parquet.md b/docs/dev/table/connectors/formats/parquet.md
index a5ec01d..c6f94cd 100644
--- a/docs/dev/table/connectors/formats/parquet.md
+++ b/docs/dev/table/connectors/formats/parquet.md
@@ -39,7 +39,7 @@ projects using a build automation tool (such as Maven or SBT) and SQL Client wit
 
 | Maven dependency   | SQL Client JAR         |
 | :----------------- | :----------------------|
-| `flink-parquet`    |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+| `flink-parquet{{site.scala_version_suffix}}` |{% if site.is_stable %}[Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-sql-parquet{{site.scala_version_suffix}}-{{site.version}}.jar) {% else %} Only available for stable releases {% endif %}|
 
 How to create a table with Parquet format
 ----------------
diff --git a/docs/dev/table/connectors/formats/parquet.zh.md b/docs/dev/table/connectors/formats/parquet.zh.md
index a5ec01d..e6a4876 100644
--- a/docs/dev/table/connectors/formats/parquet.zh.md
+++ b/docs/dev/table/connectors/formats/parquet.zh.md
@@ -39,7 +39,7 @@ projects using a build automation tool (such as Maven or SBT) and SQL Client wit
 
 | Maven dependency   | SQL Client JAR         |
 | :----------------- | :----------------------|
-| `flink-parquet`    |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+| `flink-parquet{{site.scala_version_suffix}}` |{% if site.is_stable %}[Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-sql-parquet{{site.scala_version_suffix}}-{{site.version}}.jar) {% else %} Only available for stable releases {% endif %}|
 
 How to create a table with Parquet format
 ----------------