Posted to commits@flink.apache.org by lz...@apache.org on 2020/06/12 11:11:28 UTC
[flink] branch master updated: [FLINK-18141][doc][parquet] Add
documentation for Parquet format
This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git
The following commit(s) were added to refs/heads/master by this push:
new cb2dc73 [FLINK-18141][doc][parquet] Add documentation for Parquet format
cb2dc73 is described below
commit cb2dc732633caf4c025c2ffc47a97c8ed6479177
Author: Jingsong Lee <ji...@gmail.com>
AuthorDate: Fri Jun 12 19:11:05 2020 +0800
[FLINK-18141][doc][parquet] Add documentation for Parquet format
This closes #12597
---
docs/dev/table/connectors/formats/index.md | 2 +-
docs/dev/table/connectors/formats/index.zh.md | 2 +-
docs/dev/table/connectors/formats/parquet.md | 189 ++++++++++++++++++++++++
docs/dev/table/connectors/formats/parquet.zh.md | 189 ++++++++++++++++++++++++
4 files changed, 380 insertions(+), 2 deletions(-)
diff --git a/docs/dev/table/connectors/formats/index.md b/docs/dev/table/connectors/formats/index.md
index 6ab144d..dcc48f9 100644
--- a/docs/dev/table/connectors/formats/index.md
+++ b/docs/dev/table/connectors/formats/index.md
@@ -61,7 +61,7 @@ Flink supports the following formats:
<td><a href="{% link dev/table/connectors/kafka.md %}">Apache Kafka</a></td>
</tr>
<tr>
- <td>Apache Parquet</td>
+ <td><a href="{% link dev/table/connectors/formats/parquet.md %}">Apache Parquet</a></td>
<td><a href="{% link dev/table/connectors/filesystem.md %}">Filesystem</a></td>
</tr>
<tr>
diff --git a/docs/dev/table/connectors/formats/index.zh.md b/docs/dev/table/connectors/formats/index.zh.md
index 6aef539..0f43f29 100644
--- a/docs/dev/table/connectors/formats/index.zh.md
+++ b/docs/dev/table/connectors/formats/index.zh.md
@@ -61,7 +61,7 @@ Flink supports the following formats:
<td><a href="{% link dev/table/connectors/kafka.zh.md %}">Apache Kafka</a></td>
</tr>
<tr>
- <td>Apache Parquet</td>
+ <td><a href="{% link dev/table/connectors/formats/parquet.zh.md %}">Apache Parquet</a></td>
<td><a href="{% link dev/table/connectors/filesystem.zh.md %}">Filesystem</a></td>
</tr>
<tr>
diff --git a/docs/dev/table/connectors/formats/parquet.md b/docs/dev/table/connectors/formats/parquet.md
new file mode 100644
index 0000000..a5ec01d
--- /dev/null
+++ b/docs/dev/table/connectors/formats/parquet.md
@@ -0,0 +1,189 @@
+---
+title: "Parquet Format"
+nav-title: Parquet
+nav-parent_id: sql-formats
+nav-pos: 5
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<span class="label label-info">Format: Serialization Schema</span>
+<span class="label label-info">Format: Deserialization Schema</span>
+
+* This will be replaced by the TOC
+{:toc}
+
+The [Apache Parquet](https://parquet.apache.org/) format allows reading and writing Parquet data.
+
+Dependencies
+------------
+
+To set up the Parquet format, the following table provides dependency information both for
+projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
+
+| Maven dependency | SQL Client JAR |
+| :----------------- | :----------------------|
+| `flink-parquet` |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+
+How to create a table with Parquet format
+----------------
+
+Here is an example of creating a table using the Filesystem connector and the Parquet format.
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+CREATE TABLE user_behavior (
+ user_id BIGINT,
+ item_id BIGINT,
+ category_id BIGINT,
+ behavior STRING,
+ ts TIMESTAMP(3),
+ dt STRING
+) PARTITIONED BY (dt) WITH (
+ 'connector' = 'filesystem',
+ 'path' = '/tmp/user_behavior',
+ 'format' = 'parquet'
+)
+{% endhighlight %}
+</div>
+</div>
+
+Format Options
+----------------
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 25%">Option</th>
+ <th class="text-center" style="width: 8%">Required</th>
+ <th class="text-center" style="width: 7%">Default</th>
+ <th class="text-center" style="width: 10%">Type</th>
+ <th class="text-center" style="width: 50%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>format</h5></td>
+ <td>required</td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>String</td>
+ <td>Specify which format to use; here it should be 'parquet'.</td>
+ </tr>
+ <tr>
+ <td><h5>parquet.utc-timezone</h5></td>
+ <td>optional</td>
+ <td style="word-wrap: break-word;">false</td>
+ <td>Boolean</td>
+ <td>Whether to use the UTC timezone (true) or the local timezone (false) for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone.</td>
+ </tr>
+ </tbody>
+</table>
+
+The Parquet format also supports configuration options from [ParquetOutputFormat](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/org/apache/parquet/hadoop/ParquetOutputFormat.html).
+For example, you can configure `parquet.compression=GZIP` to enable gzip compression.
+
+Data Type Mapping
+----------------
+
+Currently, the Parquet format type mapping is compatible with Apache Hive, but differs from Apache Spark:
+
+- Timestamp: the timestamp type is mapped to int96 regardless of the precision.
+- Decimal: the decimal type is mapped to a fixed-length byte array whose length depends on the precision.
+
+The following table lists the type mapping from Flink types to Parquet types.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left">Flink Data Type</th>
+ <th class="text-center">Parquet type</th>
+ <th class="text-center">Parquet logical type</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>CHAR / VARCHAR / STRING</td>
+ <td>BINARY</td>
+ <td>UTF8</td>
+ </tr>
+ <tr>
+ <td>BOOLEAN</td>
+ <td>BOOLEAN</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>BINARY / VARBINARY</td>
+ <td>BINARY</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DECIMAL</td>
+ <td>FIXED_LEN_BYTE_ARRAY</td>
+ <td>DECIMAL</td>
+ </tr>
+ <tr>
+ <td>TINYINT</td>
+ <td>INT32</td>
+ <td>INT_8</td>
+ </tr>
+ <tr>
+ <td>SMALLINT</td>
+ <td>INT32</td>
+ <td>INT_16</td>
+ </tr>
+ <tr>
+ <td>INT</td>
+ <td>INT32</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>BIGINT</td>
+ <td>INT64</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>FLOAT</td>
+ <td>FLOAT</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DOUBLE</td>
+ <td>DOUBLE</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DATE</td>
+ <td>INT32</td>
+ <td>DATE</td>
+ </tr>
+ <tr>
+ <td>TIME</td>
+ <td>INT32</td>
+ <td>TIME_MILLIS</td>
+ </tr>
+ <tr>
+ <td>TIMESTAMP</td>
+ <td>INT96</td>
+ <td></td>
+ </tr>
+ </tbody>
+</table>
+
+<span class="label label-danger">Attention</span> Composite data types (Array, Map and Row) are not supported.
diff --git a/docs/dev/table/connectors/formats/parquet.zh.md b/docs/dev/table/connectors/formats/parquet.zh.md
new file mode 100644
index 0000000..a5ec01d
--- /dev/null
+++ b/docs/dev/table/connectors/formats/parquet.zh.md
@@ -0,0 +1,189 @@
+---
+title: "Parquet Format"
+nav-title: Parquet
+nav-parent_id: sql-formats
+nav-pos: 5
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<span class="label label-info">Format: Serialization Schema</span>
+<span class="label label-info">Format: Deserialization Schema</span>
+
+* This will be replaced by the TOC
+{:toc}
+
+The [Apache Parquet](https://parquet.apache.org/) format allows reading and writing Parquet data.
+
+Dependencies
+------------
+
+To set up the Parquet format, the following table provides dependency information both for
+projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
+
+| Maven dependency | SQL Client JAR |
+| :----------------- | :----------------------|
+| `flink-parquet` |{% if site.is_stable %} [Download](https://repo.maven.apache.org/maven2/org/apache/flink/flink-parquet{{site.scala_version_suffix}}/{{site.version}}/flink-parquet{{site.scala_version_suffix}}-{{site.version}}-jar-with-dependencies.jar) {% else %} Only available for stable releases {% endif %}|
+
+How to create a table with Parquet format
+----------------
+
+Here is an example of creating a table using the Filesystem connector and the Parquet format.
+
+<div class="codetabs" markdown="1">
+<div data-lang="SQL" markdown="1">
+{% highlight sql %}
+CREATE TABLE user_behavior (
+ user_id BIGINT,
+ item_id BIGINT,
+ category_id BIGINT,
+ behavior STRING,
+ ts TIMESTAMP(3),
+ dt STRING
+) PARTITIONED BY (dt) WITH (
+ 'connector' = 'filesystem',
+ 'path' = '/tmp/user_behavior',
+ 'format' = 'parquet'
+)
+{% endhighlight %}
+</div>
+</div>
+
+Format Options
+----------------
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 25%">Option</th>
+ <th class="text-center" style="width: 8%">Required</th>
+ <th class="text-center" style="width: 7%">Default</th>
+ <th class="text-center" style="width: 10%">Type</th>
+ <th class="text-center" style="width: 50%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>format</h5></td>
+ <td>required</td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>String</td>
+ <td>Specify which format to use; here it should be 'parquet'.</td>
+ </tr>
+ <tr>
+ <td><h5>parquet.utc-timezone</h5></td>
+ <td>optional</td>
+ <td style="word-wrap: break-word;">false</td>
+ <td>Boolean</td>
+ <td>Whether to use the UTC timezone (true) or the local timezone (false) for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone.</td>
+ </tr>
+ </tbody>
+</table>
+
+The Parquet format also supports configuration options from [ParquetOutputFormat](https://www.javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.0/org/apache/parquet/hadoop/ParquetOutputFormat.html).
+For example, you can configure `parquet.compression=GZIP` to enable gzip compression.
+
+Data Type Mapping
+----------------
+
+Currently, the Parquet format type mapping is compatible with Apache Hive, but differs from Apache Spark:
+
+- Timestamp: the timestamp type is mapped to int96 regardless of the precision.
+- Decimal: the decimal type is mapped to a fixed-length byte array whose length depends on the precision.
+
+The following table lists the type mapping from Flink types to Parquet types.
+
+<table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left">Flink Data Type</th>
+ <th class="text-center">Parquet type</th>
+ <th class="text-center">Parquet logical type</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>CHAR / VARCHAR / STRING</td>
+ <td>BINARY</td>
+ <td>UTF8</td>
+ </tr>
+ <tr>
+ <td>BOOLEAN</td>
+ <td>BOOLEAN</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>BINARY / VARBINARY</td>
+ <td>BINARY</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DECIMAL</td>
+ <td>FIXED_LEN_BYTE_ARRAY</td>
+ <td>DECIMAL</td>
+ </tr>
+ <tr>
+ <td>TINYINT</td>
+ <td>INT32</td>
+ <td>INT_8</td>
+ </tr>
+ <tr>
+ <td>SMALLINT</td>
+ <td>INT32</td>
+ <td>INT_16</td>
+ </tr>
+ <tr>
+ <td>INT</td>
+ <td>INT32</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>BIGINT</td>
+ <td>INT64</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>FLOAT</td>
+ <td>FLOAT</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DOUBLE</td>
+ <td>DOUBLE</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>DATE</td>
+ <td>INT32</td>
+ <td>DATE</td>
+ </tr>
+ <tr>
+ <td>TIME</td>
+ <td>INT32</td>
+ <td>TIME_MILLIS</td>
+ </tr>
+ <tr>
+ <td>TIMESTAMP</td>
+ <td>INT96</td>
+ <td></td>
+ </tr>
+ </tbody>
+</table>
+
+<span class="label label-danger">Attention</span> Composite data types (Array, Map and Row) are not supported.
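
The Format Options section added by this commit names two settings, `parquet.utc-timezone` and `parquet.compression=GZIP`, but the example DDL sets neither. The following is a minimal sketch of how they might be combined, assuming both are passed as table options in the `WITH` clause next to `'format'`. It is not part of the committed files, and the table name `user_behavior_gzip` and its path are hypothetical.

```sql
-- Sketch only: the table name and path are illustrative assumptions.
CREATE TABLE user_behavior_gzip (
  user_id BIGINT,
  item_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior_gzip',
  'format' = 'parquet',
  -- Documented default is false; Hive 3.x uses the UTC timezone.
  'parquet.utc-timezone' = 'true',
  -- Handed through to ParquetOutputFormat to enable gzip compression.
  'parquet.compression' = 'GZIP'
)
```

Only scalar column types are used here, since the documentation states that Array, Map and Row are not supported.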