Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/12/21 08:37:36 UTC

[GitHub] [flink-table-store] JingsongLi commented on a diff in pull request #443: [FLINK-30458] Refactor Table Store Documentation

JingsongLi commented on code in PR #443:
URL: https://github.com/apache/flink-table-store/pull/443#discussion_r1054066928


##########
docs/content/docs/concepts/file-layouts.md:
##########
@@ -0,0 +1,53 @@
+---
+title: "File Layouts"
+weight: 3
+type: docs
+aliases:
+- /concepts/file-layouts.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# File Layouts
+
+All files of a table are stored under one base directory. Table Store files are organized in a layered style. The following image illustrates the file layout. Starting from a snapshot file, Table Store readers can recursively access all records from the table.
+
+{{< img src="/img/file-layout.png">}}
+
+## Snapshot Files
+
+All snapshot files are stored in the `snapshot` directory.
+
+A snapshot file is a JSON file containing information about this snapshot, including
+
+* the schema file in use
+* the manifest list containing all changes prior to this snapshot
+* the manifest list containing all changes of this snapshot

Review Comment:
   These two cases are hard to understand. Can we provide just one?
   `the manifest list containing all changes of this snapshot`



##########
docs/content/docs/sql-api/querying-tables.md:
##########
@@ -24,99 +24,98 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Query Table
-
-You can directly SELECT the table in batch runtime mode of Flink SQL.
-
-```sql
--- Batch mode, read latest snapshot
-SET 'execution.runtime-mode' = 'batch';
-SELECT * FROM MyTable;
-```
-
-## Query Engines
-
-Table Store not only supports Flink SQL queries natively but also provides
-queries from other popular engines. See [Engines]({{< ref "docs/engines/overview" >}})
-
-## Query Optimization
-
-It is highly recommended to specify partition and primary key filters
-along with the query, which will speed up the data skipping of the query.
-
-The filter functions that can accelerate data skipping are:
-- `=`
-- `<`
-- `<=`
-- `>`
-- `>=`
-- `IN (...)`
-- `LIKE 'abc%'`
-- `IS NULL`
-
-Table Store will sort the data by primary key, which speeds up the point queries
-and range queries. When using a composite primary key, it is best for the query
-filters to form a [leftmost prefix](https://dev.mysql.com/doc/refman/5.7/en/multiple-column-indexes.html)
-of the primary key for good acceleration.
-
-Suppose that a table has the following specification:
-
-```sql
-CREATE TABLE orders (
-    catalog_id BIGINT,
-    order_id BIGINT,
-    .....,
-    PRIMARY KEY (catalog_id, order_id) NOT ENFORCED -- composite primary key
-)
-```
-
-The query obtains a good acceleration by specifying a range filter for
-the leftmost prefix of the primary key.
-
-```sql
-SELECT * FROM orders WHERE catalog_id=1025;
-
-SELECT * FROM orders WHERE catalog_id=1025 AND order_id=29495;
-
-SELECT * FROM orders
-  WHERE catalog_id=1025
-  AND order_id>2035 AND order_id<6000;
-```
-
-However, the following filter cannot accelerate the query well.
-
-```sql
-SELECT * FROM orders WHERE order_id=29495;
-
-SELECT * FROM orders WHERE catalog_id=1025 OR order_id=29495;
-```
-
-## Snapshots Table
-
-You can query the snapshot history information of the table through Flink SQL.
+# Querying Tables
+
+Just like all other tables, Table Store tables can be queried with the `SELECT` statement.
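+
+For example, reading the latest snapshot of a table named `MyTable` in batch mode:
+
+```sql
+-- Batch mode, read the latest snapshot
+SET 'execution.runtime-mode' = 'batch';
+SELECT * FROM MyTable;
+```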
+
+## Scan Mode
+
+By setting the `scan.mode` table property, users can specify where and how Table Store sources should produce records.
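+
+For example, one of the modes listed below can be picked per query with Flink's dynamic table options (a sketch, assuming the `OPTIONS` hint is enabled in your Flink version):
+
+```sql
+-- read the snapshot produced by the latest compaction instead of the latest snapshot
+SELECT * FROM MyTable /*+ OPTIONS('scan.mode' = 'compacted-full') */;
+```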
+
+<table class="table table-bordered">
+<thead>
+<tr>
+<th>Scan Mode</th>
+<th>Batch Source Behavior</th>
+<th>Streaming Source Behavior</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>default</td>
+<td colspan="2">
The default scan mode. Determines the actual scan mode according to other table properties. If "scan.timestamp-millis" is set, the actual scan mode will be "from-timestamp"; otherwise it will be "latest-full".
+</td>
+</tr>
+<tr>
+<td>latest-full</td>
+<td>
Produces the latest snapshot of the table.
+</td>
+<td>
Produces the latest snapshot of the table upon first startup, and continues to read subsequent changes.
+</td>
+</tr>
+<tr>
+<td>compacted-full</td>
+<td>
+Produces the snapshot after the latest <a href="{{< ref "docs/concepts/lsm-trees#compactions" >}}">compaction</a>.
+</td>
+<td>
Produces the snapshot after the latest compaction of the table upon first startup, and continues to read subsequent changes.
+</td>
+</tr>
+<tr>
+<td>latest</td>

Review Comment:
   We can remove this legacy one.



##########
docs/content/docs/features/external-log-systems.md:
##########
@@ -0,0 +1,64 @@
+---
+title: "External Log Systems"
+weight: 2
+type: docs
+aliases:
+- /features/external-log-systems.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# External Log Systems
+
+Aside from [underlying table files]({{< ref "docs/features/table-types#changelog-producers" >}}), the changelog of a Table Store table can also be stored in, and consumed from, an external log system such as Kafka. By specifying the `log.system` table property, users can choose which external log system to use.
+
+If an external log system is used, all records written into table files will also be written into the log system. Streaming queries will thus consume changes from the log system instead of from table files.
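+
+For example, a sketch of a table backed by Kafka as its log system; the `kafka.*` keys below are assumptions, see the configuration page for the exact option names:
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    behavior STRING,
+    PRIMARY KEY (user_id) NOT ENFORCED
+) WITH (
+    'log.system' = 'kafka',                     -- use Kafka as the external log system
+    'kafka.bootstrap.servers' = 'broker:9092',  -- assumed option name
+    'kafka.topic' = 'my_topic'                  -- assumed option name
+);
+```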
+
+## Consistency Guarantees
+
+By default, changes in the log system are visible to consumers only after a snapshot, just like table files. This behavior guarantees exactly-once semantics. That is, each record is seen by consumers exactly once.
+
+However, users can also specify the table property `'log.consistency' = 'eventual'` so that the changelog written into the log system can be consumed immediately, without waiting for the next snapshot. This behavior decreases changelog latency, but can only guarantee at-least-once semantics (that is, consumers might see duplicate records) due to possible failures.
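+
+For example, a sketch that opts into eventual consistency when creating the table (building on the Kafka example above; Kafka connection options omitted for brevity):
+
+```sql
+CREATE TABLE MyEventualTable (
+    user_id BIGINT,
+    behavior STRING,
+    PRIMARY KEY (user_id) NOT ENFORCED
+) WITH (
+    'log.system' = 'kafka',
+    'log.consistency' = 'eventual'  -- lower latency, at-least-once semantics
+);
+```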

Review Comment:
   Therefore, in order to produce correct results, Flink SQL automatically adds a normalize node. Similarly, when the changelog producer is `none`, this node not only generates update-before records, but also achieves the effect of deduplication.



##########
docs/content/docs/features/lookup-joins.md:
##########
@@ -0,0 +1,97 @@
+---
+title: "Lookup Joins"

Review Comment:
   I prefer to put this into the SQL API section, and add a `Flink-Only` note.



##########
docs/content/docs/maintenance-actions/_index.md:
##########
@@ -0,0 +1,26 @@
+---
+title: Maintenance Actions

Review Comment:
   Maybe just `Tuning`?
   I don't think `Actions` can cover everything here.



##########
docs/content/docs/sql-api/creating-tables.md:
##########
@@ -0,0 +1,261 @@
+---
+title: "Creating Tables"
+weight: 2
+type: docs
+aliases:
+- /sql-api/creating-tables.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Creating Tables
+
+## Creating Catalog Managed Tables
+
+Tables created in Table Store [catalogs]({{< ref "docs/sql-api/creating-catalogs" >}}) are managed by the catalog. When a table is dropped from the catalog, its table files will also be deleted.
+
+The following SQL assumes that you have registered and are using a Table Store catalog. It creates a managed table named `MyTable` with five columns in the catalog's `default` database.
+
+{{< tabs "catalog-managed-table-example" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+);
+```
+
+{{< /tab >}}
+
+{{< tab "Spark3" >}}
+
+```sql
+CREATE TABLE tablestore.default.MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+);
+```
+
+{{< /tab >}}
+
+{{< /tabs >}}
+
+### Tables with Primary Keys
+
+The following SQL creates a table named `MyTable` with five columns, where `dt`, `hh` and `user_id` are the primary keys.
+
+{{< tabs "primary-keys-example" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING,
+    PRIMARY KEY (dt, hh, user_id) NOT ENFORCED
+);
+```
+
+{{< /tab >}}
+
+{{< tab "Spark3" >}}
+
+```sql
+CREATE TABLE tablestore.default.MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) TBLPROPERTIES (
+    'primary-key' = 'dt,hh,user_id'
+);
+```
+
+{{< /tab >}}
+
+{{< /tabs >}}
+
+### Partitioned Tables
+
+The following SQL creates a table named `MyTable` with five columns partitioned by `dt` and `hh`.
+
+{{< tabs "partitions-example" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) PARTITIONED BY (dt, hh);
+```
+
+{{< /tab >}}
+
+{{< tab "Spark3" >}}
+
+```sql
+CREATE TABLE tablestore.default.MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) PARTITIONED BY (dt, hh);
+```
+
+{{< /tab >}}
+
+{{< /tabs >}}
+
+{{< hint info >}}
+
+Partition keys must be a subset of primary keys if primary keys are defined.
+
+{{< /hint >}}
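+
+For example, a sketch in Flink syntax combining the two snippets above, where the partition keys `dt` and `hh` are a subset of the primary key:
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING,
+    PRIMARY KEY (dt, hh, user_id) NOT ENFORCED
+) PARTITIONED BY (dt, hh);
+```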
+
+### Table Properties
+
+Users can specify table properties to enable features or improve performance of Table Store. For a complete list of such properties, see [configurations]({{< ref "docs/maintenance-actions/configurations" >}}).
+
+The following SQL creates a table named `MyTable` with five columns partitioned by `dt` and `hh`. This table has two properties: `'bucket' = '2'` and `'bucket-key' = 'user_id'`.
+
+{{< tabs "table-properties-example" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) PARTITIONED BY (dt, hh) WITH (
+    'bucket' = '2',
+    'bucket-key' = 'user_id'
+);
+```
+
+{{< /tab >}}
+
+{{< tab "Spark3" >}}
+
+```sql
+CREATE TABLE tablestore.default.MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) PARTITIONED BY (dt, hh) TBLPROPERTIES (
+    'bucket' = '2',
+    'bucket-key' = 'user_id'
+);
+```
+
+{{< /tab >}}
+
+{{< /tabs >}}
+
+## Creating External Tables
+
+External tables are recorded but not managed by catalogs. If an external table is dropped, its table files will not be deleted.
+
+Table Store external tables can be used in any catalog. If you do not want to create a Table Store catalog and just want to read from or write to a table, consider external tables.
+
+{{< tabs "external-table-example" >}}
+
+{{< tab "Flink" >}}
+
+Flink SQL supports reading and writing an external table. External Table Store tables are created by specifying the `connector` and `path` table properties. The following SQL creates an external table named `MyTable` with five columns, where the base path of table files is `hdfs://path/to/table`.
+
+```sql
+CREATE TABLE MyTable (
+    user_id BIGINT,
+    item_id BIGINT,
+    behavior STRING,
+    dt STRING,
+    hh STRING
+) WITH (
+    'connector' = 'table-store',
+    'path' = 'hdfs://path/to/table',
+    'auto-create' = 'true' -- this table property creates table files for an empty table if the table path does not exist
+                           -- currently only supported by Flink
+);
+```
+
+{{< /tab >}}
+
+{{< tab "Hive" >}}
+
+Hive SQL only supports reading from an external table. The following SQL creates an external table named `my_table`, where the base path of table files is `hdfs://path/to/table`. As schemas are stored in table files, users do not need to write column definitions.
+
+```sql
+CREATE EXTERNAL TABLE my_table
+STORED BY 'org.apache.flink.table.store.hive.TableStoreHiveStorageHandler'
+LOCATION 'hdfs://path/to/table';
+```
+
+{{< /tab >}}
+

Review Comment:
   We can add a Spark3 tab here, e.g. `Query Table with Scala API`.



##########
docs/content/docs/sql-api/altering-tables.md:
##########
@@ -0,0 +1,130 @@
+---
+title: "Altering Tables"
+weight: 3
+type: docs
+aliases:
+- /sql-api/altering-tables.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Altering Tables
+
+## Changing/Adding Table Properties
+
+The following SQL sets the `write-buffer-size` table property to `256 MB`.
+
+{{< tabs "set-properties-example" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+ALTER TABLE my_table SET (
+    'write-buffer-size' = '256 MB'
+);
+```
+
+{{< /tab >}}
+
+{{< tab "Spark3" >}}
+
+```sql
+ALTER TABLE tablestore.default.my_table SET TBLPROPERTIES (

Review Comment:
   We can just use `my_table`; Spark also supports `USE catalog.database`.


