You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by ya...@apache.org on 2020/10/30 02:06:38 UTC

[incubator-doris] branch master updated: [Doc] Add doc for sequence column (#4814)

This is an automated email from the ASF dual-hosted git repository.

yangzhg pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 32afb11  [Doc] Add doc for sequence column (#4814)
32afb11 is described below

commit 32afb1145806aa4121c5f1cb3404628c6bbf843d
Author: Youngwb <ya...@163.com>
AuthorDate: Fri Oct 30 10:05:15 2020 +0800

    [Doc] Add doc for sequence column (#4814)
---
 .../load-data/sequence-column-manual.md            | 208 +++++++++++++++++++++
 .../load-data/sequence-column-manual.md            | 208 +++++++++++++++++++++
 2 files changed, 416 insertions(+)

diff --git a/docs/en/administrator-guide/load-data/sequence-column-manual.md b/docs/en/administrator-guide/load-data/sequence-column-manual.md
new file mode 100644
index 0000000..8f29789
--- /dev/null
+++ b/docs/en/administrator-guide/load-data/sequence-column-manual.md
@@ -0,0 +1,208 @@
+---
+{
+    "title": "Sequence Column",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Sequence Column
+The Sequence Column currently only supports the Uniq model. The Uniq model is mainly for scenarios requiring a unique primary key, which can guarantee the uniqueness constraint of the primary key. However, due to the use of REPLACE aggregation, the replacement sequence is not guaranteed for data imported in the same batch, which can be described in detail [here](../../getting-started/data-model-rollup.md). If the order of substitution is not guaranteed, then the specific data that is fin [...]
+
+To solve this problem, Doris supported a sequence column by allowing the user to specify the sequence column when importing. Under the same key column, columns of the REPLACE aggregate type will be replaced according to the value of the sequence column, larger values can be replaced with smaller values, and vice versa. In this method, the order is determined by the user, and the user controls the replacement order.
+
+## Principle
+
+Implemented by adding a hidden column `__DORIS_SEQUENCE_COL__`, the type of the column is specified by the user while create the table, determines the specific value of the column on import, and replaces the REPLACE column with that value.
+
+### Create Table
+When you create the Uniq table, a hidden column `__DORIS_SEQUENCE_COL__` is automatically added, depending on the type specified by the user
+
+### Import
+
+When importing, fe sets the value of the hidden column during parsing to the value of the 'order by' expression (Broker Load and routine Load), or the value of the 'function_column.sequence_col' expression (stream load), and the value column will be replaced according to this value. The value of the hidden column `__DORIS_SEQUENCE_COL__` can be set as a column in the source data or in the table structure.
+
+### Read
+
+The request with the value column needs to read the additional column of `__DORIS_SEQUENCE_COL__`, which is used as a basis for the order of replacement aggregation function replacement under the same key column, with the larger value replacing the smaller value and not the reverse.
+
+### Cumulative Compaction
+
+Cumulative Compaction works in the same way as the reading process
+
+### Base Compaction
+
+Base Compaction works in the same way as the reading process
+
+### Syntax
+The syntax aspect of the table construction adds a property to the property identifying the type of `__DORIS_SEQUENCE_COL__`.
+The syntax design aspect of the import is primarily the addition of a mapping from the sequence column to other columns, the settings of each import mode are described below
+
+#### Create Table
+When you create the Uniq table, you can specify the sequence column type
+```
+PROPERTIES (
+    "function_column.sequence_type" = 'Date',
+);
+```
+The sequence_type is used to specify the type of the sequence column, which can be integral and time
+
+#### stream load
+
+The syntax of the stream load is to add the mapping of hidden columns corresponding to source_sequence in the 'function_column.sequence_col' field in the header, for example
+```
+curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load
+```
+
+#### broker load
+
+Set the source_sequence field for the hidden column map at `ORDER BY`
+
+```
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://host:port/user/data/*/test.txt")
+    INTO TABLE `tbl1`
+    COLUMNS TERMINATED BY ","
+    (k1,k2,source_sequence,v1,v2)
+    ORDER BY source_sequence
+)
+WITH BROKER 'broker'
+(
+    "username"="user",
+    "password"="pass"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+#### routine load
+
+The mapping method is the same as above, as shown below
+
+```
+   CREATE ROUTINE LOAD example_db.test1 ON example_tbl 
+    [WITH MERGE|APPEND|DELETE]
+    COLUMNS(k1, k2, source_sequence, v1, v2),
+    WHERE k1 > 100 and k2 like "%doris%"
+    [ORDER BY source_sequence]
+    PROPERTIES
+    (
+        "desired_concurrent_number"="3",
+        "max_batch_interval" = "20",
+        "max_batch_rows" = "300000",
+        "max_batch_size" = "209715200",
+        "strict_mode" = "false"
+    )
+    FROM KAFKA
+    (
+        "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+        "kafka_topic" = "my_topic",
+        "kafka_partitions" = "0,1,2,3",
+        "kafka_offsets" = "101,0,0,200"
+    );
+```
+
+## Enable sequence column support
+If `function_column.sequence_type` is set when creating a new table, then the sequence column will be supported.
+For a table that does not support sequence column, use the following statement if you would like to use this feature:
+`ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable。
+If you want to determine if a table supports sequence column, you can set the session variable to display the hidden column `SET show_hidden_columns=true`, followed by `desc Tablename`, if the output contains the column `__DORIS_SEQUENCE_COL__`, it is supported, if not, it is not supported
+
+## Usage example
+Let's take the stream Load as an example to show how to use it
+1. Create a table that supports sequence column. 
+
+The table structure is shown below
+```
+MySQL > desc test_table;
++-------------+--------------+------+-------+---------+---------+
+| Field       | Type         | Null | Key   | Default | Extra   |
++-------------+--------------+------+-------+---------+---------+
+| user_id     | BIGINT       | No   | true  | NULL    |         |
+| date        | DATE         | No   | true  | NULL    |         |
+| group_id    | BIGINT       | No   | true  | NULL    |         |
+| modify_date | DATE         | No   | false | NULL    | REPLACE |
+| keyword     | VARCHAR(128) | No   | false | NULL    | REPLACE |
++-------------+--------------+------+-------+---------+---------+
+```
+
+2. Import data normally:
+
+Import the following data
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-02-22      b
+1       2020-02-22      1       2020-03-05      c
+1       2020-02-22      1       2020-02-26      d
+1       2020-02-22      1       2020-02-22      e
+1       2020-02-22      1       2020-02-22      b
+```
+Take the Stream Load as an example here and map the sequence column to the modify_date column
+```
+curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load
+```
+The results is
+```
+MySQL > select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-05  | c       |
++---------+------------+----------+-------------+---------+
+```
+In this import, the c is eventually retained in the keyword column because the value of the sequence column (the value in modify_date) is the maximum value: '2020-03-05'.
+
+3. Guarantee of substitution order
+
+After the above steps are completed, import the following data
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-02-23      b
+```
+Query data
+```
+MySQL [test]> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-05  | c       |
++---------+------------+----------+-------------+---------+
+```
+Because the sequence column for the newly imported data are all smaller than the values already in the table, they cannot be replaced
+Try importing the following data again
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-03-23      w
+```
+Query data
+```
+MySQL [test]> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-23  | w       |
++---------+------------+----------+-------------+---------+
+```
+At this point, you can replace the original data in the table
\ No newline at end of file
diff --git a/docs/zh-CN/administrator-guide/load-data/sequence-column-manual.md b/docs/zh-CN/administrator-guide/load-data/sequence-column-manual.md
new file mode 100644
index 0000000..e9ffd0d
--- /dev/null
+++ b/docs/zh-CN/administrator-guide/load-data/sequence-column-manual.md
@@ -0,0 +1,208 @@
+---
+{
+    "title": "sequence列",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# sequence列
+sequence列目前只支持Uniq模型,Uniq模型主要针对需要唯一主键的场景,可以保证主键唯一性约束,但是由于使用REPLACE聚合方式,在同一批次中导入的数据,替换顺序不做保证,详细介绍可以参考[这里](../../getting-started/data-model-rollup.md)。替换顺序无法保证则无法确定最终导入到表中的具体数据,存在了不确定性。
+
+为了解决这个问题,Doris支持了sequence列,通过用户在导入时指定sequence列,相同key列下,REPLACE聚合类型的列将按照sequence列的值进行替换,较大值可以替换较小值,反之则无法替换。该方法将顺序的确定交给了用户,由用户控制替换顺序。
+
+## 原理
+通过增加一个隐藏列`__DORIS_SEQUENCE_COL__`实现,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值对REPLACE列进行替换。
+
+### 建表
+
+创建Uniq表时,将按照用户指定类型自动添加一个隐藏列`__DORIS_SEQUENCE_COL__`
+
+### 导入
+
+导入时,fe在解析的过程中将隐藏列的值设置成 `order by` 表达式的值(broker load和routine load),或者`function_column.sequence_col`表达式的值(stream load), value列将按照该值进行替换。隐藏列`__DORIS_SEQUENCE_COL__`的值既可以设置为数据源中一列,也可以是表结构中的一列。
+
+### 读取
+
+请求包含value列时需要需要额外读取`__DORIS_SEQUENCE_COL__`列,该列用于在相同key列下,REPLACE聚合函数替换顺序的依据,较大值可以替换较小值,反之则不能替换。
+
+### Cumulative Compaction
+
+Cumulative Compaction 时和读取过程原理相同
+
+### Base Compaction
+
+Base Compaction 时读取过程原理相同
+
+### 语法
+建表时语法方面在property中增加了一个属性,用来标识`__DORIS_SEQUENCE_COL__`的类型
+导入的语法设计方面主要是增加一个从sequence列的到其他column的映射,各个导入方式设置的将在下面介绍
+
+#### 建表
+创建Uniq表时,可以指定sequence列类型
+```
+PROPERTIES (
+    "function_column.sequence_type" = 'Date',
+);
+```
+sequence_type用来指定sequence列的类型,可以为整型和时间类型
+
+#### stream load
+
+stream load 的写法是在header中的`function_column.sequence_col`字段添加隐藏列对应的source_sequence的映射, 示例
+```
+curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load
+```
+
+#### broker load
+
+在`ORDER BY` 处设置隐藏列映射的source_sequence字段
+
+```
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://host:port/user/data/*/test.txt")
+    INTO TABLE `tbl1`
+    COLUMNS TERMINATED BY ","
+    (k1,k2,source_sequence,v1,v2)
+    ORDER BY source_sequence
+)
+WITH BROKER 'broker'
+(
+    "username"="user",
+    "password"="pass"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+#### routine load
+
+映射方式同上,示例如下
+
+```
+   CREATE ROUTINE LOAD example_db.test1 ON example_tbl 
+    [WITH MERGE|APPEND|DELETE]
+    COLUMNS(k1, k2, source_sequence, v1, v2),
+    WHERE k1 > 100 and k2 like "%doris%"
+    [ORDER BY source_sequence]
+    PROPERTIES
+    (
+        "desired_concurrent_number"="3",
+        "max_batch_interval" = "20",
+        "max_batch_rows" = "300000",
+        "max_batch_size" = "209715200",
+        "strict_mode" = "false"
+    )
+    FROM KAFKA
+    (
+        "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
+        "kafka_topic" = "my_topic",
+        "kafka_partitions" = "0,1,2,3",
+        "kafka_offsets" = "101,0,0,200"
+    );
+```
+
+## 启用sequence column支持
+在新建表时如果设置了`function_column.sequence_type` ,则新建表将支持sequence column。
+对于一个不支持sequence column的表,如果想要使用该功能,可以使用如下语句:
+`ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。
+如果确定一个表是否支持sequence column,可以通过设置一个session variable来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持
+
+## 使用示例
+下面以stream load 为例 展示下使用方式
+1. 创建支持sequence column的表
+
+表结构如下:
+```
+MySQL > desc test_table;
++-------------+--------------+------+-------+---------+---------+
+| Field       | Type         | Null | Key   | Default | Extra   |
++-------------+--------------+------+-------+---------+---------+
+| user_id     | BIGINT       | No   | true  | NULL    |         |
+| date        | DATE         | No   | true  | NULL    |         |
+| group_id    | BIGINT       | No   | true  | NULL    |         |
+| modify_date | DATE         | No   | false | NULL    | REPLACE |
+| keyword     | VARCHAR(128) | No   | false | NULL    | REPLACE |
++-------------+--------------+------+-------+---------+---------+
+```
+
+2. 正常导入数据:
+
+导入如下数据
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-02-22      b
+1       2020-02-22      1       2020-03-05      c
+1       2020-02-22      1       2020-02-26      d
+1       2020-02-22      1       2020-02-22      e
+1       2020-02-22      1       2020-02-22      b
+```
+此处以stream load为例, 将sequence column映射为modify_date列
+```
+curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load
+```
+结果为
+```
+MySQL > select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-05  | c       |
++---------+------------+----------+-------------+---------+
+```
+在这次导入中,因sequence column的值(也就是modify_date中的值)中'2020-03-05'为最大值,所以keyword列中最终保留了c。
+
+3. 替换顺序的保证
+
+上述步骤完成后,接着导入如下数据
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-02-23      b
+```
+查询数据
+```
+MySQL [test]> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-05  | c       |
++---------+------------+----------+-------------+---------+
+```
+由于新导入的数据的sequence column都小于表中已有的值,无法替换
+再尝试导入如下数据
+```
+1       2020-02-22      1       2020-02-22      a
+1       2020-02-22      1       2020-03-23      w
+```
+查询数据
+```
+MySQL [test]> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date       | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+|       1 | 2020-02-22 |        1 | 2020-03-23  | w       |
++---------+------------+----------+-------------+---------+
+```
+此时就可以替换表中原有的数据
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org