Posted to commits@seatunnel.apache.org by wu...@apache.org on 2022/03/25 05:47:49 UTC

[incubator-seatunnel] branch dev updated: [Doc][Connector] Add ClickhouseFile document (#1558)

This is an automated email from the ASF dual-hosted git repository.

wuchunfu pushed a commit to branch dev
in repository https://gitbox.apache.org/repos/asf/incubator-seatunnel.git


The following commit(s) were added to refs/heads/dev by this push:
     new 2e34311  [Doc][Connector] Add ClickhouseFile document (#1558)
2e34311 is described below

commit 2e34311bd97447b3c0024c3bd4040f80050fb428
Author: TrickyZerg <32...@users.noreply.github.com>
AuthorDate: Fri Mar 25 13:47:44 2022 +0800

    [Doc][Connector] Add ClickhouseFile document (#1558)
    
    * [ST-1382][feat] Add clickhouse-file sink support clickhouse bulk load
    update first piece
    
    * move ClickhouseFile location
    
    * first full commit
    
    * support ClickhouseFile
    
    * support clickhouse bulk load mode
    
    * add TODO
    
    * code improve: Extract the Table class
    
    * code improve: remove comment
    
    * add apache license in class file
    
    * add dependency in known-dependencies.txt
    
    * add clickhouse-file doc
    
    * fix document's miss part
---
 .../spark/configuration/sink-plugins/Clickhouse.md |   4 +-
 .../{Clickhouse.md => ClickhouseFile.md}           | 110 +++++++++++----------
 2 files changed, 62 insertions(+), 52 deletions(-)

diff --git a/docs/en/spark/configuration/sink-plugins/Clickhouse.md b/docs/en/spark/configuration/sink-plugins/Clickhouse.md
index d561de2..e510eee 100644
--- a/docs/en/spark/configuration/sink-plugins/Clickhouse.md
+++ b/docs/en/spark/configuration/sink-plugins/Clickhouse.md
@@ -70,8 +70,8 @@ The way to specify the parameter is to add the prefix `clickhouse.` to the origi
 
 ### split_mode [boolean]
 
-This mode only support clickhouse table which engine is 'Distributed'. They will split distributed table 
-data in seatunnel and perform the write directly on each shard. The shard weight define is clickhouse will be 
+This mode only supports ClickHouse tables whose engine is 'Distributed', and the `internal_replication` option 
+should be `true`. SeaTunnel splits the distributed table's data and writes directly to each shard. The shard weight defined in ClickHouse will be 
 counted.
 
 ### sharding_key [string]
diff --git a/docs/en/spark/configuration/sink-plugins/Clickhouse.md b/docs/en/spark/configuration/sink-plugins/ClickhouseFile.md
similarity index 59%
copy from docs/en/spark/configuration/sink-plugins/Clickhouse.md
copy to docs/en/spark/configuration/sink-plugins/ClickhouseFile.md
index d561de2..be0709b 100644
--- a/docs/en/spark/configuration/sink-plugins/Clickhouse.md
+++ b/docs/en/spark/configuration/sink-plugins/ClickhouseFile.md
@@ -1,32 +1,30 @@
-# Clickhouse
+# ClickhouseFile
 
-> Sink plugin : Clickhouse [Spark]
+> Sink plugin : ClickhouseFile [Spark]
 
 ## Description
 
-Use [Clickhouse-jdbc](https://github.com/ClickHouse/clickhouse-jdbc) to correspond the data source according to the field name and write it into ClickHouse. The corresponding data table needs to be created in advance before use
+Generate ClickHouse data files with the clickhouse-local program and then send them to the ClickHouse 
+server. This approach is also called bulk load.
 
 ## Options
 
-| name           | type    | required | default value |
-|----------------|---------| -------- |---------------|
-| bulk_size      | number  | no       | 20000         |
-| clickhouse.*   | string  | no       |               |
-| database       | string  | yes      | -             |
-| fields         | array   | no       | -             |
-| host           | string  | yes      | -             |
-| password       | string  | no       | -             |
-| retry          | number  | no       | 1             |
-| retry_codes    | array   | no       | [ ]           |
-| table          | string  | yes      | -             |
-| username       | string  | no       | -             |
-| split_mode     | boolean | no       | false         |
-| sharding_key   | string  | no       | -             |
-| common-options | string  | no       | -             |
-
-### bulk_size [number]
-
-The number of data written through [Clickhouse-jdbc](https://github.com/ClickHouse/clickhouse-jdbc) each time, the `default is 20000` .
+| name                   | type    | required | default value |
+|------------------------|---------|----------|---------------|
+| database               | string  | yes      | -             |
+| fields                 | array   | no       | -             |
+| host                   | string  | yes      | -             |
+| password               | string  | no       | -             |
+| table                  | string  | yes      | -             |
+| username               | string  | no       | -             |
+| sharding_key           | string  | no       | -             |
+| clickhouse_local_path  | string  | yes      | -             |
+| copy_method            | string  | no       | scp           |
+| node_free_password     | boolean | no       | false         |
+| node_pass              | list    | no       | -             |
+| node_pass.node_address | string  | no       | -             |
+| node_pass.password     | string  | no       | -             |
+| common-options         | string  | no       | -             |
 
 ### database [string]
 
@@ -44,41 +42,46 @@ The data field that needs to be output to `ClickHouse` , if not configured, it w
 
 `ClickHouse user password` . This field is only required when the permission is enabled in `ClickHouse` .
 
-### retry [number]
+### table [string]
 
-The number of retries, the default is 1
+table name
 
-### retry_codes [array]
+### username [string]
 
-When an exception occurs, the ClickHouse exception error code of the operation will be retried. For a detailed list of error codes, please refer to [ClickHouseErrorCode](https://github.com/ClickHouse/clickhouse-jdbc/blob/master/clickhouse-jdbc/src/main/java/ru/yandex/clickhouse/except/ClickHouseErrorCode.java)
+`ClickHouse` user username, this field is only required when permission is enabled in `ClickHouse`
 
-If multiple retries fail, this batch of data will be discarded, use with caution! !
+### sharding_key [string]
 
-### table [string]
+When split_mode is used, a node must be chosen for each row. The default is random selection, but the 
+'sharding_key' parameter can be used to specify the field for the sharding algorithm. This option only 
+works when 'split_mode' is true.
 
-table name
+### clickhouse_local_path [string]
 
-### username [string]
+The path of the clickhouse-local program on the Spark node. Since every task needs to call it, 
+clickhouse-local should be located at the same path on each Spark node.
 
-`ClickHouse` user username, this field is only required when permission is enabled in `ClickHouse`
+### copy_method [string]
 
-### clickhouse [string]
+Specifies the method used to transfer files. The default is `scp`; the supported values are `scp` and `rsync`.
 
-In addition to the above mandatory parameters that must be specified by `clickhouse-jdbc` , users can also specify multiple optional parameters, which cover all the [parameters](https://github.com/ClickHouse/clickhouse-jdbc/blob/master/clickhouse-jdbc/src/main/java/ru/yandex/clickhouse/settings/ClickHouseProperties.java) provided by `clickhouse-jdbc` .
+### node_free_password [boolean]
 
-The way to specify the parameter is to add the prefix `clickhouse.` to the original parameter name. For example, the way to specify `socket_timeout` is: `clickhouse.socket_timeout = 50000` . If these non-essential parameters are not specified, they will use the default values given by `clickhouse-jdbc`.
+Because SeaTunnel uses scp or rsync for file transfer, it needs access to the ClickHouse servers.
+If password-free login is configured between each Spark node and the ClickHouse servers, 
+you can set this option to true; otherwise, configure the password for each node in the `node_pass` option.
 
-### split_mode [boolean]
+### node_pass [list]
 
-This mode only support clickhouse table which engine is 'Distributed'. They will split distributed table 
-data in seatunnel and perform the write directly on each shard. The shard weight define is clickhouse will be 
-counted.
+The list of addresses and corresponding passwords of all ClickHouse servers.
 
-### sharding_key [string]
+### node_pass.node_address [string]
 
-When use split_mode, which node to send data to is a problem, the default is random selection, but the 
-'sharding_key' parameter can be used to specify the field for the sharding algorithm. This option only 
-worked when 'split_mode' is true.
+The address corresponding to the clickhouse server
+
+### node_pass.password [string]
+
+The password for the ClickHouse server. Only the root user is supported for now.
 
 ### common options [string]
 
@@ -87,7 +90,7 @@ Sink plugin common parameters, please refer to [Sink Plugin](./sink-plugin.md) f
 ## ClickHouse type comparison table
 
 | ClickHouse field type | Convert plugin conversion goal type | SQL conversion expression     | Description                                           |
-| --------------------- | ----------------------------------- | ----------------------------- | ----------------------------------------------------- |
+| --------------------- | ----------------------------------- | ----------------------------- |-------------------------------------------------------|
 | Date                  | string                              | string()                      | `yyyy-MM-dd` Format string                            |
 | DateTime              | string                              | string()                      | `yyyy-MM-dd HH:mm:ss` Format string                   |
 | String                | string                              | string()                      |                                                       |
@@ -111,13 +114,13 @@ Sink plugin common parameters, please refer to [Sink Plugin](./sink-plugin.md) f
 ```bash
 clickhouse {
     host = "localhost:8123"
-    clickhouse.socket_timeout = 50000
     database = "nginx"
     table = "access_msg"
     fields = ["date", "datetime", "hostname", "http_code", "data_size", "ua", "request_time"]
     username = "username"
     password = "password"
-    bulk_size = 20000
+    clickhouse_local_path = "/usr/bin/clickhouse-local"
+    node_free_password = true
 }
 ```
 
@@ -129,10 +132,17 @@ ClickHouse {
     fields = ["date", "datetime", "hostname", "http_code", "data_size", "ua", "request_time"]
     username = "username"
     password = "password"
-    bulk_size = 20000
-    retry_codes = [209, 210]
-    retry = 3
+    sharding_key = "age"
+    clickhouse_local_path = "/usr/bin/clickhouse-local"
+    node_pass = [
+      {
+        node_address = "localhost1"
+        password = "password"
+      }
+      {
+        node_address = "localhost2"
+        password = "password"
+      }
+    ]
 }
 ```
-
-> In case of network timeout or network abnormality, retry writing 3 times
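
Reading aid for the options documented in this commit: the rules in the ClickhouseFile options table (required fields, the `copy_method` values, and the relationship between `node_free_password` and `node_pass`) can be expressed as a small check. This is a minimal sketch in Python, not SeaTunnel code; the function name `check_clickhouse_file_options` and the idea of passing the options as a plain dict are assumptions made for illustration only.

```python
# Hypothetical helper, not part of SeaTunnel: it only restates the rules
# from the ClickhouseFile options table as a check over a plain dict.
REQUIRED = ("host", "database", "table", "clickhouse_local_path")
COPY_METHODS = ("scp", "rsync")


def check_clickhouse_file_options(options):
    """Return a list of problems found in a ClickhouseFile option dict."""
    problems = []

    # Options marked "required: yes" in the table.
    for name in REQUIRED:
        if not options.get(name):
            problems.append(f"missing required option: {name}")

    # copy_method defaults to scp; only scp and rsync are documented.
    copy_method = options.get("copy_method", "scp")
    if copy_method not in COPY_METHODS:
        problems.append(f"unsupported copy_method: {copy_method}")

    # Without password-free login, each server needs a node_pass entry.
    if not options.get("node_free_password", False):
        node_pass = options.get("node_pass") or []
        if not node_pass:
            problems.append(
                "node_pass is required when node_free_password is false"
            )
        for i, entry in enumerate(node_pass):
            for key in ("node_address", "password"):
                if not entry.get(key):
                    problems.append(f"node_pass[{i}] is missing {key}")

    return problems
```

Called with the options from the first example in the diff (with `node_free_password = true`), the check returns an empty list; dropping `node_free_password` without supplying `node_pass` yields a single problem entry.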