Posted to commits@inlong.apache.org by GitBox <gi...@apache.org> on 2022/06/13 09:31:11 UTC

[GitHub] [incubator-inlong-website] Oneal65 opened a new pull request, #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Oneal65 opened a new pull request, #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404

   Fixes #403
   
   ### Motivation
   
   Add a doc about how to extend an Extract or Load node for the new Sort
   
   ### Modifications
   
   - Remove `how_to_write_plugin_sort_backup_ch.md`
   - Add `how_to_extend_extract_or_load_node.md`
   
   ### Verifying this change
   
   - [ ] Make sure that the change passes the CI checks.
   
   *(Please pick either of the following options)*
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This change is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
     - *Added integration tests for end-to-end deployment with large payloads (10MB)*
     - *Extended integration test for recovery after broker failure*
   
   ### Documentation
   
     - Does this pull request introduce a new feature? (yes / no)
     - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
     - If a feature is not applicable for documentation, explain why?
     - If a feature is not documented yet in this PR, please create a followup issue for adding the documentation
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-inlong-website] yunqingmoswu commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897862971


##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+---
+title: Sort Plugin
+sidebar_position: 3
+---
+
+# Overview
+
+InLong-Sort is an ETL system; the currently supported extract or load nodes include elasticsearch, hbase, hive, iceberg, jdbc, kafka, mongodb, mysql, orcale, postgres, pulsar and so on. InLong-Sort is an ETL solution based on Flink SQL; the strong expressive power of Flink SQL brings high extensibility and flexibility, and InLong-Sort supports essentially all the semantics that Flink SQL supports. In the few scenarios where the built-in Flink SQL functions do not meet the requirements, they can also be extended through various UDFs. It is also easier to get started for anyone who has used SQL, especially Flink SQL.
+
+This article describes how to extend a new source (abstracted as an Extract Node in InLong) or a new sink (abstracted as a Load Node in InLong) in InLong-Sort. Once the InLong architecture is clear, it is easy to see how a source maps to an Extract Node and how a sink maps to a Load Node. The InLong-Sort architecture can be represented by the following UML object-relationship diagram:
+
+![sort_uml](img/sort_uml.png)
+
+The concepts of the components are:
+
+| **Name**          | **Description**                                             |
+| ----------------- | ----------------------------------------------------------- |
+| Group             | Data flow group, containing multiple data flows; one Group represents one data ingestion |
+| Stream            | Data flow; a data flow has a specific direction              |
+| GroupInfo         | Encapsulation of the data flow in Sort; one GroupInfo can contain multiple DataFlowInfo |
+| StreamInfo        | Abstraction of the data flow in Sort, containing its sources, transformations, destinations, etc. |
+| Node              | Abstraction of the data sources, data transformations and data destinations in data synchronization |
+| ExtractNode       | Source-side abstraction for data synchronization             |
+| TransformNode     | Abstraction of the transformation process in data synchronization |
+| LoadNode          | Destination-side abstraction for data synchronization        |
+| NodeRelationShip  | Abstraction of the relationships between nodes in data synchronization |
+| FieldRelationShip | Abstraction of the relationships between upstream and downstream node fields in data synchronization |
+| FieldInfo         | Node field                                                   |
+| MetaFieldInfo     | Node meta field                                              |
+| Function          | Abstraction of a transformation function                     |
+| FunctionParam     | Abstraction of a function's input parameters                 |
+| ConstantParam     | Constant parameter                                           |
+
+Extending an Extract Node or Load Node involves the following work:
+
+- Inherit the Node class (for example MyExtractNode) and build the concrete extract or load logic;
+- Specify the corresponding Flink connector in the concrete Node class (for example MyExtractNode);
+- Use the concrete Node class (for example MyExtractNode) in the concrete ETL implementation logic.
+
+In the second step an existing flink connector can be used, or users can extend one themselves. For how to extend a flink connector, please refer to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors).
+
+# Extend Extract Node
+
+Extending an ExtractNode takes three steps:
+
+**Step 1**: Inherit the ExtractNode class, which is located at incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/ExtractNode.java, and specify the connector in the implemented ExtractNode;
+
+```Java
+// Inherit the ExtractNode class and implement a concrete class, for example MongoExtractNode
+@EqualsAndHashCode(callSuper = true)
+@JsonTypeName("MongoExtract")
+@Data
+public class MongoExtractNode extends ExtractNode implements Serializable {
+    @JsonInclude(Include.NON_NULL)
+    @JsonProperty("primaryKey")
+    private String primaryKey;
+    ...
+
+    @JsonCreator
+    public MongoExtractNode(@JsonProperty("id") String id,
+                           ...) { ... }
+
+    @Override
+    public Map<String, String> tableOptions() {
+        Map<String, String> options = super.tableOptions();
+        // Configure the specified connector; here it is mongodb-cdc
+        options.put("connector", "mongodb-cdc");
+        ...
+        return options;
+    }
+}
+```
+
+**Step 2**: Add this Extract to the JsonSubTypes of ExtractNode and Node
+
+```java
+// Add the type to the JsonSubTypes of ExtractNode and Node
+...
+@JsonSubTypes({
+        @JsonSubTypes.Type(value = MongoExtractNode.class, name = "mongoExtract")
+})
+...
+public abstract class ExtractNode implements Node{...}
+
+...
+@JsonSubTypes({
+        @JsonSubTypes.Type(value = MongoExtractNode.class, name = "mongoExtract")
+})
+public interface Node {...}
+```
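+
+Registering the subtype is what lets the protocol layer deserialize a node definition back into the concrete class. A minimal sketch of that round trip, assuming Jackson's `ObjectMapper` and a hypothetical `nodeJson` string whose type discriminator (whatever property `@JsonTypeInfo` on Node configures) is set to `mongoExtract`:
+
+```java
+ObjectMapper mapper = new ObjectMapper();
+// nodeJson carries the name registered above, i.e. a discriminator set to "mongoExtract"
+Node node = mapper.readValue(nodeJson, Node.class);
+// Jackson resolves that name through the @JsonSubTypes entry added in this step
+assert node instanceof MongoExtractNode;
+```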
+
+**Step 3**: Extend the flink connector. Check whether the corresponding connector already exists under the (/incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) directory. If it does not yet exist, extend it by referring to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors), either calling an existing flink-connector directly (for example incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) or implementing the related connector yourself.
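+
+If a new connector does have to be written, its core is a table factory whose identifier matches the value put into tableOptions(). A minimal sketch, assuming Flink's `DynamicTableSourceFactory` SPI; the class name `MyCdcTableSourceFactory` and the `hosts` option are placeholders:
+
+```java
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.ConfigOptions;
+import org.apache.flink.table.connector.source.DynamicTableSource;
+import org.apache.flink.table.factories.DynamicTableSourceFactory;
+
+public class MyCdcTableSourceFactory implements DynamicTableSourceFactory {
+
+    // Placeholder option; a real connector declares the options it actually needs
+    private static final ConfigOption<String> HOSTS =
+            ConfigOptions.key("hosts").stringType().noDefaultValue();
+
+    @Override
+    public String factoryIdentifier() {
+        // Must match the value written by tableOptions(), e.g. "mongodb-cdc"
+        return "mongodb-cdc";
+    }
+
+    @Override
+    public Set<ConfigOption<?>> requiredOptions() {
+        Set<ConfigOption<?>> options = new HashSet<>();
+        options.add(HOSTS);
+        return options;
+    }
+
+    @Override
+    public Set<ConfigOption<?>> optionalOptions() {
+        return new HashSet<>();
+    }
+
+    @Override
+    public DynamicTableSource createDynamicTableSource(Context context) {
+        // Build and return the actual source here; omitted in this sketch
+        throw new UnsupportedOperationException("sketch only");
+    }
+}
+```
+
+Flink discovers such factories through Java SPI, so the fully qualified class name also has to be listed in META-INF/services/org.apache.flink.table.factories.Factory.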
+
+# Extend Load Node
+
+Extending a LoadNode takes three steps:
+
+**Step 1**: Inherit the LoadNode class, which is located at incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/LoadNode.java, and specify the connector in the implemented LoadNode;
+
+```java
+// Inherit the LoadNode class and implement a concrete class, for example KafkaLoadNode
+@EqualsAndHashCode(callSuper = true)
+@JsonTypeName("kafkaLoad")
+@Data
+@NoArgsConstructor
+public class KafkaLoadNode extends LoadNode implements Serializable {
+    @Nonnull
+    @JsonProperty("topic")
+    private String topic;
+    ...
+
+    @JsonCreator
+    public KafkaLoadNode(@Nonnull @JsonProperty("topic") String topic,
+                        ...) {...}
+
+    // Use different connectors depending on the configured conditions
+    @Override
+    public Map<String, String> tableOptions() {
+      ...
+        if (format instanceof JsonFormat || format instanceof AvroFormat || format instanceof CsvFormat) {
+            if (StringUtils.isEmpty(this.primaryKey)) {
+                options.put("connector", "kafka");   // kafka connector
+                options.putAll(format.generateOptions(false));
+            } else {
+                options.put("connector", "upsert-kafka"); // upsert-kafka connector
+                options.putAll(format.generateOptions(true));
+            }
+        } else if (format instanceof CanalJsonFormat || format instanceof DebeziumJsonFormat) {
+            options.put("connector", "kafka-inlong");   // kafka-inlong connector
+            options.putAll(format.generateOptions(false));
+        } else {
+            throw new IllegalArgumentException("kafka load Node format is IllegalArgument");
+        }
+        return options;
+    }
+}
+```
+
+**Step 2**: Add this Load to the JsonSubTypes of LoadNode and Node
+
+```java
+// Add the type to the JsonSubTypes of LoadNode and Node
+...
+@JsonSubTypes({
+        @JsonSubTypes.Type(value = KafkaLoadNode.class, name = "kafkaLoad")
+})
+...
+public abstract class LoadNode implements Node{...}
+
+...
+@JsonSubTypes({
+        @JsonSubTypes.Type(value = KafkaLoadNode.class, name = "kafkaLoad")
+})
+public interface Node {...}
+```
+
+**Step 3**: Extend the flink connector. The Kafka sort connector is under incubator-inlong/inlong-sort/sort-connectors/kafka
+
+# Integrate Extract and Load into the InLong-Sort main process

Review Comment:
   Add spaces before and after the English words



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+In the second step an existing flink connector can be used, or users can extend one themselves. For how to extend a flink connector, please refer to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors).

Review Comment:
   flink connector  -> Flink Connectors



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+InLong-Sort is an ETL system; the currently supported extract or load nodes include elasticsearch, hbase, hive, iceberg, jdbc, kafka, mongodb, mysql, orcale, postgres, pulsar and so on. InLong-Sort is an ETL solution based on Flink SQL; the strong expressive power of Flink SQL brings high extensibility and flexibility, and InLong-Sort supports essentially all the semantics that Flink SQL supports. In the few scenarios where the built-in Flink SQL functions do not meet the requirements, they can also be extended through various UDFs. It is also easier to get started for anyone who has used SQL, especially Flink SQL.

Review Comment:
   Capitalize the first letters of elasticsearch, hbase, hive, iceberg, jdbc, kafka, mongodb, mysql, orcale, postgres and pulsar, and add spaces before and after them



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+# Integrate Extract and Load into the InLong-Sort main process
+
+Integrating Extract and Load into the InLong-Sort main process requires building the semantics mentioned in the overview section: Group, Stream, Node and so on. The entry class of InLong-Sort is inlong-sort/sort-core/src/main/java/org/apache/inlong/sort/Entrance.java. For how Extract and Load are integrated into InLong-Sort, refer to the UT below: first build the corresponding ExtractNode and LoadNode, then build the NodeRelation, StreamInfo and GroupInfo, and finally execute it with FlinkSqlParser.
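+
+A rough sketch of that flow follows; the build* helpers are assumed, and the constructor and parser calls are approximations of what the unit tests do rather than verbatim API:
+
+```java
+// Hypothetical wiring of one stream, modeled on the existing Sort unit tests
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
+
+Node input = buildMongoExtractNode();   // assumed helper returning a MongoExtractNode
+Node output = buildKafkaLoadNode();     // assumed helper returning a KafkaLoadNode
+
+StreamInfo streamInfo = new StreamInfo("1", Arrays.asList(input, output),
+        Collections.singletonList(buildNodeRelation(input, output)));   // assumed helper
+GroupInfo groupInfo = new GroupInfo("1", Collections.singletonList(streamInfo));
+
+FlinkSqlParser parser = FlinkSqlParser.getInstance(tableEnv, groupInfo);
+parser.parse();   // builds (and, in the UTs, executes) the Flink SQL for the group
+```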

Review Comment:
   inlong-sort/sort-core/src/main/java/org/apache/inlong/sort/Entrance.java -> `inlong-sort/sort-core/src/main/java/org/apache/inlong/sort/Entrance.java`



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+**Step 1**: Inherit the ExtractNode class, which is located at incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/ExtractNode.java, and specify the connector in the implemented ExtractNode;

Review Comment:
   incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/ExtractNode.java -> `incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/ExtractNode.java`



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+In the second step an existing flink connector can be used, or users can extend one themselves. For how to extend a flink connector, please refer to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors).

Review Comment:
   flink -> Flink



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+**Step 1**: Inherit the LoadNode class, which is located at incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/LoadNode.java, and specify the connector in the implemented LoadNode;

Review Comment:
   incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/LoadNode.java -> `incubator-inlong/inlong-sort/sort-common/src/main/java/org/apache/inlong/sort/protocol/node/LoadNode.java`



##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+**Step 3**: Extend the flink connector. Check whether the corresponding connector already exists under the (/incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) directory. If it does not yet exist, extend it by referring to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors), either calling an existing flink-connector directly (for example incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) or implementing the related connector yourself.

Review Comment:
   flink-connector -> Sort Connectors





[GitHub] [incubator-inlong-website] dockerzhang commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
dockerzhang commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r896694361


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+---
+title: Sort Plugin
+sidebar_position: 3
+---
+
+# Overview
+
+InLong-Sort is known as a real-time ETL system. Currently, supported extract or load includes elasticsearch, HBase, hive, iceberg, JDBC, Kafka, mongodb, mysql, orcale, Postgres, pulsar, etc. InLong-Sort is an ETL solution based on Flink SQL, and the powerful expressive power of Flink SQL brings high scalability and flexibility. Basically, the semantics supported by Flink SQL are supported by InLong-Sort. In some scenarios, when the built-in functions of Flink SQL do not meet the requirements, they can also be extended through various UDFs in InLong-Sort. At the same time, it will be easier for those who have used SQL, especially Flink SQL, to get started.
+
+This article describes how to extend a new source (abstracted as extract node in inlong) or a new sink (abstracted as load node in inlong) in InLong-Sort.  After understanding the InLong-Sort architecture, you can understand how the source corresponds to the extract node, and how the sink corresponds to the load node. The architecture of inlong sort can be represented by UML object relation diagram as: 
+
+![sort_UML](img/sort_uml.png)
+
+The concepts of each component are:

Review Comment:
   it's better to add a table to show the following item information.



##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+The concepts of each component are:
+
+**Group**: data flow group, including multiple data flows, one group represents one data access
+
+**Stream**: data flow, a data flow has a specific flow direction
+
+**GroupInfo**: encapsulation of data flow in sort. a groupinfo can contain multiple dataflowinfo
+
+**StreamInfo**: abstract of data flow in sort, including various sources, transformations, destinations, etc. of the data flow
+
+**Node**: abstraction of data source, data transformation and data destination in data synchronization
+
+**ExtractNode**: source-side abstraction for data synchronization
+
+**TransformNode**: transformation process abstraction of data synchronization
+
+**LoadNode**: destination abstraction for data synchronization
+
+**NodeRelationShip**:  abstraction of each node relationship in data synchronization
+
+**FieldRelationShip**:  abstraction of the relationship between upstream and downstream node fields in data synchronization
+
+**FieldInfo**: node field
+
+**MetaFieldInfo**: node meta fields
+
+**Function**: abstraction of transformation function
+
+**FunctionParam**: input parameter abstraction of function
+
+**ConstantParam**: constant parameters
+
+To extend the extract node or load node, you need to do the following: 1.  Inherit the node class (such as MyExtractNode) and build specific extract or load usage logic; 2.  In a specific node class (such as MyExtractNode), specify the corresponding Flink connector; 3.  Use specific node classes in specific ETL implementation logic (such as MyExtractNode)
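+
+For instance, the skeleton of such a node class could look roughly like the following (a hypothetical MyExtractNode; the connector value is a placeholder, and the fields and @JsonCreator constructor are omitted):
+
+```java
+@EqualsAndHashCode(callSuper = true)
+@JsonTypeName("myExtract")
+@Data
+public class MyExtractNode extends ExtractNode implements Serializable {
+
+    // fields and @JsonCreator constructor omitted
+
+    @Override
+    public Map<String, String> tableOptions() {
+        Map<String, String> options = super.tableOptions();
+        // Point the node at the Flink connector that implements it
+        options.put("connector", "my-connector");
+        return options;
+    }
+}
+```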

Review Comment:
   To extend the extract node or load node, you need to do the following: 
   - Inherit the node class (such as MyExtractNode) and build specific extract or load usage logic
   - In a specific node class (such as MyExtractNode), specify the corresponding Flink connector
   - Use specific node classes in specific ETL implementation logic (such as MyExtractNode)





[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897461767


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+The concepts of each component are:

Review Comment:
   DONE



##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+To extend the extract node or load node, you need to do the following: 1.  Inherit the node class (such as MyExtractNode) and build specific extract or load usage logic; 2.  In a specific node class (such as MyExtractNode), specify the corresponding Flink connector; 3.  Use specific node classes in specific ETL implementation logic (such as MyExtractNode)

Review Comment:
   DONE





[GitHub] [incubator-inlong-website] Oneal65 commented on pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#issuecomment-1154637726

   > sort_uml is not supported to be in this commit
   
   This picture has been cited in the md.




[GitHub] [incubator-inlong-website] gong commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
gong commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r896567043


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+**BuiltInFieldInfo**: node built-in fields

Review Comment:
   It should be `MetaFieldInfo` now





[GitHub] [incubator-inlong-website] dockerzhang merged pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
dockerzhang merged PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404




[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897886854


##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+In the second step an existing flink connector can be used, or users can extend one themselves. For how to extend a flink connector, please refer to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors).

Review Comment:
   DONE





[GitHub] [incubator-inlong-website] healchow commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
healchow commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897887775


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,208 @@
+---
+title: Sort Plugin
+sidebar_position: 3
+---
+
+## Overview
+
+InLong-Sort is known as a real-time ETL system.  Currently, supported extract or load includes elasticsearch, HBase, hive, iceberg, JDBC, Kafka, mongodb, mysql, orcale, Postgres, pulsar, etc。InLong-Sort is an ETL solution based on Flink SQL,The powerful expressive power of Flink SQL brings high scalability and flexibility. Basically, the semantics supported by Flink SQL are supported by InLong-Sort。In some scenarios, when the built-in functions of Flink SQL do not meet the requirements, they can also be extended through various UDFs in InLong-Sort. At the same time, it will be easier for those who have used SQL, especially Flink SQL, to get started.

Review Comment:
   It is recommended to use the official name for those extracts or load nodes.





[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897909876


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,208 @@
+InLong-Sort is known as a real-time ETL system.  Currently, supported extract or load includes elasticsearch, HBase, hive, iceberg, JDBC, Kafka, mongodb, mysql, orcale, Postgres, pulsar, etc。InLong-Sort is an ETL solution based on Flink SQL,The powerful expressive power of Flink SQL brings high scalability and flexibility. Basically, the semantics supported by Flink SQL are supported by InLong-Sort。In some scenarios, when the built-in functions of Flink SQL do not meet the requirements, they can also be extended through various UDFs in InLong-Sort. At the same time, it will be easier for those who have used SQL, especially Flink SQL, to get started.

Review Comment:
   DONE





[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r896575509


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,216 @@
+**BuiltInFieldInfo**: node built-in fields

Review Comment:
   OK 





[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897888473


##########
i18n/zh-CN/docusaurus-plugin-content-docs/current/design_and_concept/how_to_extend_extract_or_load_node_ch.md:
##########
@@ -0,0 +1,205 @@
+**Step 3**: Extend the flink connector. Check whether the corresponding connector already exists under the (/incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) directory. If it does not yet exist, extend it by referring to the flink official documentation [DataStream Connectors ](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/overview/#datastream-connectors), either calling an existing flink-connector directly (for example incubator-inlong/inlong-sort/sort-connectors/mongodb-cdc) or implementing the related connector yourself.

Review Comment:
   DONE





[GitHub] [incubator-inlong-website] yunqingmoswu commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
yunqingmoswu commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897860810


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,208 @@
+# Overview

Review Comment:
   `#` -> `##`





[GitHub] [incubator-inlong-website] Oneal65 commented on a diff in pull request #404: [INLONG-403][Sort] Add doc about how to extend Extract or Load node

Posted by GitBox <gi...@apache.org>.
Oneal65 commented on code in PR #404:
URL: https://github.com/apache/incubator-inlong-website/pull/404#discussion_r897867704


##########
docs/design_and_concept/how_to_extend_extract_or_load_node_en.md:
##########
@@ -0,0 +1,208 @@
+# Overview

Review Comment:
   DONE


