Posted to commits@seatunnel.apache.org by ni...@apache.org on 2022/09/19 08:16:02 UTC

[incubator-seatunnel-website] branch main updated: add (#145)

This is an automated email from the ASF dual-hosted git repository.

nielifeng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-seatunnel-website.git


The following commit(s) were added to refs/heads/main by this push:
     new cc9b32326 add (#145)
cc9b32326 is described below

commit cc9b3232626cb7fc43367dea116f5b84f6c1b5bd
Author: lifeng <53...@users.noreply.github.com>
AuthorDate: Mon Sep 19 16:15:57 2022 +0800

    add (#145)
---
 blog/2022-09-12-SeaTunnel-2.1.3-released.md        |  60 +++++
 ...tributors | Why-do I contribute-to-SeaTunnel.md |  40 ++++
 ...-for-SeaTunnel-Connector-Development-Process.md | 242 +++++++++++++++++++++
 3 files changed, 342 insertions(+)

diff --git a/blog/2022-09-12-SeaTunnel-2.1.3-released.md b/blog/2022-09-12-SeaTunnel-2.1.3-released.md
new file mode 100644
index 000000000..97dd98bd0
--- /dev/null
+++ b/blog/2022-09-12-SeaTunnel-2.1.3-released.md
@@ -0,0 +1,60 @@
+# SeaTunnel 2.1.3 released! Introducing the Assert Sink connector and NullRate, Nulltf Transforms
+
+![](https://miro.medium.com/max/1400/1*7jtTFNpvwC6nquA-BLfqGg.png)
+
+More than a month after the release of Apache SeaTunnel (Incubating) 2.1.2, we have been collecting user and developer feedback to bring you version 2.1.3. The new version introduces the Assert Sink connector, which the community urgently needed, and two Transforms, NullRate and Nulltf. Some usability problems in the previous version have also been fixed, improving stability and efficiency.
+
+This article will introduce the details of the update of Apache SeaTunnel(Incubating) **version 2.1.3**.
+
+* Release Note: [https://github.com/apache/incubator-seatunnel/blob/2.1.3/release-note.md](https://github.com/apache/incubator-seatunnel/blob/2.1.3/release-note.md)
+* Download address: [https://seatunnel.apache.org/download](https://seatunnel.apache.org/download)
+
+## Major feature updates
+### Introduces Assert Sink connector
+The Assert Sink connector is introduced in SeaTunnel version 2.1.3 to verify data correctness. Special thanks to Lhyundeadsoul for his contribution.
+
+### Add two Transforms
+In addition, version 2.1.3 adds two Transforms, NullRate and Nulltf, which are used to detect data quality and to convert null values in the data into generated default values. These two Transforms can effectively improve the availability of data and reduce the frequency of abnormal situations. Special thanks to wsyhj and Interest1-wyt for their contributions.
+
+At present, SeaTunnel supports 9 types of Transforms, including Common Options, Json, NullRate, Nulltf, Replace, Split, SQL, UDF, and UUID, and the community is welcome to contribute more Transform types.
+
+For details of Transform, please refer to the official documentation: [https://seatunnel.apache.org/docs/2.1.3/category/transform](https://seatunnel.apache.org/docs/2.1.3/category/transform)
+
+### ClickhouseFile connector now supports the Rsync data transfer method
+At the same time, SeaTunnel 2.1.3 brings Rsync data transfer support to the ClickhouseFile connector; users can now choose between the SCP and Rsync transfer modes. Thanks to Emor-nj for contributing this feature.
+
+### Specific feature updates:
+
+* Flink Fake data supports BigInteger type https://github.com/apache/incubator-seatunnel/pull/2118
+* Add Flink Assert Sink connector https://github.com/apache/incubator-seatunnel/pull/2022
+* Spark ClickhouseFile connector supports Rsync data file transfer method https://github.com/apache/incubator-seatunnel/pull/2074
+* Add Flink Assert Sink e2e module https://github.com/apache/incubator-seatunnel/pull/2036
+* Add NullRate Transform for detecting data quality https://github.com/apache/incubator-seatunnel/pull/1978
+* Add Nulltf Transform for setting defaults https://github.com/apache/incubator-seatunnel/pull/1958
+### Optimization
+* Refactored Spark TiDB-related parameter information
+* Refactor the code to remove redundant code warning information
+* Optimize connector jar package loading logic
+* Add Plugin Discovery module
+* Add documentation for some modules
+* Upgrade common-collection from version 4 to 4.4
+* Upgrade common-codec version to 1.13
+### Bug Fix
+In addition, in response to the feedback from users of version 2.1.2, we also fixed some usability issues, such as the inability to use the same components of Source and Sink, and further improved the stability.
+
+* Fixed the problem of Hudi Source loading twice
+* Fix the problem that the field TwoPhaseCommit is not recognized after Doris 0.15
+* Fixed abnormal data output when accessing Hive using Spark JDBC
+* Fix JDBC data loss when partition_column (partition mode) is set
+* Fix KafkaTableStream schema JSON parsing error
+* Fix Shell script getting APP_DIR path error
+* Updated Flink RunMode enumeration to get correct help messages for run modes
+* Fix the same source and sink registered connector cache error
+* Fix command line parameter -t (--check) conflict with Flink deployment target parameter
+* Fix Jackson type conversion error problem
+* Fix the problem of failure to run scripts in paths other than SeaTunnel_Home
+### Acknowledgment
+Thanks to all the contributors (GitHub IDs, in no particular order); it is your efforts that fuel the launch of this version, and we look forward to more contributions to the Apache SeaTunnel (Incubating) community!
+
+`leo65535, CalvinKirs, mans2singh, ashulin, wanghuan2054, lhyundeadsoul, tobezhou33, Hisoka-X, ic4y, wsyhj, Emor-nj, gleiyu, smallhibiscus, Bingz2, kezhenxu94, youyangkou, immustard, Interest1-wyt, superzhang0929, gaaraG, runwenjun`
+
diff --git a/blog/2022-09-14-Talk -With-Overseas-contributors | Why-do I contribute-to-SeaTunnel.md b/blog/2022-09-14-Talk -With-Overseas-contributors | Why-do I contribute-to-SeaTunnel.md
new file mode 100644
index 000000000..029ebb59e
--- /dev/null
+++ b/blog/2022-09-14-Talk -With-Overseas-contributors | Why-do I contribute-to-SeaTunnel.md	
@@ -0,0 +1,40 @@
+# Talk With Overseas contributors | Why do I contribute to SeaTunnel?
+
+As SeaTunnel gains popularity around the world, it is attracting more and more overseas contributors to its open-source work. Among them is Namgung Chan, a big data platform engineer at Kakao Enterprise Corp., who recently contributed the Neo4j Sink Connector to SeaTunnel. We had a talk with him to learn why SeaTunnel is attractive to him and how he thinks SeaTunnel can gain popularity in the South Korean market.
+
+## Personal Profile
+![](https://miro.medium.com/max/1400/1*sKzXjqu6M_VmoperNBYUGQ.jpeg)
+
+Namgung Chan, South Korea, Big Data Platform Engineer at Kakao enterprise corp.
+
+Blog (written in Korean): https://getchan.github.io/
+GitHub ID: https://github.com/getChan
+LinkedIn: https://www.linkedin.com/in/namgung-chan-6a06441b6/
+
+### Contributions to the community
+He writes the Neo4j Sink Connector code for the new SeaTunnel Connector API.
+
+### How to know SeaTunnel for the first time?
+It was the first time Namgung Chan had engaged in open source. He wanted to learn technical skills by contributing and, at the same time, experience open-source culture.
+
+For him, an open-source project that is written in Java, made for data engineering, and has many ‘help wanted’ or ‘good first issue’ issues was quite suitable. He then found SeaTunnel on the Apache Software Foundation project webpage.
+
+### The first impression of SeaTunnel Community
+Though it was his first open-source experience, he found the community comfortable and interesting to be in. He also felt very welcome, because there are many issues tagged ‘good first issue’ and ‘volunteer wanted’, and code reviews get quick responses.
+
+As he gained knowledge of Neo4j, he grew much more confident in open-source contribution.
+
+### Research and comparison
+Before learning about SeaTunnel, Namgung Chan used Spring Cloud Data Flow for data integration. After experiencing SeaTunnel, he finds it more lightweight than SCDF: in SCDF, every source, processor, and sink component is an individual application, while in SeaTunnel they are not.
+
+Though he hasn’t used SeaTunnel in his working environment yet, Namgung Chan said he would be glad to use it when the need arises, especially for data integration across various data stores.
+
+
+### Expectations for SeaTunnel
+The most exciting new features or optimizations for Namgung Chan are:
+
+* Data integration for various data storage
+* Strict data validation
+* Monitoring extension
+* Low computing resource usage
+* Exactly-once data processing
+
+In the future, Namgung Chan plans to keep contributing, from light issues to heavy ones, and we hope he will have a good time here!
\ No newline at end of file
diff --git a/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md b/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md
new file mode 100644
index 000000000..1025e1a6f
--- /dev/null
+++ b/blog/2022-09-19-Code-Demo-for-SeaTunnel-Connector-Development-Process.md
@@ -0,0 +1,242 @@
+# Code Demo for SeaTunnel Connector Development Process
+At the Apache SeaTunnel & Apache Doris Joint Meetup held on July 24, Liu Li, a senior engineer at WhaleOps and contributor to Apache SeaTunnel, showed an easy way to quickly develop a connector for SeaTunnel.
+
+![](https://miro.medium.com/max/700/1*Rbd5BrSuGiZUQA53DXZrBw.png)
+We’ll divide it into four key parts:
+
+● The definition of a Connector
+
+● How to access data sources and targets
+
+● Code to demonstrate how to implement a Connector
+
+● Sources and targets that are currently supported
+
+## Definition of a Connector
+The Connector consists of Source and Sink and is a concrete implementation of accessing data sources.
+
+Source: The Source is responsible for reading data from sources such as MySQLSource, DorisSource, HDFSSource, TXTSource, and more.
+
+Sink: The Sink is responsible for writing the read data to the target, and includes MySQLSink, ClickHouseSink, HudiSink, and more. Data transfer, or more specifically data synchronization, is completed through the cooperation of the Source and Sink.
+
+![](https://miro.medium.com/max/298/1*hsfa9Xtzt7o028XjCpqoOg.png)
+
+Of course, different sources and sinks can cooperate with each other.
+
+For example, you can use MySQL Source, and Doris Sink to synchronize data from MySQL to Doris, or even read data from MySQL Source and write to HDFS Sink.
+
+## How to access data sources and targets
+
+### How to access Source
+First, let’s take a look at how to access the Source: how to implement a source, and the core interfaces that need to be implemented.
+
+The simplest Source is a single-concurrency Source that does not support state storage or other advanced functions. What interfaces do we need to implement in such a simple source?
+
+First, we need to implement getBoundedness in the Source to declare whether the Source supports real-time (streaming) reading, offline (batch) reading, or both.
+
+createReader creates a Reader, which contains the concrete data-reading logic. A single-concurrency Source is really simple: we only need to implement one method, pollNext, through which the read data is emitted.
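+
+As a minimal sketch, the single-concurrency case looks roughly like this in Java. The interface names below (SimpleSource, SimpleReader, Collector) are simplified stand-ins invented for illustration, not the exact SeaTunnel API signatures; only the concepts getBoundedness, createReader, and pollNext come from the talk.
+
+```java
+import java.util.List;
+
+// Simplified stand-ins for the real SeaTunnel interfaces (illustrative only).
+enum Boundedness { BOUNDED, UNBOUNDED }
+
+interface Collector<T> {
+    void collect(T record); // the engine collects rows emitted by the reader
+}
+
+interface SimpleReader<T> {
+    void pollNext(Collector<T> output) throws Exception; // called repeatedly by the engine
+}
+
+interface SimpleSource<T> {
+    Boundedness getBoundedness(); // offline (BOUNDED), real-time (UNBOUNDED), or both
+    SimpleReader<T> createReader();
+}
+
+// A single-concurrency source that serves a fixed list of rows.
+class InMemorySource implements SimpleSource<String> {
+    private final List<String> rows;
+
+    InMemorySource(List<String> rows) { this.rows = rows; }
+
+    @Override
+    public Boundedness getBoundedness() { return Boundedness.BOUNDED; } // batch-style source
+
+    @Override
+    public SimpleReader<String> createReader() {
+        return new SimpleReader<String>() {
+            private int next = 0;
+
+            @Override
+            public void pollNext(Collector<String> output) {
+                if (next < rows.size()) {
+                    output.collect(rows.get(next++)); // emit one row per poll
+                }
+            }
+        };
+    }
+}
+```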
+
+If concurrent reading is required, what additional interfaces should we implement?
+![](https://miro.medium.com/max/393/1*bRxRjyMOGkVqseQkg0ONWg.png)
+
+For concurrent reading, we’ll introduce a new member, called the Enumerator.
+
+We implement createEnumerator in the Source to create an Enumerator, whose main function is to split the task into segments (splits) and send them to the Readers.
+
+For example, a task can be divided into 4 splits.
+
+If the parallelism is two, there will be two corresponding Readers: two of the four splits are sent to Reader1, and the other two are sent to Reader2.
+
+If the parallelism is higher, say four, then four Readers are created, and the four splits are read concurrently, one per Reader, for improved efficiency.
+
+The Enumerator has a corresponding method, addSplitsBack, which sends splits to the corresponding Reader; the target Reader’s ID can be specified through this method.
+
+Similarly, the Reader has an addSplits method that receives the splits sent by the Enumerator and uses them for data reading.
+
+In a nutshell, for concurrent reading, we need an Enumerator to implement task splitting and send the splits to the reader. Also, the reader receives the splits and uses them for reading.
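+
+Sketched in the same simplified style, the split-assignment flow might look like the following. The Split class, SplitReader interface, and round-robin assignment are illustrative assumptions, not the real SeaTunnel API:
+
+```java
+import java.util.ArrayDeque;
+import java.util.Collections;
+import java.util.List;
+import java.util.Queue;
+
+// Illustrative split handling, not the exact SeaTunnel signatures.
+class Split {
+    final int id; // identifies which slice of the source data this split covers
+    Split(int id) { this.id = id; }
+}
+
+interface SplitReader {
+    void addSplits(List<Split> splits); // receive splits assigned by the enumerator
+}
+
+class Enumerator {
+    private final Queue<Split> pending = new ArrayDeque<>();
+    private final List<SplitReader> readers;
+
+    Enumerator(List<Split> allSplits, List<SplitReader> readers) {
+        this.pending.addAll(allSplits);
+        this.readers = readers;
+    }
+
+    // Hand every pending split to a reader, round-robin by reader id:
+    // e.g. 4 splits over 2 readers means 2 splits per reader.
+    void assignSplits() {
+        int readerId = 0;
+        while (!pending.isEmpty()) {
+            readers.get(readerId).addSplits(Collections.singletonList(pending.poll()));
+            readerId = (readerId + 1) % readers.size();
+        }
+    }
+}
+```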
+
+In addition, if we need to support resuming and exactly-once semantics, what additional interfaces should we implement?
+
+If the goal is to resume the transfer from a breakpoint, we must save the state and restore it. For this, we need to implement a restoreEnumerator in Source.
+
+The restoreEnumerator method restores an Enumerator from saved state, including restoring its splits.
+
+Correspondingly, we need to implement snapshotState in the Enumerator, which saves the Enumerator’s current state at each checkpoint for failure recovery.
+
+At the same time, the Reader will also have a snapshotState method to save the split state of the Reader.
+
+In the event of a failure and restart, the Enumerator can be restored from the saved state. Once the splits are restored, reading can continue from the point of failure.
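+
+A sketch of the checkpoint side, again with invented stand-in types (EnumeratorState, split IDs as plain strings) rather than the real SeaTunnel classes:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Illustrative checkpoint state handling, not the exact SeaTunnel signatures.
+class EnumeratorState {
+    final List<String> unassignedSplitIds; // splits not yet finished at checkpoint time
+
+    EnumeratorState(List<String> unassignedSplitIds) {
+        this.unassignedSplitIds = new ArrayList<>(unassignedSplitIds);
+    }
+}
+
+class CheckpointableEnumerator {
+    private final List<String> unassignedSplitIds;
+
+    CheckpointableEnumerator(List<String> unassignedSplitIds) {
+        this.unassignedSplitIds = unassignedSplitIds;
+    }
+
+    // snapshotState: called at each checkpoint; the engine persists the returned state.
+    EnumeratorState snapshotState(long checkpointId) {
+        return new EnumeratorState(unassignedSplitIds);
+    }
+
+    // restoreEnumerator: after a failure, rebuild the enumerator from the last saved
+    // state so reading resumes from where it stopped instead of from the beginning.
+    static CheckpointableEnumerator restoreEnumerator(EnumeratorState state) {
+        return new CheckpointableEnumerator(new ArrayList<>(state.unassignedSplitIds));
+    }
+}
+```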
+
+Exactly-once semantics requires the source to support data replay, as Kafka, Pulsar, and others do. In addition, the sink must commit in two phases; exactly-once semantics is achieved through the cooperation of such a source and sink.
+
+### How to access Sink
+
+Now, let’s take a look at how to connect to the Sink. What interfaces does the Sink need to implement?
+
+Truth be told, the Sink is relatively simple. When state storage and two-phase commit are not needed, even a concurrent Sink is simple.
+
+To elaborate, the Sink does not distinguish between stream synchronization and batch synchronization, as the Sink, like the entire SeaTunnel API system, supports **Unified Stream and Batch Processing**.
+
+Firstly, we need to implement createWriter. A Writer is used for data writing.
+
+You need to implement a write method in the Writer, through which data is written to the target library.
+
+![](https://miro.medium.com/max/414/1*xQ7DRHdBGv-ofjSYdSHAoA.png)
+
+As shown in the figure above, if the parallelism is set to two, the engine calls the createWriter method twice to create two Writers. The engine feeds data to these two Writers, which write it to the target through the write method.
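+
+In the same simplified sketch style (SimpleSink, SimpleWriter, and the in-memory target are illustrative assumptions, not the real API), a basic Sink looks roughly like this:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Illustrative sink interfaces, not the exact SeaTunnel signatures.
+interface SimpleWriter<T> {
+    void write(T record) throws Exception; // called once per incoming row
+}
+
+interface SimpleSink<T> {
+    SimpleWriter<T> createWriter(); // called once per parallel writer instance
+}
+
+// A sink whose "target table" is just an in-memory list.
+class BufferSink implements SimpleSink<String> {
+    final List<String> targetTable = new ArrayList<>();
+
+    @Override
+    public SimpleWriter<String> createWriter() {
+        return record -> targetTable.add(record); // write one row to the target
+    }
+}
+```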
+
+For a more advanced setup, for example, we need to support **two-phase commit and state storage**.
+
+Here, what additional interfaces should we implement?
+
+First, let’s introduce a new member, the Committer, whose main role is to perform the second-phase commit.
+
+![](https://miro.medium.com/max/414/1*cvj1i2A-E-1c_bCZneshtg.png)
+
+Since the Sink keeps state, the Writer must be restorable from that state; hence, restoreWriter should be implemented.
+
+Also, since we have introduced a new member, the Committer, we should implement createCommitter in the Sink. This method creates the Committer used for the second-phase commit or rollback.
+
+In this case, what additional interfaces does Writer need to implement?
+
+Since it is a two-phase commit, the first-phase commit is done in the Writer by implementing the prepareCommit method.
+
+In addition, state storage and failure recovery are also supported, meaning we need snapshotState to take snapshots at checkpoints and save the state for failure recovery.
+
+The Committer is the core here. It is mainly used for rollback and commit operations in the second phase.
+
+Consider the process of writing data to a database: the engine triggers the first-phase commit at the checkpoint, and the Writer performs its prepareCommit.
+
+At the same time, the Writer returns commitInfo to the engine, and the engine judges whether the first-phase commits of all Writers were successful.
+
+If they are indeed successful, the engine will use the commit method to actually commit.
+
+For MySQL, the first-phase commit just saves a transaction ID and sends it to the engine, which then determines whether that transaction is committed or rolled back.
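+
+Putting the pieces together, here is a sketch of the two-phase flow. Everything below (CommitInfo carrying a transaction ID, the buffering, the tx-ID format) is an illustrative assumption in the spirit of the transaction-ID example above, not SeaTunnel’s actual implementation:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Phase 1 runs in each Writer at checkpoint time; phase 2 runs in the Committer.
+class CommitInfo {
+    final String transactionId; // what the engine passes from Writer to Committer
+    CommitInfo(String transactionId) { this.transactionId = transactionId; }
+}
+
+class TwoPhaseWriter {
+    private final List<String> buffered = new ArrayList<>();
+
+    void write(String record) { buffered.add(record); }
+
+    // Phase 1: stage the buffered rows (e.g. write them under an open transaction)
+    // and hand the transaction ID back to the engine.
+    CommitInfo prepareCommit() {
+        String txId = "tx-" + System.nanoTime();
+        // ... stage `buffered` under transaction txId here ...
+        buffered.clear();
+        return new CommitInfo(txId);
+    }
+}
+
+class TwoPhaseCommitter {
+    // Phase 2a: the engine calls commit only if every writer's prepareCommit succeeded.
+    void commit(List<CommitInfo> infos) {
+        for (CommitInfo info : infos) {
+            // ... COMMIT the transaction identified by info.transactionId ...
+        }
+    }
+
+    // Phase 2b: otherwise the staged transactions are rolled back.
+    void abort(List<CommitInfo> infos) {
+        for (CommitInfo info : infos) {
+            // ... ROLLBACK info.transactionId ...
+        }
+    }
+}
+```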
+
+## How to implement the Connector
+We’ve taken a look at Source and Sink; let’s now look at how to access the data source and implement your own Connector.
+
+Firstly, we need to build a development environment for the Connector.
+
+### The necessary environment
+1. Java 1.8 or 11, Maven, IntelliJ IDEA
+
+2. Windows users additionally need Git Bash (https://gitforwindows.org/)
+
+3. Once you have these, you can download the SeaTunnel source code by cloning the repository.
+
+4. Download the SeaTunnel source code:
+   1. `git clone https://github.com/apache/incubator-seatunnel.git`
+   2. `cd incubator-seatunnel`
+
+### SeaTunnel Engineering Structure
+We then open the project in the IDE and see the directory structure shown in the figure:
+
+![](https://miro.medium.com/max/700/1*utRhNAsYiqQqBFa4Tjewgw.png)
+
+The directory is divided into several parts:
+
+1. connector-v2
+
+Specific implementations of the new Connector (Connector-v2) are placed in this module.
+
+2. connector-v2-dist
+
+The translation layer of the new connector, which translates the connector into implementations for specific engines such as Spark, Flink, and ST-Engine, so that it does not have to be implemented separately under each engine. ST-Engine is the "important, big project" the community is striving to implement, and it is worth the wait.
+
+3. examples
+
+This package provides a single-machine local operation method, which is convenient for debugging while implementing the Connector.
+
+4. e2e
+
+The e2e package is for e2e testing of the Connector.
+
+Next, let’s check out how a Connector can be created (based on the new Connector). Here is the step-by-step process:
+
+1. Create a new module in the seatunnel-connectors-v2 directory and name it this way: connector-{connector name}.
+
+2. The pom file can refer to the pom file of an existing connector; also add the new child module to the parent module's pom file.
+
+3. Create two new packages corresponding to the packages of Source and Sink, respectively:
+
+a. org.apache.seatunnel.connectors.seatunnel.{connector name}.source
+
+b. org.apache.seatunnel.connectors.seatunnel.{connector name}.sink
+
+Take this mysocket example shown in the figure:
+
+![](https://miro.medium.com/max/700/1*K1btD2gNwYxj96OJnPfW2Q.png)
+
+Then develop the connector implementation. During implementation, you can use the example module for local debugging if needed; this module mainly provides local running environments for Flink and Spark.
+
+![](https://miro.medium.com/max/700/1*qOc3q7okzo7jObHxloc7WQ.png)
+
+As you can see in the image, there are numerous examples under the example module, including seatunnel-flink-connector-v2-example.
+
+So how do you use them?
+
+Let’s take an example. The debugging steps on Flink are as follows (these actions are under the seatunnel-flink-connector-v2-example module):
+
+1. Add connector dependencies in pom.xml
+
+2. Add the task configuration file under resources/examples
+
+3. Configure the file in the SeaTunnelApiExample main method
+
+4. Run the main method
+
+### Code Demo
+
+This code demonstration is based on DingTalk.
+
+Here’s a reference (19:35–37:10):
+
+https://weixin.qq.com/sph/A1ri7B
+
+![](https://miro.medium.com/max/700/1*ej9ronizPtC09ILWJDlbUg.png)
+
+### New Connectors supported at this stage
+
+Contribution statistics for the connectors completed as of July 14 are shown below. You are more than welcome to try them out, and to raise issues in our community if you find bugs.
+
+![](https://miro.medium.com/max/700/1*RHNJDcbvKmSt2UGGSz3Icg.png)
+
+The Connectors shown below have already been claimed and are being developed:
+
+![](https://miro.medium.com/max/700/1*RHNJDcbvKmSt2UGGSz3Icg.png)
+
+Also, we have Connectors on the roadmap that we want to support in the near future. To foster the process, the SeaTunnel community initiated the SeaTunnel Connector Access Incentive Plan, and you are more than welcome to contribute to the project.
+
+SeaTunnel Connector Access Incentive Plan: https://github.com/apache/incubator-seatunnel/issues/1946
+
+You can claim any task that hasn’t been marked as claimed in the comment area, and take the prize home! Here are some of the connectors that need support as soon as possible:
+![](https://miro.medium.com/max/414/1*n-ixPtq066Acx4Ja5qNQqw.png)
+In fact, the implementations of Connectors like Feishu, DingTalk, and Facebook Messenger are quite simple, as these connectors do not need to carry a large amount of data (just a simple Source and Sink). This is in sharp contrast to Hive and other databases, which must consider transaction consistency or concurrency issues.
+
+We welcome everyone to make contributions and join our Apache SeaTunnel family!
+
+## About SeaTunnel
+SeaTunnel (formerly Waterdrop) is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive amounts of data and can stably and efficiently synchronize hundreds of billions of records per day.
+
+### Why do we need SeaTunnel?
+
+SeaTunnel does everything it can to solve the problems you may encounter in synchronizing massive amounts of data.
+
+* Data loss and duplication
+* Task buildup and latency
+* Low throughput
+* Long application-to-production cycle time
+* Lack of application status monitoring
+
+### SeaTunnel Usage Scenarios
+* Massive data synchronization
+* Massive data integration
+* ETL of large volumes of data
+* Massive data aggregation
+* Multi-source data processing
+
+### Features of SeaTunnel
+
+* Rich components
+* High scalability
+* Easy to use
+* Mature and stable