Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/09/23 12:38:19 UTC

[GitHub] [incubator-seatunnel] hailin0 opened a new pull request, #2865: [Imporve][Connector-V2] Support AT_LEAST_ONCE for iceberg source connector

hailin0 opened a new pull request, #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865

   ## Purpose of this pull request
   
   Support AT_LEAST_ONCE for iceberg source connector
   
   ## Check list
   
   * [ ] Code changes are covered by tests, or no tests are needed because:
   * [ ] If your PR adds any new Jar binary package, please add a License Notice according to the
     [New License Guide](https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [ ] If necessary, please update the documentation to describe the new feature. https://github.com/apache/incubator-seatunnel/tree/dev/docs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] ashulin commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by GitBox <gi...@apache.org>.
ashulin commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r979221173


##########
seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/source/enumerator/IcebergBatchSplitEnumerator.java:
##########
@@ -43,29 +45,39 @@ public IcebergBatchSplitEnumerator(@NonNull SourceSplitEnumerator.Context<Iceber
         super(context, sourceConfig, restoreState != null ?
             restoreState.getPendingSplits() : Collections.EMPTY_MAP);
         this.icebergScanContext = icebergScanContext;
+        // split enumeration is not needed during restore scenario
+        this.shouldEnumerate = restoreState == null;
     }
 
     @Override
-    public void run() {
-        super.run();
-
+    public synchronized void run() {
         Set<Integer> readers = context.registeredReaders();
-        log.debug("No more splits to assign." +
-            " Sending NoMoreSplitsEvent to reader {}.", readers);
+        if (shouldEnumerate) {
+            loadAllSplitsToPendingSplits(icebergTableLoader.loadTable());
+            assignPendingSplits(readers);
+        }
+
+        log.debug("No more splits to assign. Sending NoMoreSplitsEvent to readers {}.", readers);
         readers.forEach(context::signalNoMoreSplits);

Review Comment:
   Please consider whether `context#signalNoMoreSplits` should be called inside `AbstractSplitEnumerator#assignPendingSplits`.
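
One way to read this suggestion is sketched below. It is only an illustration under stated assumptions: `SketchContext`, `AbstractEnumeratorSketch`, and the `noMoreNewSplits()` hook are hypothetical names, splits are plain strings, and none of this is the SeaTunnel API. The idea is that the shared assignment method sends the no-more-splits signal itself, so subclasses do not repeat that step in `run()`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, trimmed-down context; the real enumerator context has more methods. */
interface SketchContext {
    void assignSplit(int reader, List<String> splits);

    void signalNoMoreSplits(int reader);
}

/** Hypothetical base class; splits are modelled as plain strings. */
abstract class AbstractEnumeratorSketch {
    protected final Map<Integer, List<String>> pendingSplits = new HashMap<>();
    protected final SketchContext context;

    AbstractEnumeratorSketch(SketchContext context) {
        this.context = context;
    }

    /** Assumed hook: a batch enumerator returns true once enumeration is done, a streaming one returns false. */
    protected abstract boolean noMoreNewSplits();

    protected synchronized void assignPendingSplits(Set<Integer> readers) {
        for (int reader : readers) {
            List<String> splits = pendingSplits.remove(reader);
            if (splits != null && !splits.isEmpty()) {
                context.assignSplit(reader, splits);
            }
            // The signal is sent here, right after the assignment, instead of in each subclass.
            if (noMoreNewSplits()) {
                context.signalNoMoreSplits(reader);
            }
        }
    }
}
```

With this shape a batch subclass would only override `noMoreNewSplits()` to return `true`, while a streaming subclass would return `false` and never signal.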




[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r982216383


##########
seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/source/enumerator/IcebergStreamSplitEnumerator.java:
##########
@@ -51,35 +50,41 @@ public IcebergStreamSplitEnumerator(@NonNull SourceSplitEnumerator.Context<Icebe
         }
     }
 
+    @Override
+    public synchronized void run() {
+        loadNewSplitsToPendingSplits(icebergTableLoader.loadTable());
+        assignPendingSplits(context.registeredReaders());
+    }
+
     @Override
     public IcebergSplitEnumeratorState snapshotState(long checkpointId) throws Exception {
         return new IcebergSplitEnumeratorState(enumeratorPosition.get(), pendingSplits);
     }
 
     @Override
-    public void handleSplitRequest(int subtaskId) {
-        synchronized (this) {
-            if (pendingSplits.isEmpty() ||
-                pendingSplits.get(subtaskId) == null) {
-                refreshPendingSplits();
-            }
-            assignPendingSplits(Collections.singleton(subtaskId));
+    public synchronized void handleSplitRequest(int subtaskId) {
+        if (pendingSplits.isEmpty() ||

Review Comment:
   I think we need to acquire the checkpoint lock here. I found two problems in this code.
   
   1. From this code we can see that once a split has been sent to the reader, it is removed from `pendingSplits`. We also store `pendingSplits` when the `snapshotState` method is called. We must ensure that updating `pendingSplits` and persisting `pendingSplits` to HDFS are synchronized with each other.
   
   ```
   protected void assignPendingSplits(Set<Integer> pendingReaders) {
       log.debug("Assign pendingSplits to readers {}", pendingReaders);
   
       for (int pendingReader : pendingReaders) {
           List<IcebergFileScanTaskSplit> pendingAssignmentForReader = pendingSplits.remove(pendingReader);
           if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
               log.info("Assign splits {} to reader {}",
                   pendingAssignmentForReader, pendingReader);
               try {
                   context.assignSplit(pendingReader, pendingAssignmentForReader);
               } catch (Exception e) {
                   log.error("Failed to assign splits {} to reader {}",
                       pendingAssignmentForReader, pendingReader, e);
                   pendingSplits.put(pendingReader, pendingAssignmentForReader);
               }
           }
       }
   }
   ```
   
   2. Suppose the enumerator sends split#1 to the reader and then snapshots its state to HDFS (that is, a checkpoint occurs after the split has been sent). If the enumerator task then fails and is restored, split#1 cannot be found in the restored snapshot state.
   
   The reader receives split#1 and updates its own `pendingSplits`. Could `Reader#snapshotState` execute before `pendingSplits` is updated? If this happens, then on the next restore split#1 is found neither in the restored enumerator nor in the restored reader. As a result, split#1 is lost.
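
The first concern can be illustrated with a small sketch. It assumes a simplified, hypothetical `LockedEnumeratorSketch` class with string splits and a `BiConsumer` standing in for `context.assignSplit`; it is not the connector's code and not necessarily how the PR resolved the comment. Assignment and snapshotting take the same lock, and the snapshot is a copy, so a stored snapshot never sees a half-updated map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical, simplified stand-in for the enumerator; not the SeaTunnel classes. */
class LockedEnumeratorSketch {
    // Single lock guarding every read and write of pendingSplits.
    private final Object stateLock = new Object();
    private final Map<Integer, List<String>> pendingSplits = new HashMap<>();

    void addPending(int reader, String split) {
        synchronized (stateLock) {
            pendingSplits.computeIfAbsent(reader, k -> new ArrayList<>()).add(split);
        }
    }

    /** Removes and sends the splits of one reader; puts them back if sending fails. */
    void assign(int reader, BiConsumer<Integer, List<String>> sender) {
        synchronized (stateLock) {
            List<String> splits = pendingSplits.remove(reader);
            if (splits == null || splits.isEmpty()) {
                return;
            }
            try {
                sender.accept(reader, splits);
            } catch (RuntimeException e) {
                pendingSplits.put(reader, splits);
            }
        }
    }

    /** Returns a deep copy taken under the lock, so later assignments cannot change it. */
    Map<Integer, List<String>> snapshotState() {
        synchronized (stateLock) {
            Map<Integer, List<String>> copy = new HashMap<>();
            pendingSplits.forEach((reader, splits) -> copy.put(reader, new ArrayList<>(splits)));
            return copy;
        }
    }
}
```

This only covers the consistency of the enumerator's own state. The second concern, a split that is in flight between the enumerator snapshot and the reader snapshot, depends on how the engine orders checkpoints and is not addressed by a local lock.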




[GitHub] [incubator-seatunnel] hailin0 commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by "hailin0 (via GitHub)" <gi...@apache.org>.
hailin0 commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r1098697262


##########
seatunnel-e2e/seatunnel-connector-v2-e2e/connector-iceberg-hadoop3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/iceberg/hadoop3/IcebergSourceIT.java:
##########
@@ -124,6 +126,7 @@ public void testIcebergSource(TestContainer container) throws IOException, Inter
     }
 
     private void initializeIcebergTable() {
+        FileUtil.fullyDelete(new File(CATALOG_DIR));

Review Comment:
   Test cases can interfere with each other
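
A JDK-only sketch of the same clean-up idea follows. The class name `CatalogDirCleaner` and its method are hypothetical and are not what the test uses; the test calls `FileUtil.fullyDelete`. The sketch removes the directory tree before a test initializes its table and throws if anything cannot be deleted, so leftover files from an earlier test are reported instead of silently reused.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper for resetting a test catalog directory. */
class CatalogDirCleaner {
    /** Deletes dir and everything under it; does nothing if dir does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (Files.notExists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order visits children before their parent directories.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (UncheckedIOException e) {
            // Surface the original I/O failure to the caller.
            throw e.getCause();
        }
    }
}
```

Failing loudly here is a design choice: a partially deleted catalog would otherwise let one test read tables written by another.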




[GitHub] [incubator-seatunnel] hailin0 commented on pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by GitBox <gi...@apache.org>.
hailin0 commented on PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#issuecomment-1256903992

   @ashulin PTAL



[GitHub] [incubator-seatunnel] CalvinKirs commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by GitBox <gi...@apache.org>.
CalvinKirs commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r1071821130


##########
seatunnel-e2e/seatunnel-connector-v2-e2e/connector-iceberg-hadoop3-e2e/src/test/java/org/apache/seatunnel/e2e/connector/iceberg/hadoop3/IcebergSourceIT.java:
##########
@@ -124,6 +126,7 @@ public void testIcebergSource(TestContainer container) throws IOException, Inter
     }
 
     private void initializeIcebergTable() {
+        FileUtil.fullyDelete(new File(CATALOG_DIR));

Review Comment:
   Would this be a problem if it wasn't completely deleted?




[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r1012849108


##########
docs/en/connector-v2/source/Iceberg.md:
##########
@@ -10,7 +10,7 @@ Source connector for Apache Iceberg. It can support batch and stream mode.
 
 - [x] [batch](../../concept/connector-v2-features.md)
 - [x] [stream](../../concept/connector-v2-features.md)
-- [x] [exactly-once](../../concept/connector-v2-features.md)
+- [x] [at-least-once](../../concept/connector-v2-features.md)
 - [x] [schema projection](../../concept/connector-v2-features.md)
 - [x] [parallelism](../../concept/connector-v2-features.md)
 - [ ] [support user-defined split](../../concept/connector-v2-features.md)

Review Comment:
   Please add a changelog section to this document; for reference, see https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/connector-v2/source/Jdbc.md


