Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/09/28 10:14:00 UTC

[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2865: [Imporve][Connector-V2] Imporve iceberg source connector

EricJoy2048 commented on code in PR #2865:
URL: https://github.com/apache/incubator-seatunnel/pull/2865#discussion_r982216383


##########
seatunnel-connectors-v2/connector-iceberg/src/main/java/org/apache/seatunnel/connectors/seatunnel/iceberg/source/enumerator/IcebergStreamSplitEnumerator.java:
##########
@@ -51,35 +50,41 @@ public IcebergStreamSplitEnumerator(@NonNull SourceSplitEnumerator.Context<Icebe
         }
     }
 
+    @Override
+    public synchronized void run() {
+        loadNewSplitsToPendingSplits(icebergTableLoader.loadTable());
+        assignPendingSplits(context.registeredReaders());
+    }
+
     @Override
     public IcebergSplitEnumeratorState snapshotState(long checkpointId) throws Exception {
         return new IcebergSplitEnumeratorState(enumeratorPosition.get(), pendingSplits);
     }
 
     @Override
-    public void handleSplitRequest(int subtaskId) {
-        synchronized (this) {
-            if (pendingSplits.isEmpty() ||
-                pendingSplits.get(subtaskId) == null) {
-                refreshPendingSplits();
-            }
-            assignPendingSplits(Collections.singleton(subtaskId));
+    public synchronized void handleSplitRequest(int subtaskId) {
+        if (pendingSplits.isEmpty() ||

Review Comment:
   I think we need to acquire the checkpoint lock here. I found two problems in this code.
   
   1. From this code we can see that once a split has been sent to the reader, it is removed from `pendingSplits`. We also store `pendingSplits` when the `snapshotState` method is called. We must ensure that updating `pendingSplits` and storing `pendingSplits` to HDFS are synchronized.
   
   ```
   protected void assignPendingSplits(Set<Integer> pendingReaders) {
       log.debug("Assign pendingSplits to readers {}", pendingReaders);

       for (int pendingReader : pendingReaders) {
           List<IcebergFileScanTaskSplit> pendingAssignmentForReader = pendingSplits.remove(pendingReader);
           if (pendingAssignmentForReader != null && !pendingAssignmentForReader.isEmpty()) {
               log.info("Assign splits {} to reader {}",
                   pendingAssignmentForReader, pendingReader);
               try {
                   context.assignSplit(pendingReader, pendingAssignmentForReader);
               } catch (Exception e) {
                   log.error("Failed to assign splits {} to reader {}",
                       pendingAssignmentForReader, pendingReader, e);
                   pendingSplits.put(pendingReader, pendingAssignmentForReader);
               }
           }
       }
   }
   ```
   
   2. The Enumerator sends split#1 to the Reader and then snapshots to HDFS (suppose a checkpoint occurs after the split has been sent). If the Enumerator task then fails and is restored, split#1 cannot be found in the snapshot state.
   
   The Reader receives split#1 and updates `pendingSplits` in the reader. Can `Reader#snapshotState` execute before `pendingSplits` is updated? If this happens, then on the next restore split#1 cannot be found after the enumerator is restored, and the Reader has no split#1 after it is restored either. As a result, split#1 is lost.
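   
   To illustrate the first point, here is a minimal sketch of guarding both the split hand-off and the snapshot with one lock. Everything in it is an assumption for illustration: the class `LockedEnumeratorSketch`, the `checkpointLock` field, the `assigned` map (standing in for `context.assignSplit`), and the use of `String` for splits are hypothetical and are not the SeaTunnel API.
   
   ```java
   import java.util.ArrayList;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;
   
   // Hypothetical, simplified stand-in for the enumerator (not the real class).
   class LockedEnumeratorSketch {
       // Assumed lock shared by split assignment and state snapshots.
       private final Object checkpointLock = new Object();
       private final Map<Integer, List<String>> pendingSplits = new HashMap<>();
       // Records what was handed to each reader; stands in for context.assignSplit.
       private final Map<Integer, List<String>> assigned = new HashMap<>();
   
       void addPending(int reader, String split) {
           synchronized (checkpointLock) {
               pendingSplits.computeIfAbsent(reader, k -> new ArrayList<>()).add(split);
           }
       }
   
       void assignPendingSplits(int reader) {
           // Remove-and-hand-off happens atomically with respect to snapshotState().
           synchronized (checkpointLock) {
               List<String> splits = pendingSplits.remove(reader);
               if (splits != null && !splits.isEmpty()) {
                   assigned.computeIfAbsent(reader, k -> new ArrayList<>()).addAll(splits);
               }
           }
       }
   
       Map<Integer, List<String>> snapshotState() {
           // Deep-copy under the lock so the snapshot never sees a half-moved split.
           synchronized (checkpointLock) {
               Map<Integer, List<String>> copy = new HashMap<>();
               pendingSplits.forEach((reader, splits) -> copy.put(reader, new ArrayList<>(splits)));
               return copy;
           }
       }
   
       List<String> assignedTo(int reader) {
           synchronized (checkpointLock) {
               return new ArrayList<>(assigned.getOrDefault(reader, new ArrayList<>()));
           }
       }
   }
   ```
   
   With this shape, a snapshot sees a split either still pending or already handed off, never in between. It does not by itself address the second point, where the Reader might snapshot before it has recorded the split it received.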


