You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/21 12:53:44 UTC

[GitHub] [iceberg] openinx edited a comment on issue #2575: Flakey flink unit tests TestFlinkTableSink#testHashDistributeMode

openinx edited a comment on issue #2575:
URL: https://github.com/apache/iceberg/issues/2575#issuecomment-1046845868


   Here we have only 1 parallelism in this unit test to write records in this stream flow: 
   
   ```
   Source -> Shuffle -> IcebergStreamWriter -> IcebergFilesCommitter.
   ```
   
   And we have the following records that need to write into the partitioned table (partition field is `data`):
   
   ```
   (1, 'aaa'), (1, 'bbb'), (1, 'ccc')
   (2, 'aaa'), (2, 'bbb'), (2, 'ccc')
   (3, 'aaa'), (3, 'bbb'), (3, 'ccc')
   ```
   
   Then: 
   
   **Step#1**  we write the following records into orc files: 
   
   ```
   (1, 'aaa'), (1, 'bbb'), (1, 'ccc')
   (2, 'aaa'), (2, 'bbb'), (2, 'ccc')
   (3, 'bbb'), (3, 'ccc')
   
   # Notice: the record (3, 'aaa') was not emitted to the orc file in the checkpoint#1 
   ```
   
   **Step#2**  checkpoint barrier was encountered, so the IcebergStreamWriter emits the `1.orc`, `2.orc`, `3.orc` to IcebergFilesCommitter.
   **Step#3**  IcebergFilesCommitter was trying to execute `snapshotState`;
   **Step#4**  The IcebergStreamWriter emitted a `4.orc` with `(3, 'aaa')` included (endInput), and then close itself. 
   **Step#5**  The IcebergFilesCommitter commit the transaction with 4 orc data files. (notifyCheckpointComplete)
   
   The log message seems did the above steps, and finally commit all the 4 orc data files in a single transaction. But in fact, the flink  IcebergFilesCommitter won't accept any new file (`4.orc`) to the pending transaction when it is executing the `snapshotState` in **Step#3** because the flink's `StreamTask` is a single thread consuming message from a FIFO queue. I think there must be other wrong thing but I cannot reproduce this thing.
   
   Anyway, the root cause for this failure unit test is: we don't control all the events precisely in a single checkpoint (In this case, few records are accumulated in one checkpoint, but there is still someone which was remained for the `endInput` to emit).  So I think the solution I proposed in this https://github.com/apache/iceberg/pull/4117#issuecomment-1042718844 should still work to fix it fundamentally. 
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org