You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "zoucao (via GitHub)" <gi...@apache.org> on 2023/02/06 10:00:06 UTC

[GitHub] [iceberg] zoucao opened a new issue, #6756: The out-of-order problem occurs around the process of recovery

zoucao opened a new issue, #6756:
URL: https://github.com/apache/iceberg/issues/6756

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Flink
   
   ### Please describe the bug 🐞
   
   Recently, we face an out-of-order problem when consuming records from Iceberg. In a nutshell, splits belonging to snapshot-A are sent after the splits belonging to snapshot-B, rather snapshot-A is submitted before snapshot-B.
   
   **Description**:
   If the Flink job failover strategy is set as 'region',  in some scenarios, only the failed subtask will do a failover, and the 'SourceCoordinator' will not reset to the latest completed checkpoint, the coordinator will only add the splits belonging to the failed subtask back. However, in the beginning, the enumerator will register a split-discovering logic at a fixed rate in [workerExecutor](https://github.com/apache/flink/blob/6e4e6c68a4a179d932dacafb2771cc84f730bbc0/flink-runtime/src/main/java/org/apache/flink/runtime/source/coordinator/SourceCoordinatorContext.java#L307).  
   In the process of recovery, `addSplitsBack` will be invoked in 'coordinatorExecutor' to send splits back to Enumerator, at the same time, split-discovering logic will be executed in 'workerExecutor' concurrently. 
   Before the `coordinatorExecutor` invokes `addSplitsBack` to add splits belonging to snapshot-A to the queue,  the 'workerExecutor' triggers discovering splits belonging to snapshot-B will cause the out-of-order problem.
   
   **Here is the log being simplified**
   ```
    2023-01-16 10:53:48.255 INFO  org.apache.flink.runtime.source.coordinator.SourceCoordinator - Marking checkpoint 1831 as completed for source Source: TableSourceScan(table=XXX.)
   2023-01-16 10:54:55.274 INFO  org.apache.iceberg.flink.source.enumerator.ContinuousSplitPlannerImpl - Discovered 24 splits from incremental scan: from snapshot (exclusive) is IcebergEnumeratorPosition{snapshotId=225303270571799332, snapshotTimestampMs=1673837603654}, to snapshot (inclusive) is IcebergEnumeratorPosition{snapshotId=9165752853229073849, snapshotTimestampMs=1673837689612} for table XXX
   ... asign splits to task
   
   2023-01-16 10:56:05.274 INFO  org.apache.iceberg.flink.source.enumerator.ContinuousSplitPlannerImpl - Discovered 24 splits from incremental scan: from snapshot (exclusive) is IcebergEnumeratorPosition{snapshotId=9165752853229073849, snapshotTimestampMs=1673837689612}, to snapshot (inclusive) is IcebergEnumeratorPosition{snapshotId=4821326740215356149, snapshotTimestampMs=1673837760375} for table XXX
   ... asign splits to task
   
   2023-01-16 10:56:45.243 INFO  org.apache.iceberg.flink.source.enumerator.ContinuousSplitPlannerImpl - Discovered 16 splits from incremental scan: from snapshot (exclusive) is IcebergEnumeratorPosition{snapshotId=4821326740215356149, snapshotTimestampMs=1673837760375}, to snapshot (inclusive) is IcebergEnumeratorPosition{snapshotId=9128554009266494680, snapshotTimestampMs=1673837801766} for table XXX
   ... asign splits to task
   
   2023-01-16 10:56:48.262 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    - Triggering checkpoint 1832 (type=CHECKPOINT) @ 1673837808255 for job 59edfbcfae9e87eff5d3faebcfbd74f5.
   2023-01-16 10:57:16.210 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    - Decline checkpoint 1832 by task 002d6bde4042610c0b701181eb6e84a1 of job 59edfbcfae9e87eff5d3faebcfbd74f5 at container_e1379_1671091519846_1892587_01_000002 @ xxxxx (dataPort=39621) : org.apache.flink.runtime.checkpoint.CheckpointException: Task name with subtask : Source: TableSourceScan(table=XXX)
   
   executing task failover.
   
   2023-01-16 10:57:45.301 INFO  org.apache.iceberg.flink.source.enumerator.ContinuousSplitPlannerImpl - Discovered 16 splits from incremental scan: from snapshot (exclusive) is IcebergEnumeratorPosition{snapshotId=9128554009266494680, snapshotTimestampMs=1673837801766}, to snapshot (inclusive) is IcebergEnumeratorPosition{snapshotId=8771475830417620708, snapshotTimestampMs=1673837854298} for table XXX
   2023-01-16 10:58:19.509 INFO  org.apache.iceberg.flink.source.enumerator.AbstractIcebergEnumerator - Add 64 splits back to the pool for failed subtask 0
   
   
   2023-01-16 10:58:19.884 INFO  org.apache.iceberg.flink.source.enumerator.AbstractIcebergEnumerator - Assign split to subtask 0: IcebergSourceSplit{tasks=IcebergSourceSplit{commit_snapshot_id=8771475830417620708, file=XXX}
   ... asign splits to task
   
   2023-01-16 10:58:21.071 INFO  org.apache.iceberg.flink.source.enumerator.AbstractIcebergEnumerator - Assign split to subtask 0: IcebergSourceSplit{tasks=IcebergSourceSplit{commit_snapshot_id=9165752853229073849, file=XXX}
   ... asign splits to task
   
   2023-01-16 10:58:23.775 INFO  org.apache.iceberg.flink.source.enumerator.AbstractIcebergEnumerator - Assign split to subtask 0: IcebergSourceSplit{tasks=IcebergSourceSplit{commit_snapshot_id=4821326740215356149, file=XXX}
   ... asign splits to task
   
   2023-01-16 10:58:25.245 INFO  org.apache.iceberg.flink.source.enumerator.AbstractIcebergEnumerator - Assign split to subtask 0: IcebergSourceSplit{tasks=IcebergSourceSplit{commit_snapshot_id=9128554009266494680, file=XXX}
   ... asign splits to task
   
   
   2023-01-16 10:59:05.234 INFO  org.apache.iceberg.flink.source.enumerator.ContinuousSplitPlannerImpl - Discovered 24 splits from incremental scan: from snapshot (exclusive) is IcebergEnumeratorPosition{snapshotId=8771475830417620708, snapshotTimestampMs=1673837854298}, to snapshot (inclusive) is IcebergEnumeratorPosition{snapshotId=5467477100114240656, snapshotTimestampMs=1673837937738} for table XXX
   
   ```
   
   **Here are the snapshot_id and the commit_time**
   ![image](https://user-images.githubusercontent.com/32817398/216937277-fcce98ef-79b2-4979-8e0f-7b980018997f.png)
   
   
   **repair suggestion**
   Maybe we should fix it from both the Flink side and the Iceberg side.
   - From the Flink side, block the workExecutor(splits-discovering logic) if subtask is failed, util all failed subtask's splits had been added back.
   - If `addSplitsBack` is invoked, we add the splits to the head of the queue.
   
   Before posting it to the Flink community, I want to share it here first, maybe someone has a more elegant plan to fix it, any comments are welcome, and looking forward to your feedback!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zoucao commented on issue #6756: The out-of-order problem occurs around the process of recovery

Posted by "zoucao (via GitHub)" <gi...@apache.org>.
zoucao commented on issue #6756:
URL: https://github.com/apache/iceberg/issues/6756#issuecomment-1420050929

   gentle ping @stevenzwu, what do you think about it  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] The out-of-order problem occurs around the process of recovery [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6756:
URL: https://github.com/apache/iceberg/issues/6756#issuecomment-1890167270

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] github-actions[bot] commented on issue #6756: The out-of-order problem occurs around the process of recovery

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6756:
URL: https://github.com/apache/iceberg/issues/6756#issuecomment-1667015175

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] The out-of-order problem occurs around the process of recovery [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #6756: The out-of-order problem occurs around the process of recovery
URL: https://github.com/apache/iceberg/issues/6756


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org