Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/12 13:49:49 UTC

[GitHub] [iceberg] Reo-LEI opened a new issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Reo-LEI opened a new issue #3102:
URL: https://github.com/apache/iceberg/issues/3102


   Currently, `IcebergFilesCommitter` validates the entire snapshot history every time it commits a new snapshot in `commitDeltaTxn`. This means the same snapshots are verified many times, and a lot of time is spent reading manifest files. This is why `IcebergFilesCommitter` needs to open multiple Avro metadata files and takes several minutes, as reported
    in https://github.com/apache/iceberg/issues/2900#issuecomment-895244837. (In more detail: Flink calls `notifyCheckpointComplete(ckptId)` immediately after calling `snapshotState(ckptId)`, and the committer traverses the entire snapshot history
   to verify that the data files referenced by pos-delete files still exist. This blocks the committer thread and makes `snapshotState(ckptId+1)` time out if HDFS responds slowly or the table has too many manifest files to traverse.)
   
   I think `IcebergFilesCommitter` doesn't need to validate the entire snapshot history on every commit; it only needs to validate the snapshots between the last committed snapshot id and the current snapshot id. For the first commit of an `IcebergFilesCommitter`, we still need to traverse the entire snapshot history to ensure the referenced data files still exist.
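
A minimal sketch of the proposed range, written as a stand-alone model rather than Iceberg code. The class `SnapshotRangeSketch`, the method `snapshotsToValidate`, the constant `INITIAL_SNAPSHOT_ID`, and the `parentOf` map are all hypothetical names chosen for illustration; snapshots are reduced to plain ids with a parent pointer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model, not Iceberg code: snapshots are plain ids
// and parentOf maps each snapshot id to its parent (no entry for the root).
class SnapshotRangeSketch {
    // Assumed sentinel meaning "this committer has not committed anything yet".
    static final long INITIAL_SNAPSHOT_ID = -1L;

    // Walks back from currentId and collects the snapshots a commit would
    // have to check, stopping before lastCommittedId. With the sentinel the
    // walk never stops early, so the whole history is returned.
    static List<Long> snapshotsToValidate(
            Map<Long, Long> parentOf, long currentId, long lastCommittedId) {
        List<Long> toValidate = new ArrayList<>();
        Long id = currentId;
        while (id != null && id != lastCommittedId) {
            toValidate.add(id);
            id = parentOf.get(id); // null once the root has been visited
        }
        return toValidate;
    }
}
```

With a history of 1 <- 2 <- 3 <- 4, a committer whose last commit produced snapshot 2 would only check snapshots 4 and 3, while a committer still holding the sentinel would check all four.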


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Reo-LEI closed issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
Reo-LEI closed issue #3102:
URL: https://github.com/apache/iceberg/issues/3102


   




[GitHub] [iceberg] Reo-LEI edited a comment on issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
Reo-LEI edited a comment on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-917645574


   Actually, in our production environment, we use Flink to upsert into an Iceberg table and commit every 5 minutes, but the Flink job frequently fails with a checkpoint timeout (15 min).
   ![WeCom screenshot_16306356796267](https://user-images.githubusercontent.com/17312872/132991017-87a9f189-1eb1-465f-bbab-0f2b06688a29.png)
   
   I found that `IcebergFilesCommitter` takes too much time traversing the entire snapshot history, so I made the data file validation start from `lastCommittedSnapshotId` (as in #3103). The commit time has now dropped from a few minutes to a few seconds.
   ![WeCom screenshot_1631255018986](https://user-images.githubusercontent.com/17312872/132991226-6059967a-2973-4008-aa29-8654eff3009e.png)
   




[GitHub] [iceberg] Reo-LEI commented on issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
Reo-LEI commented on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-919073751


   > But if some of the files owned by the old snapshot are deleted by mistake.
   
   @coolderli Thanks for your attention, but I don't think we need to worry about this case, because `IcebergFilesCommitter` only validates the data files referenced by the uncommitted pos-delete files, and a pos-delete file only references data files from the same transaction. That means the referenced data files are not owned by any other snapshot; they can only be the uncommitted data files.
   
   So I don't think your case will happen, but we can go a step further and discuss all the possible situations. Assume the table already has historical snapshots.
   
   **Case-1:**  The Flink job **starts for the first time** and has **no** uncommitted data.
   `IcebergFilesCommitter` does nothing and `lastCommittedSnapshotId` is set to its initial value (-1).
   
   **Case-2:**  The Flink job is **restored** from a checkpoint and has **no** uncommitted data.
   `IcebergFilesCommitter` does nothing and `lastCommittedSnapshotId` is set to its initial value (-1).
   
   **Case-3:**  The Flink job is **restored** from a checkpoint and **has** uncommitted data.
   `IcebergFilesCommitter` commits all uncommitted data. First, `lastCommittedSnapshotId` is set to its initial value (-1); then the committer validates all data files referenced by the uncommitted pos-delete files, from the current snapshot back to `lastCommittedSnapshotId`. Because `lastCommittedSnapshotId` is -1, the committer traverses the entire snapshot history to ensure the data files still exist and that the whole snapshot history is valid. After that, `lastCommittedSnapshotId` is updated to the committed snapshot id.
   
   **Case-4:**  The Flink job keeps **running** and has **no** uncommitted data.
   `IcebergFilesCommitter` does nothing and `lastCommittedSnapshotId` keeps its value.
   
   **Case-5:**  The Flink job keeps **running** and **has** uncommitted data.
   `IcebergFilesCommitter` commits all uncommitted data and validates all referenced data files from the current snapshot back to `lastCommittedSnapshotId`, then updates `lastCommittedSnapshotId` to the committed snapshot id. This ensures that all referenced data files exist and were not deleted between the two commits.
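
The five cases above can be sketched as a small state holder. This is an illustrative model under stated assumptions, not the actual `IcebergFilesCommitter`: the class `CommitterStateSketch` and the method `commitIfNeeded` are hypothetical, and a restart or restore is modelled simply as creating a new instance, which resets `lastCommittedSnapshotId` to -1.

```java
// Hypothetical model of how lastCommittedSnapshotId evolves; not Iceberg code.
class CommitterStateSketch {
    // Initial value described in the cases above.
    static final long INITIAL_SNAPSHOT_ID = -1L;

    // A fresh instance stands for a first start or a restore (cases 1 to 3).
    private long lastCommittedSnapshotId = INITIAL_SNAPSHOT_ID;

    // Returns the snapshot id that validation walks back to, or null when
    // there is nothing to commit. newSnapshotId is the id the commit would
    // produce and is ignored when hasUncommittedData is false.
    Long commitIfNeeded(boolean hasUncommittedData, long newSnapshotId) {
        if (!hasUncommittedData) {
            return null; // cases 1, 2 and 4: do nothing, keep the current value
        }
        // Case 3 sees -1 here (full history); case 5 sees the previous commit.
        long validateBackTo = lastCommittedSnapshotId;
        lastCommittedSnapshotId = newSnapshotId;
        return validateBackTo;
    }

    long lastCommittedSnapshotId() {
        return lastCommittedSnapshotId;
    }
}
```

In this model only the first commit after a start or restore pays for a full-history walk; every later commit validates back to the snapshot produced by the previous one.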




[GitHub] [iceberg] Reo-LEI commented on issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
Reo-LEI commented on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-946719494


   Closing in favor of #3258.




[GitHub] [iceberg] Reo-LEI commented on issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
Reo-LEI commented on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-917645574


   Actually, in our production environment, we use Flink to upsert into an Iceberg table and commit every 5 minutes, but the Flink job frequently fails with a checkpoint timeout (15 min).
   ![WeCom screenshot_16306356796267](https://user-images.githubusercontent.com/17312872/132991017-87a9f189-1eb1-465f-bbab-0f2b06688a29.png)
   
   I found that `IcebergFilesCommitter` takes too much time traversing the entire snapshot history, so I made the data file validation start from `lastCommittedSnapshotId` (as in #3101). The commit time has now dropped from a few minutes to a few seconds.
   ![WeCom screenshot_1631255018986](https://user-images.githubusercontent.com/17312872/132991226-6059967a-2973-4008-aa29-8654eff3009e.png)
   




[GitHub] [iceberg] coolderli commented on issue #3102: Reduce flink IcebergFilesCommitter validate snapshot history when commit table

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #3102:
URL: https://github.com/apache/iceberg/issues/3102#issuecomment-917794140


   I think this is useful. But if some of the files owned by the old snapshot are deleted by mistake, how could we detect the problem? Perhaps only when we query the data and hit a `FileNotFoundException`, at which point we would have to roll back to an earlier point in time.

