Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2022/10/07 06:10:11 UTC

[GitHub] [gobblin] phet commented on pull request #3575: Enhance `IcebergDataset` to detect when files already at dest then proceed with only delta

phet commented on PR #3575:
URL: https://github.com/apache/gobblin/pull/3575#issuecomment-1271153373

   > it seems that as long as one new snapshot is generated on source, we will go through all the data files available on source to do the copy, even if only one new file was added.
   
   actually we'll always go through the complete metadata on source to list every file reachable from at least one snapshot.  it's true we do that even when there isn't any 'new', unreplicated snapshot.  actual copying, however, only happens for files not already present on the destination.  further, we need not examine every file to determine whether it exists on dest.  rather, thanks to the immutability of iceberg files, we may short-circuit evaluation of an entire subtree of the iceberg metadata when its root (e.g. a manifest-list or manifest) is found to already exist at dest.  for details, see the comment I've added, beginning `// ALGO:`, in `IcebergDataset.getFilePathsToFileStatus()`
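
to illustrate the short-circuit idea described above, here is a minimal sketch. it is not the code in `IcebergDataset.getFilePathsToFileStatus()`: the `Node` type, the `pathsToCopy` method, and the file names are all hypothetical, and the destination is modeled as a plain set of paths. it also assumes that whenever a manifest-list or manifest is present at dest, everything it references was copied there too.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class ShortCircuitSketch {
  /** Hypothetical stand-in for one iceberg file plus the files it references. */
  static final class Node {
    final String path;
    final List<Node> children;

    Node(String path, Node... children) {
      this.path = path;
      this.children = List.of(children);
    }
  }

  /** Returns the paths that still need copying, in visit order, without duplicates. */
  static List<String> pathsToCopy(Node root, Predicate<String> existsAtDest) {
    Set<String> toCopy = new LinkedHashSet<>();
    collect(root, existsAtDest, toCopy);
    return new ArrayList<>(toCopy);
  }

  private static void collect(Node node, Predicate<String> existsAtDest, Set<String> toCopy) {
    if (existsAtDest.test(node.path)) {
      // Files are immutable, so (by assumption) the whole subtree is already at dest.
      return;
    }
    toCopy.add(node.path);
    for (Node child : node.children) {
      collect(child, existsAtDest, toCopy);
    }
  }

  /** Made-up table: snapshot 2 reuses manifest m1 and adds manifest m2. */
  static Node exampleTree() {
    Node m1 = new Node("m1.avro", new Node("d1.parquet"), new Node("d2.parquet"));
    Node m2 = new Node("m2.avro", new Node("d3.parquet"));
    return new Node("v2.metadata.json",
        new Node("snap-1.avro", m1),
        new Node("snap-2.avro", m1, m2));
  }

  public static void main(String[] args) {
    // Destination state left behind by an earlier replication of snapshot 1.
    Set<String> dest = Set.of("snap-1.avro", "m1.avro", "d1.parquet", "d2.parquet");
    System.out.println(pathsToCopy(exampleTree(), dest::contains));
    // prints [v2.metadata.json, snap-2.avro, m2.avro, d3.parquet]
  }
}
```

in this example the existence check runs on only five paths: `d1.parquet` and `d2.parquet` are never examined, because `snap-1.avro` and `m1.avro` are found at dest first.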
   
   > How do we plan to handle the file deletion on source? i.e. expire snapshot operation?
   
   good question!  the answer is: distcp is not responsible.  instead we expect reachability analysis and orphan file deletion to happen elsewhere.  a good candidate would be the destination catalog with which we'll eventually register the copied 'metadata.json' file.  e.g. that catalog would hold the prior metadata version until the registration, and so could easily determine which snapshots 'expire' as a result of replacing the older metadata file with the newer one (the one replication has copied from source)
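
as a rough sketch of what such a catalog-side check might look like (purely hypothetical: nothing like this exists in this PR, and the maps of snapshot id to reachable file paths stand in for whatever the catalog would actually read from the two metadata versions):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExpirySketch {
  /** Snapshot ids present in the prior metadata but absent from its replacement. */
  static Set<Long> expiredSnapshots(Set<Long> priorIds, Set<Long> replacementIds) {
    Set<Long> expired = new TreeSet<>(priorIds);
    expired.removeAll(replacementIds);
    return expired;
  }

  /** Files reachable from the prior metadata that no replacement snapshot still references. */
  static Set<String> orphanFiles(Map<Long, Set<String>> prior, Map<Long, Set<String>> replacement) {
    Set<String> stillReachable = new TreeSet<>();
    replacement.values().forEach(stillReachable::addAll);

    Set<String> orphans = new TreeSet<>();
    prior.values().forEach(orphans::addAll);
    orphans.removeAll(stillReachable);
    return orphans;
  }

  public static void main(String[] args) {
    // Made-up data: snapshot 1 was expired on source, snapshot 3 is new.
    Map<Long, Set<String>> prior = Map.of(1L, Set.of("a", "b"), 2L, Set.of("b", "c"));
    Map<Long, Set<String>> replacement = Map.of(2L, Set.of("b", "c"), 3L, Set.of("c", "d"));

    System.out.println(expiredSnapshots(prior.keySet(), replacement.keySet())); // prints [1]
    System.out.println(orphanFiles(prior, replacement)); // prints [a]
  }
}
```

note that file `b` is not an orphan even though the expired snapshot 1 references it, since snapshot 2 still reaches it. that is why the check has to be a reachability analysis rather than a per-snapshot delete.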


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.