Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/29 08:09:32 UTC

[GitHub] [iceberg] ConeyLiu opened a new pull request #2890: Fixes RemoveOrphanFiles delete files unexpected

ConeyLiu opened a new pull request #2890:
URL: https://github.com/apache/iceberg/pull/2890


   `RemoveOrphanFiles` uses `actualFileDF leftanti join validFileDF` to determine which files should be removed. It lists all the files under the provided location (or the table location) with `FileSystem.listStatus` to build `actualFileDF`. `validFileDF` is built by indexing the metadata files and the files they reference.
   
   However, `FileSystem.listStatus` qualifies the given path. For example, the path `hdfs:/path` is qualified to `hdfs://host:port/path`. If the `warehouse` is set to `hdfs:/path`, the two DataFrames contain:
   
   `validFileDF`:
       hdfs:/path/file1
       hdfs:/path/file2
       hdfs:/path/file3
       ....
   
   `actualFileDF`:
       hdfs://host:port/path/file1
       hdfs://host:port/path/file2
       hdfs://host:port/path/file3
       ....
   
   Because none of these strings match, every file in `actualFileDF` is treated as an orphan and deleted.
   
   In this patch, we compare only the bare path (with the scheme and authority removed) when doing the `leftanti join`.
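
A minimal sketch of the comparison described above, not the code in this pull request. It assumes `java.net.URI` as a stand-in for Hadoop's `Path`, and plain collections as a stand-in for the Spark `leftanti join`. The class `OrphanPathSketch` and its methods are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not taken from the Iceberg code base.
final class OrphanPathSketch {

  // Return only the path component, dropping the scheme and authority.
  // "hdfs:/path/file1" and "hdfs://host:8020/path/file1" both become "/path/file1".
  // Assumption: the location is a syntactically valid URI (no unescaped spaces).
  static String stripSchemeAndAuthority(String location) {
    return URI.create(location).getPath();
  }

  // Collections-based stand-in for the left anti join: keep every listed file
  // whose normalized path does not appear among the normalized valid paths.
  static List<String> findOrphans(List<String> actualFiles, List<String> validFiles) {
    Set<String> validPaths = new HashSet<>();
    for (String valid : validFiles) {
      validPaths.add(stripSchemeAndAuthority(valid));
    }
    List<String> orphans = new ArrayList<>();
    for (String actual : actualFiles) {
      if (!validPaths.contains(stripSchemeAndAuthority(actual))) {
        orphans.add(actual);
      }
    }
    return orphans;
  }
}
```

With valid files `hdfs:/path/file1` and `hdfs:/path/file2`, and listed files under `hdfs://host:8020/path/`, only a listed file with no counterpart in the valid set is reported. Comparing the raw strings instead would report all of them.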
   
   Updated the existing unit tests to cover this case.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2890: Fixes RemoveOrphanFiles delete files unexpected

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2890:
URL: https://github.com/apache/iceberg/pull/2890#issuecomment-891355561


   There have been multiple discussions around this. I'll try to fetch the old thread on Slack later today.




[GitHub] [iceberg] ConeyLiu commented on pull request #2890: Fixes RemoveOrphanFiles delete files unexpected

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #2890:
URL: https://github.com/apache/iceberg/pull/2890#issuecomment-918940938


   Gentle ping @rdblue @aokolnychyi, could you help review this? Thanks a lot.




[GitHub] [iceberg] ConeyLiu commented on pull request #2890: Fixes RemoveOrphanFiles delete files unexpected

Posted by GitBox <gi...@apache.org>.
ConeyLiu commented on pull request #2890:
URL: https://github.com/apache/iceberg/pull/2890#issuecomment-897545328


   Hi @aokolnychyi, could you help to review this? Thanks a lot.

