You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/08/07 11:43:16 UTC

[GitHub] [incubator-iceberg] aokolnychyi edited a comment on issue #351: Provide an API to modify records within files

aokolnychyi edited a comment on issue #351: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-519054168
 
 
   I see the benefit of reusing `OverwriteFiles`, so let's think it through.
   
   We have the following methods in `OverwriteFiles`:
   ```
   OverwriteFiles overwriteByRowFilter(Expression expr);
   OverwriteFiles addFile(DataFile file);
   OverwriteFiles validateAddedFiles();
   ```
   
   First of all, we need to track the base snapshot id when the data was read and it should not change during retries. That's the reason why we accept the base snapshot id in the constructor. We collect all added files since that snapshot. So, we will probably need to do the same for `OverwriteFiles`.
   
   Second, we cannot mark deleted files using `overwriteByRowFilter` as that would require all rows to match the predicate and it would also require proper column stats when we have predicates on non-partition columns. So, it seems we will need some sort of `overwriteFiles(deletedFiles, addedFiles)` or `deleteFile`.
   
   Third, we cannot use the expression passed to `overwriteByRowFilter` for conflict resolution because of the aforementioned differences. Also, in certain cases, we cannot convert all filters in query engines into equivalent Iceberg filters. Internally, we prohibit concurrent appends in such cases and pass a set of files to delete. It seems we need a separate expression for validating concurrent appends. So, `allowConflictingAppends` or `validateNoConflictingAppends` will probably need to accept a row filter as well. By default, we can set that expression to either true or false depending on what behaviour we want.
   
   Fourth, we already have `validateAddedFiles` that ensures that all files added by `OverwriteFiles` match the predicate passed in `overwriteByRowFilter`. Calling `validateAddedFiles` will only make sense if we use `overwriteByRowFilter`, so we need to document that properly to avoid misuse of the API. Then `validateAddedFiles` will operate on files passed to `OverwriteFiles` and `allowConflictingAppends` or `validateNoConflictingAppends` will operate on all files added since the base snapshot.
   
   @rdblue, how do you see the merged API?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org