You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/07 14:52:03 UTC

[GitHub] [iceberg] szehon-ho opened a new issue #2435: RepairManifestsAction

szehon-ho opened a new issue #2435:
URL: https://github.com/apache/iceberg/issues/2435


   A while ago @aokolnychyi had suggested a new action "RepairManifestsAction", and I wanted to discuss prior to starting to look at it.
   
   The initial idea is that it allows Iceberg table metadata to be cleanly repaired after a metadata bug, such as   #1980 (file size incorrect in manifest file)
   
   So one proposal is to have a RepairManifestsAction with the typical apis like:
   `actions.repairManifests().filter(Expression expr).caseSensitive(boolean value)`. //filter out data files that this action will read
   
   And have following options that make different levels of spark jobs:
   `withRepairFileSize(boolean value)`. //Reads file sizes from FileSystem, rewrites manifests
   `withRepairRowCount(boolean value)` //Opens the file and counts rows, rewrites manifests with the row count
   `withRepairMetrics(boolean value)` //Opens the file, and from the file metadata rewrites the manifest with latest metrics
   `withRemoveDeletedFiles(boolean value)` //if any datafile referenced by manifest is removed, remove it from manifest
   
   I was thinking that the last one may also be helpful, if a datafile becomes accidentally removed or corrupted in Iceberg table , an exception "java.io.FileNotFoundException: No such file or directory" is hit on a query to that table.  Maybe later having 'repair manifest-list or repair metadata' is also useful to handle cases if manifest files themselves are removed..
   
   This is a bit different than 'RewriteManifests' as that just rewrites manifests but does not fix them by reading data-files.  The only way to solve an issue like #1980 today would probably be just rewriting all data files, which is a bit expensive.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815014836


   Thanks Russell for looking, yea I have not yet looked at the Actions API, that sounds great then.  Yea I agree, maybe rowcount can be just treated along with metrics as both require opening the file anyway, something like 'fullMode(boolean)'.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] KarlManong edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
KarlManong edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-974747373


   > Can this action fix the problem which accidental deleted partition data files?
   
   我们有一张分区表,存放在hdfs上,里面有一个分区的数据被hdfs client删除了。这个表只要访问了关联到错误分区的数据就直接抛出FileNotFoundException,而且还不能通过delete from table where partition = error_part的方式删除。目前只能通过手工修正meta的方式修复。
   
   ------
   
   We have a partition table, which is stored on hdfs, and the data of a partition is deleted by hdfs client. As long as this table accesses the data associated with the error partition, `FileNotFoundException` will be thrown directly, and it cannot be deleted by sql `deleting from table where partition = error_part`. Currently, it can only be repaired by manually correcting the meta.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815413243


   https://github.com/apache/iceberg/blob/27eaa796530f56eb99d111e6110be3e1a6eb99e7/api/src/main/java/org/apache/iceberg/actions/Action.java#L30 
   
   This is the new api so it has a built in "option" and "options" that you can use for seldom changed or rarely changed things. I think in this case most of the time users will want to repair metrics and files-sizes (if they are broken) and deleted files in extreme cases to try to save a corrupted table.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815413695


   Another thing, I wonder if this fits into "rewrite manifests" as an action? I should think that through a bit more but maybe this works as an option for rewriting manfiests since that action already will be rewriting this information. Currently it just does so blindly though without checking if the underlying files are still there or if the metadata is correct ... I believe :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-823421527


   @aokolnychyi @karuppayya  as we discussed a little offline, also if you guys have any more thoughts (from our sync, the idea is to not focus necessarily first on filters, but just implement the logic for a full table scan and then iterate).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-825742484


   Hi guys, looking through the code there doesn't seem any good way other than making a new kind of FileScan, that not only has Files but also the ManifestFile it came from (so that during the spark job, it can identify the manifestFile as updated).
   
   If you guys have time to see, is it in line with your thoughts and if there are any clean way to do it?  Putting it in BaseFileScanTask is acceptable? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-823421527


   @aokolnychyi @karuppayya  as we discussed a little offline, also if you guys have any more thoughts (from our sync, the idea is to not focus necessarily first on filters, but just implement the logic for a table scan).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-814998102


   The last one there "removeDeletedFiles" is the most dangerous imho, but I think this is a pretty good setup. 
   
   Although with the new Action API it probably just makes sense to have all the "withX" as "options" defaulted to true except for removeDeletedFiles. I'm also not sure if fixing rowcount and metrics seperately really needs to be a thing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] KarlManong edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
KarlManong edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-974747373


   > Can this action fix the problem which accidental deleted partition data files?
   
   我们有一张分区表,存放在hdfs上,里面有一个分区的数据被hdfs client删除了。这个表只要访问了关联到错误分区的数据就直接抛出FileNotFoundException,而且还不能通过delete from table where partition = error_part的方式删除。目前只能通过手工修正meta的方式修复。
   
   ------
   
   We have a partition table, which is stored on hdfs, and the data of a partition is deleted by hdfs client. As long as this table accesses the data associated with the error partition, `FileNotFoundException` will be thrown directly, and it cannot be deleted by sql `deleting from table where partition = error_part`. Currently, it can only be repaired by manually correcting the meta.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815916918


   Yea I can see how it is similar.  Yea for sure the rewrite manifest does not do any reading of the data-files to check if they are write, so I guess these could be kind of a pre-processing step to make a correct manifest file before rewriting them.
   
   There might be some confusing configs if we put the two together, for instance RewriteManifests has predicates<ManifestFile>  whereas RepairManifests as I see would have a more traditional predicate (expression), unless it makes sense to have it be a ManifestFile predicate (but then it's more complex going from manifest->data_file->manifest).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815014836


   Thanks Russell for looking, yea I have not yet looked at the new Actions API, that sounds great then.  Yea I agree, maybe rowcount can be just treated along with metrics as both require opening the file anyway, something like 'fullMode(boolean)'.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] KarlManong commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
KarlManong commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-974747373


   > Can this action fix the problem which accidental deleted partition data files?
   
   我们有一张分区表,存放在hdfs上,里面有一个分区的数据被hdfs client删除了。这个表只要访问了关联到错误分区的数据就直接抛出FileNotFoundException,而且还不能通过delete from table where partition = error_part的方式删除。目前只能通过手工修正meta的方式修复。
   ------
   We have a partition table, which is stored on hdfs, and the data of a partition is deleted by hdfs client. As long as this table accesses the data associated with the error partition, `FileNotFoundException` will be thrown directly, and it cannot be deleted by sql `deleting from table where partition = error_part`. Currently, it can only be repaired by manually correcting the meta.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-1023664937


   +1, I am curious to know if there's any existing way to reconcile the manifest based on current status quo of the underlying file system. 
   
   Example scenario we ran into is that, the parquet data file is accidentally deleted outside the knowledge of the iceberg but manifest file still keep track of it. Is there any viable way to repair the given manifest? Any suggestion would be helpful


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-974041417


   Can you elaborate a bit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] KarlManong commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
KarlManong commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-974747373


   > Can this action fix the problem which accidental deleted partition data files?
   
   我们有一张分区表,存放在hdfs上,里面有一个分区的数据被hdfs client删除了。这个表只要访问了关联到错误分区的数据就直接抛出FileNotFoundException,而且还不能通过delete from table where partition = error_part的方式删除。目前只能通过手工修正meta的方式修复。
   ------
   We have a partition table, which is stored on hdfs, and the data of a partition is deleted by hdfs client. As long as this table accesses the data associated with the error partition, `FileNotFoundException` will be thrown directly, and it cannot be deleted by sql `deleting from table where partition = error_part`. Currently, it can only be repaired by manually correcting the meta.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-825745780


   @szehon-ho I'm not sure I understand what you are asking?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-826669837


   Ah sorry I was not clear, I was looking through the code brainstorming about the implementation and the problem was this:
   
   -Currently ManifestGroup::planFiles() returns iterable of FileScanTask (actual class BaseFileScanTask)
   -But FileScanTask has only list of DataFiles.
   
   So if for RepairManifest, we launch a Spark job on FileScanTasks like for normal scans, we will lose track of the ManifestFile that each DataFile belonged to, which we need after we collect the result to repair the relevant ManifestFIle.  So my idea is either:
   
   - Add ManifestFile as a member of BaseFileScanTask (it's available in ManifestGroup::planFiles but just not added).  This would increase the memory usage a bit for general scan planning.
   - Have a new specialized type of BaseFileScanTask that has this ManifestFile field.  But then it's a bit awkward to have it returned only by ManifestGroup::planFiles only for RepairManifestAction (flag based, or separate method).  Thanks
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-826858440


   Ah yeah, I think we usually do new custom Scan's, See
   https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestEntriesTable.java#L96-L140
   For example. I would hesitate to touch anything around "planFiles" for this and would instead op for reading the manifest files directly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] KarlManong commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
KarlManong commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-973665059


   Can this action fix the problem which accidental deleted partition data files?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-815916918


   Yea I can see how it is similar.  Yea for sure the rewrite manifest does not do any reading of the data-files to check if they are write, so I guess these could be kind of a pre-processing step to make a correct manifest file before rewriting them.
   
   There might be some confusing configs if we put the two together, for instance RewriteManifests has predicates <ManifestFile>  whereas RepairManifests as I see would have a more traditional predicate, unless it makes sense to have it be a ManifestFile predicate (but then it's more complex going from manifest->data_file->manifest).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho edited a comment on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho edited a comment on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-825742484


   Hi guys, looking through the code there doesn't seem any good way other than making a new kind of FileScanTask, that not only has Files but also the ManifestFile it came from (so that during the spark job, it can identify the manifestFile as updated).
   
   If you guys have time to see, is it in line with your thoughts and if there are any clean way to do it?  Putting it in BaseFileScanTask is acceptable? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #2435: RepairManifestsAction

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #2435:
URL: https://github.com/apache/iceberg/issues/2435#issuecomment-1023792640


   Hi guys, in my experience running 'delete from table where part=error_part' works, can you give a try?  I'm actually not sure why it says fail for @KarlManong , if its using the iceberg catalog from spark-sql then that query should be removing only metadata and not try to access the missing data file.
   
   Of course this change can fix it but its still waiting review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org