Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/12 23:38:34 UTC

[GitHub] [iceberg] aokolnychyi opened a new issue #1598: Spark SQL Extensions: Rewrite manifests

aokolnychyi opened a new issue #1598:
URL: https://github.com/apache/iceberg/issues/1598


   ```
   CALL catalog.schema.rewrite_manifests(
     namespace => 'namespace_name', -- required
     table => 'table_name', -- required
     min_manifest_size => 0.5 * target manifest size, -- optional
     max_manifest_size => 1.5 * target manifest size, -- optional
     min_num_manifests_to_rewrite => 10, -- optional
     min_clustering_ratio => 0.75 -- optional
   )
   ```
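
To make the size bounds concrete, here is a minimal sketch that renders the proposed call with byte values derived from a target manifest size. The helper name `render_rewrite_manifests_call` and its keyword arguments are hypothetical and only mirror the parameters proposed above; this is not an existing Iceberg or Spark API, and the merged procedure may expose different arguments.

```python
def render_rewrite_manifests_call(
    namespace,
    table,
    target_manifest_size_bytes,
    min_size_factor=0.5,          # proposed default: 0.5 * target
    max_size_factor=1.5,          # proposed default: 1.5 * target
    min_num_manifests_to_rewrite=10,
    min_clustering_ratio=0.75,
    procedure="catalog.schema.rewrite_manifests",  # name taken from the proposal
):
    """Render the proposed CALL statement as a string (illustration only)."""

    def quote(value):
        # Escape single quotes so the literal stays well-formed.
        return "'" + str(value).replace("'", "''") + "'"

    min_size = int(min_size_factor * target_manifest_size_bytes)
    max_size = int(max_size_factor * target_manifest_size_bytes)
    return (
        f"CALL {procedure}(\n"
        f"  namespace => {quote(namespace)},\n"
        f"  table => {quote(table)},\n"
        f"  min_manifest_size => {min_size},\n"
        f"  max_manifest_size => {max_size},\n"
        f"  min_num_manifests_to_rewrite => {min_num_manifests_to_rewrite},\n"
        f"  min_clustering_ratio => {min_clustering_ratio}\n"
        f")"
    )


if __name__ == "__main__":
    # Assumes an 8 MB target manifest size read from table properties.
    print(render_rewrite_manifests_call("db", "events", 8 * 1024 * 1024))
```

With an 8 MB target, the two size bounds come out as 4194304 and 12582912 bytes.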
   
   The command can return the produced snapshot id, the number of deleted and added manifests, and the number of records we rewrote metadata for.
   
   It can work as follows:
   
   - Iterate through the list of manifests and find the manifest files whose size is not optimal. We have the target manifest size in table properties, and the stored procedure can accept allowed deviations (with some default value).
   - Analyze the clustering of metadata entries within the optimally sized manifests. We need to identify non-overlapping manifests and compute the total number of entries in them. Then we should compare that number to the total number of entries in all manifests. This ratio gives us an idea of how well our metadata is clustered. We can check whether manifests overlap based on min/max stats for partition columns.
   - If clustering is poor, we should rewrite all metadata. Rewriting all metadata is relatively cheap even for tables with millions of files if snapshot id inheritance is enabled.
   - If clustering is OK, we should look only at the manifests whose size is not optimal.
       - If the number of manifests that are too small or too big is larger than the threshold, we should rewrite those manifests.
       - If the number of manifests that are too small or too big is smaller than or equal to the threshold, nothing should be done as the clustering is OK and we don't have enough manifests to rewrite.
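
The steps above can be sketched as follows. This is only an illustration under stated assumptions, not the merged implementation: the `Manifest` record, the single comparable `lower`/`upper` partition bound, and the function names are all hypothetical, and overlap is checked only among the optimally sized manifests, which is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Manifest:
    # Hypothetical stand-in for manifest metadata; not an Iceberg class.
    path: str
    size_bytes: int
    num_entries: int
    lower: int  # min partition value covered (simplified to one comparable value)
    upper: int  # max partition value covered


def overlaps(a: Manifest, b: Manifest) -> bool:
    """Two manifests overlap if their [lower, upper] partition ranges intersect."""
    return a.lower <= b.upper and b.lower <= a.upper


def clustering_ratio(optimal: List[Manifest], total_entries: int) -> float:
    """Entries in non-overlapping optimal manifests divided by all entries."""
    if total_entries == 0:
        return 1.0  # nothing to cluster, treat as perfectly clustered
    # Pairwise check is O(n^2); fine for a sketch.
    clustered = sum(
        m.num_entries
        for m in optimal
        if not any(overlaps(m, other) for other in optimal if other is not m)
    )
    return clustered / total_entries


def plan_rewrite(
    manifests: List[Manifest],
    target_size_bytes: int,
    min_size_factor: float = 0.5,
    max_size_factor: float = 1.5,
    min_num_manifests_to_rewrite: int = 10,
    min_clustering_ratio: float = 0.75,
) -> List[Manifest]:
    """Return the manifests that should be rewritten (possibly none)."""
    lo = min_size_factor * target_size_bytes
    hi = max_size_factor * target_size_bytes
    optimal = [m for m in manifests if lo <= m.size_bytes <= hi]
    badly_sized = [m for m in manifests if not (lo <= m.size_bytes <= hi)]

    total_entries = sum(m.num_entries for m in manifests)
    if clustering_ratio(optimal, total_entries) < min_clustering_ratio:
        return list(manifests)  # poor clustering: rewrite all metadata
    if len(badly_sized) > min_num_manifests_to_rewrite:
        return badly_sized      # clustering OK: rewrite only badly sized manifests
    return []                   # clustering OK and too few candidates


if __name__ == "__main__":
    mb = 1024 * 1024
    good = [Manifest(f"m{i}", 8 * mb, 100, i * 10, i * 10 + 9) for i in range(4)]
    tiny = [Manifest(f"t{i}", 1 * mb, 10, 100 + i, 100 + i) for i in range(3)]
    plan = plan_rewrite(good + tiny, 8 * mb, min_num_manifests_to_rewrite=2)
    print([m.path for m in plan])
```

In the usage example, the four well-sized manifests cover disjoint partition ranges, so the ratio is 400 / 430 and only the three undersized manifests are selected once the threshold is lowered to 2.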
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on issue #1598: Spark SQL Extensions: Rewrite manifests

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1598:
URL: https://github.com/apache/iceberg/issues/1598#issuecomment-738697896


   We have merged a simple version. I'll probably keep this one open until we refine the approach.




[GitHub] [iceberg] aokolnychyi commented on issue #1598: Spark SQL Extensions: Rewrite manifests

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1598:
URL: https://github.com/apache/iceberg/issues/1598#issuecomment-738697896


   We have merged a simple version. I'll probably keep this one for a while until we refine the approach.




[GitHub] [iceberg] rdblue commented on issue #1598: Spark SQL Extensions: Rewrite manifests

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1598:
URL: https://github.com/apache/iceberg/issues/1598#issuecomment-905824545


   This was released in 0.11.0




[GitHub] [iceberg] aokolnychyi commented on issue #1598: Spark SQL Extensions: Rewrite manifests

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1598:
URL: https://github.com/apache/iceberg/issues/1598#issuecomment-707398116


   I have a prototype for this.




[GitHub] [iceberg] rdblue closed issue #1598: Spark SQL Extensions: Rewrite manifests

Posted by GitBox <gi...@apache.org>.
rdblue closed issue #1598:
URL: https://github.com/apache/iceberg/issues/1598


   

