Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/03 23:26:23 UTC

[GitHub] [iceberg] kbendick opened a new issue, #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

kbendick opened a new issue, #4689:
URL: https://github.com/apache/iceberg/issues/4689

   When writing to a table that is _not_ partitioned, partition summaries are still written if the table's `write.summary.partition-limit` property is set to a non-zero value.
   
   Here's a sample snapshot summary for an unpartitioned table that received one write. Notice that the key of the partition summary is just the bare prefix `partitions.`. This is the [changed partitions prefix](https://github.com/apache/iceberg/blob/3584c79022ec70f79b326550736b4600d249e4a2/core/src/main/java/org/apache/iceberg/SnapshotSummary.java#L54), which is usually followed by a field name and the value for that partition.
   
   ```
   "summary" : {
         "operation" : "append",
         "spark.app.id" : "local-1651499999454",
         "added-data-files" : "1",
         "added-records" : "1",
         "added-files-size" : "608",
         "changed-partition-count" : "1",
         "partition-summaries-included" : "true",
         "partitions." : "added-data-files=1,added-records=1,added-files-size=608",
         "total-records" : "1",
         "total-files-size" : "608",
         "total-data-files" : "1",
         "total-delete-files" : "0",
         "total-position-deletes" : "0",
         "total-equality-deletes" : "0"
       }
   ```
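
A minimal sketch of how a consumer of snapshot summaries might spot the entry shown above. The class and method names are hypothetical and not part of Iceberg; only the `partitions.` prefix is taken from the sample summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: scans a snapshot summary map for
// per-partition entries and reports those whose partition path is empty.
class PartitionSummaryKeys {
  // Prefix copied from the sample summary above.
  static final String CHANGED_PARTITION_PREFIX = "partitions.";

  // Returns the partition paths (the text after the prefix) found in the summary.
  static List<String> partitionPaths(Map<String, String> summary) {
    List<String> paths = new ArrayList<>();
    for (String key : summary.keySet()) {
      if (key.startsWith(CHANGED_PARTITION_PREFIX)) {
        paths.add(key.substring(CHANGED_PARTITION_PREFIX.length()));
      }
    }
    return paths;
  }

  // True when some entry has nothing after the prefix, as in the unpartitioned case.
  static boolean hasEmptyPartitionPath(Map<String, String> summary) {
    return partitionPaths(summary).contains("");
  }
}
```

Applied to the summary above, `hasEmptyPartitionPath` would flag the `"partitions."` key, while a key such as `partitions.cat=null` yields the non-empty path `cat=null`.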
   
   Also of note is that if a table was previously partitioned, and then the partition field is dropped, and data is then inserted afterwards, the partition summary can have a reference to the partition field for the value `null`.
   
   ```
   "summary" : {
         "operation" : "append",
         "spark.app.id" : "local-1651499999454",
         "added-data-files" : "2",
         "added-records" : "2",
         "added-files-size" : "1215",
         "changed-partition-count" : "1",
         "partition-summaries-included" : "true",
         "partitions.cat=null" : "added-data-files=2,added-records=2,added-files-size=1215",
   ...
   ```
   
   Is there some use case where we intentionally want to include partition summaries for the "global" partition of an unpartitioned table? The same data seems to be available in the rest of the summary, but maybe this was intentional.
   
   I'm almost certain that the second case (the `cat=null` entry) is not intentional, though maybe it has something to do with the void transform.
   
   cc @rdblue @RussellSpitzer @flyrain @aokolnychyi @szehon-ho @danielcweeks for visibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by GitBox <gi...@apache.org>.
flyrain commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1116774523

   I'm not aware of any use case that needs a partition summary when the table is not partitioned. This looks confusing to me. We could remove it.


[GitHub] [iceberg] github-actions[bot] commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1378062505

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.


[GitHub] [iceberg] kbendick commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1150430856

   I think I found the cause of this. `PartitionSpec.unpartitioned()` results in a spec of `{"spec-id":0,"fields":[]}`, so I think we need to ensure that a partition spec actually has fields, and is not the void spec, before writing partition summaries.
   
   I will investigate further.
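
The check proposed in this comment could look roughly like the sketch below. It uses a simplified stand-in for a partition spec rather than Iceberg's real `PartitionSpec` class; `SpecModel`, `SummaryGuard`, and the use of plain transform-name strings are assumptions made purely for illustration.

```java
import java.util.List;

// Simplified stand-in for a partition spec; NOT Iceberg's PartitionSpec class.
class SpecModel {
  final int specId;
  // One transform name per partition field, e.g. "identity" or "void" (assumed encoding).
  final List<String> transforms;

  SpecModel(int specId, List<String> transforms) {
    this.specId = specId;
    this.transforms = transforms;
  }

  // The spec has at least one field whose transform is not void.
  boolean isEffectivelyPartitioned() {
    for (String transform : transforms) {
      if (!"void".equals(transform)) {
        return true;
      }
    }
    return false;
  }
}

// Hypothetical guard: only write per-partition summaries for a partitioned spec.
class SummaryGuard {
  static boolean shouldWritePartitionSummaries(SpecModel spec) {
    return spec.isEffectivelyPartitioned();
  }
}
```

Under this model, a spec with an empty field list (the `{"spec-id":0,"fields":[]}` case) and a spec whose only field is void are both treated as unpartitioned.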


[GitHub] [iceberg] danielcweeks commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1116785119

   I think both cases are erroneous. Partition summaries should only be emitted if the current schema has partitions, and the summaries should reflect that schema.
   
   Even if the table previously had summaries, including summaries about prior partitioning strategies would be ambiguous. There are still edge cases with partition evolution, but the summaries are not a complete view of table changes.


[GitHub] [iceberg] github-actions[bot] commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by github-actions.
github-actions[bot] commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1402860482

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.


[GitHub] [iceberg] kbendick commented on issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4689:
URL: https://github.com/apache/iceberg/issues/4689#issuecomment-1157866674

   Following up on this:
   > Also of note is that if a table was previously partitioned, and then the partition field is dropped, and data is then inserted afterwards, the partition summary can have a reference to the partition field for the value null.
   
   This is an artifact of V1 tables using the void transform, which always produces `null` (`alwaysNull`), so this one is a little harder to deal with. I'm also not 100% sure it matters much, since it only affects V1 tables and anybody using the partition summaries is likely somewhat of a power user. This is at least well-defined behavior.
   
   Otherwise, I have a PR which removes these summaries https://github.com/apache/iceberg/pull/5009.
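
To illustrate how a void transform can lead to a key like `partitions.cat=null`, here is a small self-contained sketch. It does not use Iceberg's transform classes; `VoidTransformSketch`, `ALWAYS_NULL`, and `partitionPath` are hypothetical names, and the path format simply mirrors the `cat=null` key in the summary quoted in this thread.

```java
import java.util.function.Function;

// Illustration only: models a void transform as a function that ignores its input.
class VoidTransformSketch {
  // Stand-in for a void transform: every source value maps to null.
  static final Function<Object, Object> ALWAYS_NULL = value -> null;

  // Builds a "field=value" partition path segment from a transformed value.
  static String partitionPath(
      String fieldName, Function<Object, Object> transform, Object sourceValue) {
    // String concatenation renders a null result as the text "null".
    return fieldName + "=" + transform.apply(sourceValue);
  }
}
```

Whatever source value is passed in, `partitionPath("cat", ALWAYS_NULL, value)` produces the same `cat=null` segment, which is why all rows written after the field is dropped land in one partition summary.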


[GitHub] [iceberg] github-actions[bot] closed issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables

Posted by github-actions.
github-actions[bot] closed issue #4689: Partition Summaries Found in Snapshot Summaries for Unpartitioned Tables
URL: https://github.com/apache/iceberg/issues/4689

