Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/16 15:42:21 UTC

[GitHub] [iceberg] findepi opened a new issue, #6442: Extends Iceberg table stats API to allow publish data and stats atomically

findepi opened a new issue, #6442:
URL: https://github.com/apache/iceberg/issues/6442

   ### Feature Request / Improvement
   
   Currently `UpdateStatistics` (`org.apache.iceberg.Transaction#updateStatistics`) only allows adding statistics for an existing snapshot.
   As a result, it is currently not possible to publish a snapshot with statistics already collected.
   
   Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE),
   but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).
   
   It's not difficult to:
   
   - publish a data change snapshot (adding new files)
   - take note of the new snapshot ID
   - add statistics for that snapshot
   
   However, this has some drawbacks:
   
   - new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if the data changed significantly)
   - someone may run ANALYZE on the new snapshot (unknowingly or intentionally), and this will end up with two different threads wanting to add stats to it, which is wasted work
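
The three steps and the first drawback above can be sketched with a toy model. This is a minimal sketch and not the Iceberg API: `ToyTable`, `appendFiles`, `setStatistics`, and `currentStatistics` are hypothetical names for an in-memory stand-in, and the stats file name is made up. It only illustrates that, between the two commits, the current snapshot is visible without statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory stand-in for table metadata; this is NOT the Iceberg API.
final class ToyTable {
    private long currentSnapshotId = -1L; // -1 means the table has no snapshot yet
    private long nextSnapshotId = 1L;
    private final Map<Long, List<String>> filesBySnapshot = new HashMap<>();
    private final Map<Long, String> statsBySnapshot = new HashMap<>();

    // Steps 1 and 2: publish a data-change snapshot and hand back its ID.
    long appendFiles(List<String> files) {
        long id = nextSnapshotId++;
        filesBySnapshot.put(id, new ArrayList<>(files));
        currentSnapshotId = id;
        return id;
    }

    // Step 3: attach statistics to a snapshot that must already exist.
    void setStatistics(long snapshotId, String statsFile) {
        if (!filesBySnapshot.containsKey(snapshotId)) {
            throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
        }
        statsBySnapshot.put(snapshotId, statsFile);
    }

    // What a concurrent query planner would find for the current snapshot.
    Optional<String> currentStatistics() {
        return Optional.ofNullable(statsBySnapshot.get(currentSnapshotId));
    }
}

final class TwoCommitDemo {
    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        long snapshotId = table.appendFiles(Arrays.asList("A", "B", "C"));
        // Window between the commits: data is published, statistics are not.
        System.out.println("stats visible after append: " + table.currentStatistics().isPresent());
        table.setStatistics(snapshotId, "stats-for-snapshot-" + snapshotId);
        System.out.println("stats visible after update: " + table.currentStatistics().isPresent());
    }
}
```

Any planner that reads the table between `appendFiles` and `setStatistics` sees the new data with no statistics, which is the gap this issue asks to close.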
   
   
   We should make it possible to publish a data change together with new stats.
   This may require API changes.
   It may also require spec changes, if we want to use the "inherit snapshot ID" model.
   (Maybe we don't have to, since stats are in metadata?)
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] findepi commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1361245849

   Good idea!
   
   cc @rdblue @ajantha-bhat 




[GitHub] [iceberg] github-actions[bot] commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1597897672

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] ajantha-bhat commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1357221410

   > but some query engines (like Trino) can collect stats on the fly, when writing to a table (INSERT, CREATE TABLE AS ...).
   
   I think we have discussed this for partition stats too.
   @rdblue mentioned that we cannot have writers write stats on the fly (with INSERT, CTAS, UPDATE), because that requires bumping the Iceberg spec to V3: some writers would write stats and some would not, which can cause inconsistency.
   
   We agreed on using ANALYZE syntax or a procedure for generating stats until the V3 format is ready.




[GitHub] [iceberg] RussellSpitzer commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1357940593

   Hmm, that is probably not possible, but I guess that's where we should modify the API? We do know the snapshot ID before we actually do the commit, so we should be able to just fill it in.
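
One way to read "just fill it in" is an API where the caller supplies only the statistics file and the commit path attaches the snapshot ID it already chose. The following is a rough sketch under that assumption. `PendingAppend`, `withStatistics`, `StatsEntry`, and `PublishedSnapshot` are hypothetical names invented for illustration; nothing here is an existing or proposed Iceberg signature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration only; none of these are Iceberg classes.
final class StatsEntry {
    final long snapshotId;
    final String path;

    StatsEntry(long snapshotId, String path) {
        this.snapshotId = snapshotId;
        this.path = path;
    }
}

final class PublishedSnapshot {
    final long snapshotId;
    final List<String> files;
    final StatsEntry stats; // null when the writer supplied no statistics

    PublishedSnapshot(long snapshotId, List<String> files, StatsEntry stats) {
        this.snapshotId = snapshotId;
        this.files = files;
        this.stats = stats;
    }
}

final class PendingAppend {
    private final long snapshotId; // assigned when the update is created, before commit
    private final List<String> files;
    private String statsPath;

    PendingAppend(long snapshotId, List<String> files) {
        this.snapshotId = snapshotId;
        this.files = Collections.unmodifiableList(new ArrayList<>(files));
    }

    // The caller passes only the stats file; no snapshot ID is required here.
    PendingAppend withStatistics(String path) {
        this.statsPath = path;
        return this;
    }

    // Commit fills the pre-assigned snapshot ID into the statistics entry.
    PublishedSnapshot commit() {
        StatsEntry stats = statsPath == null ? null : new StatsEntry(snapshotId, statsPath);
        return new PublishedSnapshot(snapshotId, files, stats);
    }
}

final class FillInDemo {
    public static void main(String[] args) {
        PublishedSnapshot snapshot =
            new PendingAppend(42L, Arrays.asList("A", "B", "C")).withStatistics("stats-file").commit();
        System.out.println(snapshot.snapshotId == snapshot.stats.snapshotId);
    }
}
```

The design choice in this sketch is that the writer never handles the ID at all, so the statistics entry cannot point at a different snapshot than the one being published.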




[GitHub] [iceberg] findepi commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1355104260

   cc @rdblue 




[GitHub] [iceberg] RussellSpitzer commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1357748925

   @findepi shouldn't you be able to just change any write-only commit into a transaction that both appends the files and updates the statistics?
   
   Like
   ```
   AppendFiles(A, B, C)
   ```
   becomes
   ```
   Transaction Begin
     AppendFiles(A, B, C)
     Update Statistics (A, B, C)
   Transaction End
   Commit Transaction // Creates one Snapshot which both appends files and updates statistics
   ```
   
   Then it's up to the framework to build those transactions when required. This would be similar to the mergeSchema functions in Spark.
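
The pseudocode above can be modelled with a toy staged transaction. This is only a sketch: `CommittedState` and `StagedTransaction` are hypothetical, not Iceberg classes. It also assumes that the staged append can report the ID of the snapshot it will create, which is the open question in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical committed table state; a stand-in for table metadata.
final class CommittedState {
    long currentSnapshotId = -1L; // -1 means no snapshot has been published
    final Map<Long, List<String>> files = new HashMap<>();
    final Map<Long, String> stats = new HashMap<>();
}

// Hypothetical transaction that stages changes and publishes them together.
final class StagedTransaction {
    private final CommittedState state;
    private final long newSnapshotId;
    private List<String> stagedFiles;
    private String stagedStats;

    StagedTransaction(CommittedState state, long newSnapshotId) {
        this.state = state;
        this.newSnapshotId = newSnapshotId;
    }

    // Stage the append and report the ID the new snapshot will get (an assumption).
    long appendFiles(String... files) {
        stagedFiles = new ArrayList<>(Arrays.asList(files));
        return newSnapshotId;
    }

    // Stage statistics; they may only refer to the snapshot this transaction creates.
    void updateStatistics(long snapshotId, String statsFile) {
        if (snapshotId != newSnapshotId) {
            throw new IllegalArgumentException("Statistics must target snapshot " + newSnapshotId);
        }
        stagedStats = statsFile;
    }

    // Publish files and statistics, then flip the current snapshot pointer last.
    void commitTransaction() {
        if (stagedFiles == null) {
            throw new IllegalStateException("No append was staged");
        }
        state.files.put(newSnapshotId, stagedFiles);
        if (stagedStats != null) {
            state.stats.put(newSnapshotId, stagedStats);
        }
        state.currentSnapshotId = newSnapshotId; // single publication point
    }
}

final class StagedTransactionDemo {
    public static void main(String[] args) {
        CommittedState state = new CommittedState();
        StagedTransaction tx = new StagedTransaction(state, 7L);
        long id = tx.appendFiles("A", "B", "C");
        tx.updateStatistics(id, "stats-file");
        tx.commitTransaction();
        System.out.println(state.stats.containsKey(state.currentSnapshotId));
    }
}
```

Nothing is visible through `currentSnapshotId` until `commitTransaction` runs, so a reader in this model sees either the old state or the new snapshot together with its statistics.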




[GitHub] [iceberg] findepi commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by GitBox <gi...@apache.org>.
findepi commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1357926286

   The Update Statistics API requires passing a snapshot ID.
   @RussellSpitzer Is the snapshot ID known before the transaction commits?




[GitHub] [iceberg] findepi commented on issue #6442: Extends Iceberg table stats API to allow publish data and stats atomically

Posted by "findepi (via GitHub)" <gi...@apache.org>.
findepi commented on issue #6442:
URL: https://github.com/apache/iceberg/issues/6442#issuecomment-1601002374

   It remains needed by Trino.

