
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/09 21:47:30 UTC

[GitHub] [iceberg] huaxingao opened a new issue, #6549: Collecting Iceberg NDV Statistics for Spark Engine

huaxingao opened a new issue, #6549:
URL: https://github.com/apache/iceberg/issues/6549

   ### Feature Request / Improvement
   
   NDV (number of distinct values) is important information for query optimization. Currently, Trino calculates NDV statistics when analyzing a table and writes them to an Iceberg Puffin file, where the Trino query optimizer uses them to find the best query plan. However, Iceberg table-level NDV is not yet collected or used by the Spark engine. I have a [proposal](https://docs.google.com/document/d/1ZvTw9G2rLuETREC1MCoZubvHo2SpwTpEz_58F9jy07I/edit) to collect Iceberg NDV for the Spark engine so that this information can be used for Spark query optimization.
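
As a rough illustration of why an optimizer wants NDV, the sketch below applies two textbook cardinality formulas: equality selectivity of about 1/NDV, and an equi-join size of |A|·|B| / max(NDV_A, NDV_B). The function names are hypothetical, and this is a simplified model under a uniform-distribution assumption, not the code of Spark's or Trino's optimizer.

```python
def equality_selectivity(ndv):
    """Estimated fraction of rows matching `col = constant`,
    assuming values are uniformly distributed over `ndv` distinct values."""
    return 1.0 / ndv if ndv > 0 else 0.0


def estimate_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join estimate: |A| * |B| / max(ndv_A, ndv_B)."""
    denom = max(left_ndv, right_ndv)
    return 0 if denom == 0 else left_rows * right_rows // denom


# Hypothetical numbers: a 1,000,000-row fact table joined to a 1,000-row
# dimension table on a key with 1,000 distinct values on both sides.
filtered_fraction = equality_selectivity(4)
join_rows = estimate_join_rows(1_000_000, 1_000, 1_000, 1_000)
```

Without an NDV the optimizer has to guess the denominator, which is why a missing or stale value can lead to a poor join order.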
   
   ### Query engine
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376373980

   cc @aokolnychyi @RussellSpitzer @flyrain 



[GitHub] [iceberg] flyrain commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

flyrain commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1378049721

   The current implementation of Spark CBO only needs the table-level NDV. File-level sketches could be useful for future optimizations, though.



[GitHub] [iceberg] ajantha-bhat commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1378341579

   >  Spark currently doesn't use partition-level stats in CBO.
   
   >  The current implementation of Spark CBO only needs the table level NDV. File-level sketches could be useful for future optimization though.
   
   Interesting. Usually, the tables used in real-time applications are partitioned, and most of the query filters are on the partition column. Hence I thought partition-level stats would be more useful than table-level stats. But yeah, engines need more time to adopt them, I guess. Currently, Dremio uses partition stats.



[GitHub] [iceberg] ajantha-bhat commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

ajantha-bhat commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376687568

   @huaxingao: Thanks for the proposal. 
   
   I was about to implement a CALL procedure to collect partition-level stats.
   We can have table-level NDV stats collection first, and I can later improve it to collect partition-level stats, because we have concluded that writers cannot write stats in V2 format; that would require bumping the spec version to V3. So, the only way to collect stats in V2/V1 format is via ANALYZE TABLE or a CALL procedure.
   More info about partition-level stats can be found here: https://www.mail-archive.com/dev@iceberg.apache.org/msg03885.html
   
   Do partition-level stats help Spark 3.4 too?



[GitHub] [iceberg] github-actions[bot] closed issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] closed issue #6549: Collecting Iceberg NDV Statistics for Spark Engine
URL: https://github.com/apache/iceberg/issues/6549



[GitHub] [iceberg] flyrain commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

flyrain commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376508992

   > It would be ideal if we can calculate NDV synchronously in Iceberg in the future.
   
   It requires a more efficient way to store metadata in Iceberg. For example, replacing the manifest files with a key-value store.



[GitHub] [iceberg] github-actions[bot] commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1633358356

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.



[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376374413

   also cc @rdblue @findepi 



[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376489008

   Yes, we can take the stored procedure approach. The advantage of this approach is that we don't have to consider delete files separately, and it is more consistent with the Trino approach.
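
To illustrate the point about delete files, here is a toy model (all file names and data are hypothetical, and this is not Iceberg code). Computing NDV from a scan of live rows already accounts for deleted rows, whereas combining per-file distinct values while ignoring delete files can overcount.

```python
# Hypothetical toy table: each data file holds the values of one column.
data_files = {
    "file_a": ["x", "y", "z"],
    "file_b": ["z", "w"],
}
# Hypothetical position deletes: (file name, row position) pairs.
position_deletes = {("file_a", 1), ("file_b", 1)}


def ndv_from_scan(files, deletes):
    """Exact NDV over live rows only, as a snapshot scan would see them."""
    live = {
        value
        for name, rows in files.items()
        for pos, value in enumerate(rows)
        if (name, pos) not in deletes
    }
    return len(live)


def ndv_from_file_level(files):
    """NDV from the union of per-file distinct values, ignoring deletes."""
    return len(set().union(*(set(rows) for rows in files.values())))


scan_ndv = ndv_from_scan(data_files, position_deletes)  # live values: x, z
file_ndv = ndv_from_file_level(data_files)              # counts x, y, z, w
```

In this toy case the scan-based result is 2 while the file-level union reports 4, because the deleted "y" and "w" rows are still counted.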
   
   It would be ideal if we can calculate NDV synchronously in Iceberg in the future.



[GitHub] [iceberg] github-actions[bot] commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1656479587

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'



[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1377607338

   It's a great idea to collect partition stats in Iceberg. Thanks @ajantha-bhat.
   
   Spark currently doesn't use partition-level stats in CBO.
   



[GitHub] [iceberg] huaxingao commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

huaxingao commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1382430906

   Here is the [PR](https://github.com/apache/iceberg/pull/6582) for implementing a Spark stored procedure to collect NDV.



[GitHub] [iceberg] RussellSpitzer commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1377731239

   The beauty of file-level sketches is that we can do "per scan" NDV calculations, so we can actually do much better than partition stats. That said, I think the NDV sketches for files should probably still be stored in Puffin files.
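
The "per scan" idea relies on distinct-count sketches being mergeable. Below is a minimal sketch of that property using a simple K-minimum-values (KMV) estimator. KMV is chosen here only because it is short to write; it is an assumption for illustration and not the sketch format that Puffin files or any engine actually store. All names (`build_sketch`, `merge`, `estimate_ndv`, the file contents) are hypothetical.

```python
import hashlib

K = 256  # number of smallest hash values each sketch retains


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def build_sketch(values, k=K):
    """KMV sketch: the k smallest distinct hash values, sorted ascending."""
    return sorted({_hash01(v) for v in values})[:k]


def merge(sketches, k=K):
    """Union of sketches: the k smallest hashes across all inputs."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return sorted(merged)[:k]


def estimate_ndv(sketch, k=K):
    """Exact below k distinct hashes, otherwise the (k - 1) / kth-minimum estimate."""
    if len(sketch) < k:
        return len(sketch)
    return int((k - 1) / sketch[-1])


# Hypothetical per-file sketches; a scan that only touches file_a and
# file_b merges just those two and never reads file_c's sketch.
file_a, file_b, file_c = range(0, 100), range(50, 150), range(5000, 5100)
per_file = {name: build_sketch(vals)
            for name, vals in [("a", file_a), ("b", file_b), ("c", file_c)]}
scan_ndv = estimate_ndv(merge([per_file["a"], per_file["b"]]))
```

Because merging the per-file sketches yields the same sketch as building one over the combined rows, the NDV can be estimated for exactly the files a scan selects, which is finer-grained than a fixed per-partition number.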



[GitHub] [iceberg] RussellSpitzer commented on issue #6549: Collecting Iceberg NDV Statistics for Spark Engine

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on issue #6549:
URL: https://github.com/apache/iceberg/issues/6549#issuecomment-1376460828

   While it may be helpful to collect sketches at write time, for older tables and for a POC I think we should start with an "analyze"-like procedure that takes a specific snapshot and generates a Puffin file with all the expected NDV stats for the entire snapshot.

