Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/27 12:14:22 UTC

[GitHub] [iceberg] cccs-jc opened a new issue #2527: Spark Dynamic Partition Pruning

cccs-jc opened a new issue #2527:
URL: https://github.com/apache/iceberg/issues/2527


   I'm unable to use Spark dynamic partition pruning with an Iceberg table. Is this a known limitation of Iceberg?
   
   It seems that Iceberg does not provide the column statistics that Spark relies on.
   
   For example, `describe extended table1 column1` fails with:
   
   Describing columns is not supported for v2 tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] vusal-iom commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
vusal-iom commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-901552057


   I also hit the same problem while running the TPC-DS 10 TB benchmark. The reason is that only HadoopFsRelation is taken into account: https://github.com/apache/spark/blob/fceabe2372ab2a53401059e6019f441d0580aeab/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L64
   
   However, the latest Spark code already handles V2 data sources. See: https://issues.apache.org/jira/browse/SPARK-35779
   




[GitHub] [iceberg] RussellSpitzer commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-901586306


   Iceberg doesn't yet implement SupportsRuntimeFiltering, which was added in that PR
   for Spark 3.2 (we actually don't support 3.2 at all yet).
   
   




[GitHub] [iceberg] RussellSpitzer commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-827656513


   From the source, I don't think it's incompatible with Iceberg.
   
   If I'm reading this correctly:
   
   https://github.com/apache/spark/blob/19c7d2f3d8cda8d9bc5dfc1a0bf5d46845b1bc2f/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L130-L132
   
   There are basically two ways of estimating the filtering effectiveness:
      * a stats-based one, which you are correct we would not trigger, as we don't keep "distinct count" stats
      * a fallback method, which just uses a user-defined constant
        * conf.dynamicPartitionPruningFallbackFilterRatio
        
   In either case it then multiplies the effectiveness by the size the plan reports (which we do report):
   
   https://github.com/apache/spark/blob/19c7d2f3d8cda8d9bc5dfc1a0bf5d46845b1bc2f/sql/core/src/main/scala/org/apache/spark/sql/execution/dynamicpruning/PartitionPruning.scala#L154
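
The estimate described in this comment can be sketched in a few lines of Python. This is a simplified model based on one reading of the linked Spark source, not Spark's actual code: the function names `estimate_filter_ratio` and `pruning_has_benefit` are hypothetical, the comparison against the size of the other join side is an assumption about how the estimate is used, and Spark applies further conditions (join type, broadcast reuse) that are ignored here. The default of 0.5 is Spark's documented default for `spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio`.

```python
from typing import Optional


def estimate_filter_ratio(
    part_distinct: Optional[int],
    other_distinct: Optional[int],
    fallback_ratio: float = 0.5,
) -> float:
    """Estimate the fraction of the partitioned table that pruning would skip."""
    # Without usable distinct-count stats (the Iceberg case described above),
    # fall back to the user-defined constant.
    if part_distinct is None or part_distinct <= 0 or other_distinct is None:
        return fallback_ratio
    # The filtering side has at least as many distinct keys: nothing is pruned.
    if part_distinct <= other_distinct:
        return 0.0
    return 1.0 - other_distinct / part_distinct


def pruning_has_benefit(
    part_size_bytes: int, other_size_bytes: int, filter_ratio: float
) -> bool:
    """Assumed rule: the bytes saved must exceed the cost of scanning the other side."""
    return filter_ratio * part_size_bytes > other_size_bytes


# Hypothetical sizes: a 10 GiB fact table joined to a 100 MiB dimension table.
fact_bytes = 10 * 1024**3
dim_bytes = 100 * 1024**2
for ratio in (0.001, 0.5, 100):
    no_stats_ratio = estimate_filter_ratio(None, None, fallback_ratio=ratio)
    print(ratio, pruning_has_benefit(fact_bytes, dim_bytes, no_stats_ratio))
```

Under these assumptions, a very small fallback ratio makes pruning look unprofitable even for a large fact table, while a large ratio makes the size check pass almost always.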
   
     
      
      




[GitHub] [iceberg] cccs-jc commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
cccs-jc commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-830675833


   I created a mock fact table and a mock dimension table using a traditional Hive catalog. I was able to activate the dynamic partition pruning (DPP) optimization. It's quite easy to identify in the Spark UI. The query runs very fast when DPP is used.
   
   I then used the same mock data generator functions to create tables using Iceberg. I partitioned the fact table in the same way as with traditional Hive. I ran the exact same join; however, Spark uses a sort-merge join instead of the dynamic partition pruning optimization. It does not even use a broadcast join, which surprised me.
   
   I can reproduce the issue quite easily. What information would be useful to put in this issue?
   
   
    




[GitHub] [iceberg] pan3793 commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
pan3793 commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-903412789


   Spark 3.2 is coming (it is in the RC phase now). Does Iceberg have a timeline for Spark 3.2 support?




[GitHub] [iceberg] cccs-jc commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
cccs-jc commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-827729244


   Ha okay, I'm now setting the filter ratio:
   
   `.config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", 0.001)`
   or
   `.config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", 100)`
   
   I'm trying to determine whether filters will be pushed down. Should I see them pushed down in the BatchScan for the probing table? My explain plan does not show any pushed-down filters:
   
   `BatchScan[rrname#293, timeperiod#296] test_table1 [filters=]`
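
One way to check a plan for dynamic pruning without reading it line by line is to search the explain text for the marker Spark prints for pruning expressions. The sketch below is a plain-text helper, not a Spark API: `find_dpp_lines` and `plan_uses_dpp` are hypothetical names, and it assumes that a plan with DPP applied contains the substring `dynamicpruning` (as in `dynamicpruningexpression(...)`), which is how file-source plans are believed to render it. The second sample line is illustrative only and was not captured from a real run.

```python
import re
from typing import List

# Assumed marker: Spark renders pruning filters with names that start with
# "dynamicpruning"; matching is case-insensitive to be lenient.
_DPP_MARKER = re.compile(r"dynamicpruning", re.IGNORECASE)


def find_dpp_lines(plan_text: str) -> List[str]:
    """Return the plan lines that mention a dynamic pruning expression."""
    return [
        line.strip()
        for line in plan_text.splitlines()
        if _DPP_MARKER.search(line)
    ]


def plan_uses_dpp(plan_text: str) -> bool:
    """True if any line of the explain output mentions dynamic pruning."""
    return bool(find_dpp_lines(plan_text))


# The scan line reported in this comment: no pruning filter is present.
iceberg_scan = "BatchScan[rrname#293, timeperiod#296] test_table1 [filters=]"

# Illustrative line in the style of a file-source scan with DPP applied.
file_scan = (
    "PartitionFilters: [isnotnull(timeperiod#296), "
    "dynamicpruningexpression(timeperiod#296 IN dynamicpruning#310)]"
)

print(plan_uses_dpp(iceberg_scan))
print(plan_uses_dpp(file_scan))
```

The plan text can be pasted from the output of `explain` or from the SQL tab of the Spark UI.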




[GitHub] [iceberg] cccs-jc commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
cccs-jc commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-827771270


   I was able to reproduce my issue with just Parquet files (no Iceberg). I saw the sizes in the plan.
   
   So it looks like it's an issue in Spark or a misconfiguration on my part. I've reported my example to Spark:
   
   https://issues.apache.org/jira/browse/SPARK-35245
   
   




[GitHub] [iceberg] aokolnychyi commented on issue #2527: Spark Dynamic Partition Pruning

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #2527:
URL: https://github.com/apache/iceberg/issues/2527#issuecomment-903461057


   We will start work on supporting Spark 3.2 as soon as it is available. I'll also take care of implementing dynamic filtering.

