You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/22 07:14:51 UTC

[GitHub] [iceberg] singhpk234 opened a new issue #4188: [Spark] Support Runtime Filtering on Data Columns

singhpk234 opened a new issue #4188:
URL: https://github.com/apache/iceberg/issues/4188


   At present Iceberg only supports runtime Filtering on partition columns with **Apache Spark**, this issue proposes how we can achieve the runtime Filtering in data columns. Let’s say a runtime filter on data column is injected by spark to Iceberg at present we will simply evaluate it against partition columns and since the filter was present in data column level, hence a TrueLiteral will be returned.
   
   What we can do instead is we already have DataFile API’s to get lower / upper bounds per column, we can use this to evaluate runtime filter and filter out FileScanTask’s since the place where we apply runtime filtering we have have these scan tasks and within them the data file.
   
   ```
   Let’s say we get a filter like :
   
   dataColumn IN (1, 21, ....100)
   
   dataColumn range from DataFile1 (20 , 70)
   
   we can create filters such as : 
   - (1 >= 20  && 1 <= 70) || (21 >= 20 && 21 <= 70) || (100 >= 20 && 100 <= 70)
   
   ```
   
   even we can do one step optimization as parquet does when number of literals within IN subQuery is above certain threshold it converts the IN to a range query : (ref : [SPARK-32792](https://issues.apache.org/jira/browse/SPARK-32792)). 
   Note : Having a large IN set size can lead to performance degradation when the filter is evaluated at DS level
   
   ```
   - dataColumn IN (1, 21, 22 ,25, 30, 35...100)
   
   When inset size > certain threshold, converted to :
   
   - dataColumn >= 1 && dataColumn <= 100
   
   dataColumn range from DataFile1 (20 , 70)
   
   we can create filters such as : 
   - (20 >= 1 && 70 <= 100) || (20 >= 1 && 20 <= 100) || (70 >= 1 && 70 <= 100)
   
   ```
   
   Now to achieve this there is a pre-requisite that spark **should add dynamic filter on data columns** so that it gets percolated via DSV2 code path. There have been past attempts in spark to achieve the same. ref : [pull/30146](https://github.com/apache/spark/pull/30146)
   
   Pros : 
   
   * We will seamlessly support dynamic Pruning on data column when it’s introduced in spark
   * It will be good to have for non-partitioned table 
   
   Cons : 
   
   * The code added will be present as dead-code until this functionality is introduced in spark
   
   references : 
   [1] Trino : [[CodePointer](https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergSplitSource.java#L213-L220)]
   [2] SparkDataFile API’s [[CodePointer](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/SparkDataFile.java#L145-L160)]
   [3] Impala : https://issues.apache.org/jira/browse/IMPALA-3654
   
   cc @rdblue @aokolnychyi @RussellSpitzer @jackye1995 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 edited a comment on issue #4188: [Spark] Support Runtime Filtering on Data Columns

Posted by GitBox <gi...@apache.org>.

singhpk234 edited a comment on issue #4188:
URL: https://github.com/apache/iceberg/issues/4188#issuecomment-1053957571


   ping !!
   
   ---
   
   Happy to take this up if the community is inclined with this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on issue #4188: [Spark] Support Runtime Filtering on Data Columns

Posted by GitBox <gi...@apache.org>.

singhpk234 commented on issue #4188:
URL: https://github.com/apache/iceberg/issues/4188#issuecomment-1053957571


   ping !!
   
   ---
   
   Happy to take this up if the community thinks it adds value.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on issue #4188: [Spark] Support Runtime Filtering on Data Columns

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on issue #4188:
URL: https://github.com/apache/iceberg/issues/4188#issuecomment-1056102348


   I think we should be really careful with allowing filtering on arbitrary data columns. In a lot of cases, this may end up useless or even degrade the performance. We currently only allow filtering on data columns used as source fields for partition transforms.
   
   I would consider allowing filtering by sort columns defined in the table. However, I would probably (not sure) hide it under a flag so that users would have to opt in.
   
   Thoughts, @rdblue @RussellSpitzer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org