Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/08 18:26:38 UTC

[GitHub] [iceberg] puchengy opened a new issue, #4997: [BUG] Spark 3.2 query's partition filter broken when there is a UDF inside

puchengy opened a new issue, #4997:
URL: https://github.com/apache/iceberg/issues/4997

   Hey, we are seeing a Spark 3.2 query's partition filter break when there is a UDF inside. For example,
   
   ```
   select * from tbl where dt between date_add('2022-01-01', -1) and '2022-01-01'
   ```
   
   The query plan shows
   ```
   ...
   spark_catalog.db.tbl[filters=dt IS NOT NULL, dt <= '2022-01-01']
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154430977

   @puchengy The example you pasted shows there is a function applied to the column
   
   see
   
   ```
   (cast(dt#119 as date) >= 2022-06-05) 
    AND (dt#119 <= 2022-06-06))
   ```
   
   In general this is a problem for Datasources, although in 3.2? 3.3? a rule was added to force cast the literal side of the predicate but I think that's still limited only to certain types of literals. Basically the issue is that if Spark thinks a function like `cast` must be applied to a DataSource Column then it is not allowed to push down that predicate. It can only push down predicates where the columns are not modified in any way before being compared to a literal.
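The rule Russell describes can be sketched mechanically. Below is a minimal Python illustration (not Spark's actual code; all class names are invented for illustration) of why any function, such as a cast, on the column side disqualifies a predicate from pushdown:

```python
# Illustrative sketch: a datasource can only accept predicates of the
# form <bare column> <op> <literal>. Wrapping the column in any
# expression, e.g. a cast, makes the predicate non-pushable.

from dataclasses import dataclass
from typing import Any


@dataclass
class Col:
    name: str


@dataclass
class Cast:
    child: Any
    to: str


@dataclass
class Predicate:
    left: Any      # column side, possibly wrapped in a Cast
    op: str
    literal: Any


def is_pushable(pred: Predicate) -> bool:
    # Only an unmodified column reference can be pushed down.
    return isinstance(pred.left, Col)


dt = Col("dt")
assert is_pushable(Predicate(dt, "<=", "2022-01-01"))
assert not is_pushable(Predicate(Cast(dt, "date"), ">=", "2022-06-05"))
```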
   
   I wrote about this a long time ago [here](https://www.russellspitzer.com/2016/04/18/Catalyst-Debugging/)
   
   The issue in your particular query is that _dt_ is a **string** and the output of your function is a **date**. Spark needs to resolve this mismatch, so it casts _dt_ to a date so the types match. The types now match, but it is now impossible to push down the predicate. To fix it, you cast the function's output to **string** before Spark has a chance to cast _dt_. You'll notice that in your example the other predicate is always pushed down correctly, because it is a literal string being compared directly to the unmodified column.
   
   This has been the case in Spark for external datasources for a very long time so I don't think it's a new issue.
   
   In this case you fix it by casting the "literal" part of the predicate to match the column type.
   ```sql
   select * from tbl where dt between cast(date_add('2022-01-01', -1) as String) and '2022-01-01'
   ```
   




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150618527

   > @kbendick Hi thanks for the suggestion, I think we can have what you mentioned as a workaround for now, however, we still want more for it to be resolved as it is not easy for us to change every SQL query that fall into this case for our customers.
   > 
   > with that I will ping the reviewer of #3400 here: hey, @rdblue @flyrain @karuppayya @szehon-ho @RussellSpitzer @aokolnychyi when you get a chance, could you take a look? thanks.
   
   I agree it's not a good solution for pure SQL users. I'm surprised that disabling the broadcast join didn't help. The people you tagged would know more than I would.
   
   But I just ran a full test using the syntax `select * from db.tbl WHERE dt between '2022-06-06' - interval 1 days AND '2022-06-06';`, and the pushdown worked fully.
   
   ```
   spark-sql> create table date_partition_pushdown_test(id bigint, dt date) using iceberg partitioned by (dt);
   spark-sql> insert into date_partition_pushdown_test VALUES(1, date '2022-06-06'), (2, date '2022-06-05'), (3, date '2022-06-04');
   spark-sql> explain extended select * from date_partition_pushdown_test where dt between '2022-06-06' - interval 1 days AND '2022-06-06'
            > ;
   == Parsed Logical Plan ==
   'Project [*]
   +- 'Filter (('dt >= (2022-06-06 - INTERVAL '1' DAY)) AND ('dt <= 2022-06-06))
      +- 'UnresolvedRelation [date_partition_pushdown_test], [], false
   
   == Analyzed Logical Plan ==
   id: bigint, dt: date
   Project [id#6L, dt#7]
   +- Filter ((dt#7 >= cast(cast(2022-06-06 - INTERVAL '1' DAY as string) as date)) AND (dt#7 <= cast(2022-06-06 as date)))
      +- SubqueryAlias demo.date_partition_pushdown_test
         +- RelationV2[id#6L, dt#7] demo.date_partition_pushdown_test
   
   == Optimized Logical Plan ==
   Filter ((isnotnull(dt#7) AND (dt#7 >= 2022-06-05)) AND (dt#7 <= 2022-06-06))
   +- RelationV2[id#6L, dt#7] demo.date_partition_pushdown_test
   
   == Physical Plan ==
   *(1) Project [id#6L, dt#7]
   +- *(1) Filter ((isnotnull(dt#7) AND (dt#7 >= 2022-06-05)) AND (dt#7 <= 2022-06-06))
      +- BatchScan[id#6L, dt#7] demo.date_partition_pushdown_test [filters=dt IS NOT NULL, dt >= 19148, dt <= 19149] RuntimeFilters: []
   ```
   




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150600979

   You’re right, they don’t.
   
   Instead of using `cast` in the interval one, I’d suggest using something like `DATE “2022-06-05”` (but I might be mixing up my Flink SQL and Spark SQL).
   
   As a general guideline the explicit use of cast operator tends to limit filter push down.
   
   Some of the folks who worked on https://github.com/apache/iceberg/pull/3400 might be better able to help with this issue than I am.
   
   For now, if this runs inside of a Java / Scala / Python program, I’d suggest generating the WHERE clause as a string via the programming language while we gather further input.




[GitHub] [iceberg] lirui-apache commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "lirui-apache (via GitHub)" <gi...@apache.org>.
lirui-apache commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1650970214

   Thanks @RussellSpitzer and @puchengy for your input. In my example above, I guess casting the column versus casting the literal has different semantics: e.g. both `cast('123' as int) = 123` and `cast('0123' as int) = 123` yield true, but `'0123' = cast(123 as string)` is false. We can educate users not to write such filters, but sometimes the query is generated by BI tools and users may not have control over that. So I suppose we'll have to implement split-level filtering in Spark like Trino does.
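The asymmetry described above is easy to demonstrate outside Spark. A minimal Python sketch of the two rewrites of the filter `dt = 123` on a string column (function names are illustrative only):

```python
# The two rewrites of `dt = 123` on a string column are not equivalent
# when dt holds a non-canonical value like '0123'.

def cast_column_to_int(dt: str, literal: int) -> bool:
    # cast(dt as int) = 123  -> matches both '123' and '0123'
    return int(dt) == literal


def cast_literal_to_string(dt: str, literal: int) -> bool:
    # dt = cast(123 as string) -> exact string comparison
    return dt == str(literal)


assert cast_column_to_int("123", 123) is True
assert cast_column_to_int("0123", 123) is True   # leading zero still matches
assert cast_literal_to_string("0123", 123) is False
```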




[GitHub] [iceberg] kbendick commented on issue #4997: [BUG] Spark 3.2 query's partition filter broken when there is a UDF inside

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150255006

   Hi @puchengy!
   
   Can you please indicate what version of Spark (full version, including patch, and whether or not it's a fork), as well as the exact Iceberg version you're using?
   
   Also, can you please add the full (detailed) query plan? Sometimes filters show up kind of funny as some of them might be pushed down further (or handled farther up).
   
   Thank you!




[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154529386

   @puchengy There were ... But it's been a while since I saw them and I searched for a bit and couldn't find them again. I'm pretty sure the issue with fully applying the fix was that apparently it was always safer to call cast on the table column than on the literal ... I think .... I can't remember the full logic or what the pr was called. I thought the logic would be in Analyzer.scala but I didn't see it when I skimmed through.




[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1650035076

   Yeah it's still a Spark issue and could be fixed in Spark. A rule like "never cast a column when you could cast a literal"




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150528263

   Also, thinking out loud here, but what plan gets generated if you use `INTERVAL`? For example `explain select * from db.tbl where dt between '2022-01-01' - interval 1 days AND '2022-01-01'` (or something equivalent).
   
   If that worked, that would be my suggestion solution as disabling broadcast joins is not exactly the best resolution.




[GitHub] [iceberg] puchengy closed issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy closed issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator
URL: https://github.com/apache/iceberg/issues/4997




[GitHub] [iceberg] lirui-apache commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "lirui-apache (via GitHub)" <gi...@apache.org>.
lirui-apache commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1649555725

   Hey guys, we're facing a similar issue where our table is also partitioned by `dt string`, and our users sometimes write filters like `dt = 20230725` by mistake. The filter is translated into `cast(dt as int) = 20230725`, can't be pushed down to Iceberg, and the query ends up as a full table scan.
   I wonder whether there's a solution for this. We're considering disallowing full table scans in `SparkBatchQueryScan`, but that seems unfair because there is indeed a partition filter in the query. Trino handles such cases by evaluating the filter expression at split level, because the partition column is a known literal for each split. So perhaps we can do the same in Spark?
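The split-level approach described above can be sketched in a few lines. This is plain Python, not Trino or Spark code; the split layout and field names are invented for illustration:

```python
# Sketch of split-level filtering: the partition value is a known
# constant for each split, so even a filter that casts the column,
# e.g. cast(dt as int) = 20230725, can be evaluated per split to
# prune work without pushdown support.

splits = [
    {"path": "dt=20230724/f1.parquet", "dt": "20230724"},
    {"path": "dt=20230725/f2.parquet", "dt": "20230725"},
    {"path": "dt=20230726/f3.parquet", "dt": "20230726"},
]


def matches(split: dict, literal: int) -> bool:
    # Evaluate cast(dt as int) = literal with dt bound to the
    # split's partition value.
    return int(split["dt"]) == literal


kept = [s["path"] for s in splits if matches(s, 20230725)]
assert kept == ["dt=20230725/f2.parquet"]
```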




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150602784

   I just tried it, and you can use the string without a cast. That might help for the interval one.
   
   ```
   spark-sql> select '2022-01-01' - interval 1 days;
   2021-12-31 00:00:00
   Time taken: 2.867 seconds, Fetched 1 row(s)
   ```
   
   So is it not possible to do `select * from db.tbl WHERE dt between '2022-06-06' - interval 1 days AND '2022-06-06';`?
   
   I'm thinking the removal of the explicit cast might help (as you can see the cast in the filter in the second plan).




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150604767

   @kbendick Hi thanks for the suggestion, I think we can have what you mentioned as a workaround for now, however, we still want more for it to be resolved as it is not easy for us to change every SQL query that fall into this case for our customers.
   
   with that I will ping the reviewer of #3400 here: hey, @rdblue @flyrain @karuppayya @szehon-ho @RussellSpitzer when you get a chance, could you take a look? thanks.




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150628646

   Sure thing. Like I said, my table was `create table tbl(id bigint, dt date) using iceberg partitioned by (dt)`. So an Iceberg table with an identity transformation.
   
   Do let us know what you find if it’s relevant at all to the community!




[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1207498607

   It's a Spark limitation: Spark can only pass filter expressions to external Datasources, and those previously just included basic predicates, although with V2 expressions we will probably have more latitude here.




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "puchengy (via GitHub)" <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1650983153

   Not sure what that means, but I would be happy if you can make a contribution later on! Thanks in advance
   
   




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150673172

   Also, Iceberg does have partition evolution. You could consider updating the partition spec to use a proper date type going forward. Given that Iceberg's partitioning is hidden, the change could potentially be made in a way that is transparent to end users.
   
   But I would suggest looking into the changes you’ve made to your Spark distribution as well as the behavior in OSS spark.
   
   And if possible closing this issue and linking to it from a new issue when you’ve gathered more info, as the issue title is somewhat misleading now imo and we’re also emailing a number of people by still using this issue. I’ll leave that choice up to you. 🙂 




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150627437

   @kbendick Thank you so much for your help. I will work on this on my own and run some small tests; my goal is to find the issue with your input, or to provide a set of queries that can reproduce it.
   
   For others: no need to look into this right now, thanks.




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154412690

   @kbendick Hey sorry for the delay on the reply. In our spark version, we are seeing the filter pushed down to the scan operator. 
   
   ```
   spark-sql> create table tbl_test_predicate_4 (id bigint, dt string) using parquet partitioned by (dt);
   Response code
   Time taken: 1.03 seconds
   spark-sql> explain select * from tbl_test_predicate_4 where dt between date_add('2022-06-06', -1) AND '2022-06-06';
   plan
   == Physical Plan ==
   *(1) ColumnarToRow
   +- FileScan parquet pyang.tbl_test_predicate_4[id#12L,dt#13] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#13), (cast(dt#13 as date) >= 2022-06-05), (dt#13 <= 2022-06-06)], PushedFilters: [], ReadSchema: struct<id:bigint>
   ```
   




[GitHub] [iceberg] kbendick commented on issue #4997: [BUG] Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150339974

   I know that for certain types of partition predicates (at least for aggregates, I think), there are certain joins that need to be disabled.
   
   I'm not sure that's the case here (and actually don't think it is), but it's worth mentioning.
   
   Can you provide the same explain output with a plain Parquet table as well?




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150606785

   @puchengy with the suggested syntax (the latest interval one w/o the cast), I think the filter pushdown will work. I just tried something similar and it did.




[GitHub] [iceberg] kbendick commented on issue #4997: [BUG] Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150349886

   This relates to https://github.com/apache/iceberg/pull/3400
   
   You have to disable broadcast joins for dynamic partition filtering to work, as broadcast joins are generally still considered faster.
   
   Can you set `spark.sql.autoBroadcastJoinThreshold = -1` in your spark config and then try running the explain again?




[GitHub] [iceberg] sunchao commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
sunchao commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1208324423

   @puchengy it seems so. That rule currently only handles casts between numeric types, but it seems it could also be extended to handle casts between date and string (although we should be conservative there and not allow any invalid case to slip through). Feel free to raise a PR in Spark if you want to work on it.
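The need for caution is real: naively unwrapping a string-to-date cast changes results whenever string ordering and date ordering disagree. A plain Python illustration of the mismatch (not Spark code; it assumes, as Spark's cast does, that non-canonical spellings like `2022-6-4` parse as valid dates):

```python
# Why rewriting cast(dt as date) >= DATE '2022-06-05' into the plain
# string comparison dt >= '2022-06-05' is unsafe: string ordering and
# date ordering disagree for non-canonical date spellings.

from datetime import date

lit = date(2022, 6, 5)
dt = "2022-6-4"                # parses as 2022-06-04 under a permissive cast
dt_as_date = date(2022, 6, 4)  # what such a cast would produce

original = dt_as_date >= lit    # the predicate Spark actually evaluates
unwrapped = dt >= "2022-06-05"  # the naively unwrapped predicate

assert original is False
assert unwrapped is True  # '6' > '0' at index 5, so the strings compare high
```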




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1207663648

   @RussellSpitzer is it possible to add similar support as what https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala has done?




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "puchengy (via GitHub)" <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1650082595

   My organization ended up doing something internal. It is admittedly hacky, but we think it fits our cases (we haven't gotten any feedback on it yet). We implemented a custom Iceberg Spark extension that inspects the query plan and rejects any query whose filter (1) has a child that is a scan operator on an Iceberg table, and (2) compares a string partition column with a date literal.
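   For illustration, the kind of query such a rule would reject versus accept (table and column names are hypothetical):
   
   ```sql
   -- Rejected: comparing the string partition column dt against a date
   -- value forces cast(dt as date) and silently blocks pushdown.
   SELECT * FROM db.tbl WHERE dt >= date_sub('2022-06-06', 1);
   
   -- Accepted: casting the literal back to string leaves dt untouched,
   -- so the predicate can be pushed to the Iceberg scan.
   SELECT * FROM db.tbl WHERE dt >= cast(date_sub('2022-06-06', 1) AS string);
   ```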




[GitHub] [iceberg] lirui-apache commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by "lirui-apache (via GitHub)" <gi...@apache.org>.
lirui-apache commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1651378211

   I haven't checked whether it's possible with Spark. I'll give it a try.




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150651922

   @puchengy can you test whether the same table in Spark, but created not as an Iceberg table (e.g. `using parquet`), would get predicate pushdown?
   
   If not, then this likely needs to be addressed upstream in Spark. If it _does_ push down when using a string column with the `date_add` function, there might be an optimizer rule we can ensure is run.




[GitHub] [iceberg] kbendick commented on issue #4997: [BUG] Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150343103

   Also, can you remove the `[BUG]` from the title? It's slightly misleading imo. We can add it back if others agree it is a bug, but seeing `[BUG]` like that makes me think there's incorrect output, which isn't the case.
   
   If we check and plain Spark with Parquet has the same behavior, then it's more of a feature request.
   
   I think `BUG` should be limited to issues of highest concern, namely ones that cause incorrect output.
   
   Some of us have been discussing issue tags and we will hopefully talk about it at the next community sync up, so this is just my opinion at the moment. But seeing `[BUG]` in an issue could be a little alarming for newer users etc. Especially if this is the same behavior that a plain table stored `as parquet` would give.




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1151339181

   @puchengy let me know if you can test this out with plain old `parquet` tables. When I use spark-sql locally, it's hard to avoid having the table's partition data already cached. I'm not sure whether this would happen in a full setup, though.




[GitHub] [iceberg] puchengy commented on issue #4997: [BUG] Spark 3.2 query's partition filter broken when there is a UDF inside

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150283584

   @kbendick Hey, here is the information:
   
   spark version: 3.2.0, with some custom changes.
   iceberg version: [0.13.1](https://github.com/pinterest/iceberg/commits/pinterest-release-0.13.1), with some patches from the iceberg master branch
   
   full query plan (with some columns masked out):
   
   ```
   spark-sql> explain select * FROM db.tbl WHERE dt BETWEEN date_sub('2022-06-06', 1) and '2022-06-06';
   plan
   == Physical Plan ==
   *(1) Project [col1#66, col2#67L, ..., dt#89, hr#90]
   +- *(1) Filter (((cast(dt#89 as date) >= 2022-06-05) AND isnotnull(dt#89)) AND (dt#89 <= 2022-06-06))
      +- BatchScan[col1#66, col2#67L, ..., dt#89, hr#90] spark_catalog.db.tbl [filters=dt IS NOT NULL, dt <= '2022-06-06'] RuntimeFilters: []
   
   
   Time taken: 35.89 seconds, Fetched 1 row(s)
   ```
   
   




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154468929

   @RussellSpitzer BTW, do you know of any ongoing issues in the Spark community that we can follow? Thanks.




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150594476

   @kbendick thanks for the response. Here are the query plans from the changes you suggested I try; however, neither seems to help much.
   
   > Can you set spark.sql.autoBroadcastJoinThreshold = -1 in your spark config and then try running the explain again?
   
   ```
   spark-sql> set spark.sql.autoBroadcastJoinThreshold=-1;
   spark-sql> explain select * FROM db.tbl WHERE dt BETWEEN date_sub('2022-06-06', 1) and '2022-06-06';
   ...
   plan
   == Physical Plan ==
   *(1) Project [col1#36, col2#37L, ..., dt#59, bool_partition_col#60]
   +- *(1) Filter (((cast(dt#59 as date) >= 2022-06-05) AND isnotnull(dt#59)) AND (dt#59 <= 2022-06-06))
      +- BatchScan[col1#36, col2#37L, ..., dt#59, bool_partition_col#60] spark_catalog.db.tbl [filters=dt IS NOT NULL, dt <= '2022-06-06'] RuntimeFilters: []
   ```
   
   >  but what plan gets generated if you use INTERVAL?
   ```
   spark-sql> explain select * FROM db.tbl WHERE dt BETWEEN cast('2022-06-06' as date) - interval 1 day and '2022-06-06';
   ...
   plan
   == Physical Plan ==
   *(1) Project [col1#96, col2#97L, app#98, ..., dt#119, bool_partition_col#120]
   +- *(1) Filter (((cast(dt#119 as date) >= 2022-06-05) AND isnotnull(dt#119)) AND (dt#119 <= 2022-06-06))
      +- BatchScan[col1#96, col2#97L, ..., dt#119, bool_partition_col#120] spark_catalog.db.tbl [filters=dt IS NOT NULL, dt <= '2022-06-06'] RuntimeFilters: []
   ```




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150623315

   I suspect that your table partitioning might be different in some way. If you can, please share your create table DDL (or the relevant portions). I was actually able to make `date_add` push down _without_ disabling broadcast joins:
   
   ```
   spark-sql> explain extended select * from date_partition_pushdown_test where dt between date_add('2022-06-06', -1) AND '2022-06-06';
   ...
   == Physical Plan ==
   *(1) Project [id#20L, dt#21]
   +- *(1) Filter ((isnotnull(dt#21) AND (dt#21 >= 2022-06-07)) AND (dt#21 <= 2022-06-06))
      +- BatchScan[id#20L, dt#21] demo.date_partition_pushdown_test [filters=dt IS NOT NULL, dt >= 19150, dt <= 19149] RuntimeFilters: []
   ```
   
   As you can see, the filter on dt (on the Iceberg partitioning) is pushed down entirely.
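   For comparison, a DDL that would produce this behavior uses a date-typed partition column (a sketch; the actual DDL for this table wasn't posted):
   
   ```sql
   -- With dt declared as date, no cast is needed on the column side,
   -- so both bounds of the BETWEEN can be pushed to the scan.
   CREATE TABLE demo.date_partition_pushdown_test (id bigint, dt date)
   USING iceberg
   PARTITIONED BY (dt);
   ```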




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1150635141

   @kbendick I found out the issue and can provide reproducible queries. The difference is, in my case, the partition column is of `string` type.
   
   ```
   spark-sql> create table tbl_test_predicate_3 (id bigint, dt string) using iceberg partitioned by (dt);
   spark-sql> explain select * from tbl_test_predicate_3 where dt between date_add('2022-06-06', -1) AND '2022-06-06';
   plan
   == Physical Plan ==
   *(1) Project [id#103L, dt#104]
   +- *(1) Filter (((cast(dt#104 as date) >= 2022-06-05) AND isnotnull(dt#104)) AND (dt#104 <= 2022-06-06))
      +- BatchScan[id#103L, dt#104] spark_catalog.pyang.tbl_test_predicate_3 [filters=dt IS NOT NULL, dt <= '2022-06-06'] RuntimeFilters: []
   ```




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1207476648

   @RussellSpitzer @kbendick Hey, just to follow up on the potential suggestion here. Do you have any suggestions on how to fix this? Our users are hitting this issue more and more often, and we have to keep educating them, which slows down Iceberg adoption. It would be great if we could resolve it at the engine level.
   
   @RussellSpitzer do you see it as doable to add what we are seeing here to this optimizer as well? https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala




[GitHub] [iceberg] puchengy commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
puchengy commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154465738

   @RussellSpitzer got it, thank you very much! And thanks @kbendick as well!!
   
   I am closing this issue right now.




[GitHub] [iceberg] kbendick commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
kbendick commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154589332

   Thank you @RussellSpitzer for bringing the heat in terms of Catalyst knowledge and for your wonderful blog... you're gonna need to blog more because I'm gonna be sending a lot of traffic that way =)
   
   I had suspected this was generally the case, but the solution of casting to a string again before the comparison to the column escaped me. Thank you!
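   As a concrete example of that workaround (column and table names follow the plans above; the rewrite itself is a sketch):
   
   ```sql
   -- Instead of letting Spark cast the string column dt to date:
   SELECT * FROM db.tbl
   WHERE dt BETWEEN date_sub('2022-06-06', 1) AND '2022-06-06';
   
   -- cast the computed date back to string so dt stays bare and both
   -- bounds can be pushed down to the Iceberg scan:
   SELECT * FROM db.tbl
   WHERE dt BETWEEN cast(date_sub('2022-06-06', 1) AS string) AND '2022-06-06';
   ```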




[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154530854

   Oh, I'm silly, my coworker actually did all of this work.
   
   Here is the first PR @sunchao submitted a long time ago :). 
   
   https://github.com/apache/spark/commit/3d08084022a4365966526216a616a3b760450884 
   
   https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala

