Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/13 20:56:14 UTC

[GitHub] [iceberg] RussellSpitzer commented on issue #4997: Spark 3.2 query's partition filter with UDF inside did not get pushed to BatchScan operator

RussellSpitzer commented on issue #4997:
URL: https://github.com/apache/iceberg/issues/4997#issuecomment-1154430977

   @puchengy The example you pasted shows that a function is being applied to the column. See:
   
   ```
   (cast(dt#119 as date) >= 2022-06-05) 
    AND (dt#119 <= 2022-06-06))
   ```
   
   In general this is a problem for DataSources. In 3.2 (or 3.3?) a rule was added that forces the cast onto the literal side of the predicate instead, but I think it is still limited to certain types of literals. Basically, if Spark decides that a function like `cast` must be applied to a DataSource column, it is not allowed to push that predicate down. It can only push down predicates in which the column is compared to a literal without being modified in any way first.
   
   I wrote about this a long time ago [here](https://www.russellspitzer.com/2016/04/18/Catalyst-Debugging/)
   
   The issue in your particular query is that _dt_ is a **string** and the output of your function is a **date**. Spark needs to resolve this mismatch, so it casts _dt_ to a date. The types now match, but the predicate can no longer be pushed down. To fix it, cast the function's output to **string** before Spark has a chance to cast _dt_. You'll notice that in your example the other predicate is always pushed down correctly, because it compares the unmodified column to a string literal.
   
   This has been the case in Spark for external datasources for a very long time so I don't think it's a new issue.
   
   In this case you fix it by casting the "literal" part of the predicate to match the column type.
   ```sql
   select * from tbl where dt between cast(date_add('2022-01-01', -1) as String) and '2022-01-01'
   ```
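   
   One way to check this is to compare the plans for the two forms of the query. This is only a sketch: `tbl` and `dt` are the names from the example above, and the first query is an assumed reconstruction of the failing shape, not the original query from the issue.
   
   ```sql
   -- Assumed failing shape: date_add returns a date, so Spark casts the string column dt.
   explain formatted
   select * from tbl where dt between date_add('2022-01-01', -1) and '2022-01-01';
   
   -- Fixed shape: the function output is cast to string, so dt is left untouched.
   explain formatted
   select * from tbl where dt between cast(date_add('2022-01-01', -1) as string) and '2022-01-01';
   ```
   
   In each plan, look at the scan node for `tbl` and check whether both bounds on `dt` show up in its filters. The exact output format depends on the Spark and Iceberg versions in use.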
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

