Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/06 11:25:39 UTC

[GitHub] [spark] Kimahriman commented on a diff in pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources

Kimahriman commented on code in PR #38003:
URL: https://github.com/apache/spark/pull/38003#discussion_r988913086


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala:
##########
@@ -70,8 +70,9 @@ abstract class FileScanBuilder(
   }
 
   override def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
+    val (deterministicFilters, nonDeterminsticFilters) = filters.partition(_.deterministic)

Review Comment:
   The difference from V1 isn't the only problem. Many built-in non-deterministic functions fail when pushed down as partition filters (which, I guess, are evaluated before execution actually starts) with:
   
   ```
   java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval.
   ```
   The same applies to `input_file_name` at least; I'm not sure what else. This is compounded by the fact that expressions referencing no attributes get pushed as "partition" filters too, because an empty set is a subset of every set. For example, `rand() > 0.5` gets pushed as a partition filter, as does `length(input_file_name()) > 0`. These are obviously trivial examples, but there are more realistic use cases for things like this.
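   
   To illustrate the empty-set point, here is a minimal standalone sketch that does not use Spark at all. `Expr`, `isPartitionFilter`, and the column names are hypothetical stand-ins, and the subset check is an assumption modeled on the behavior described above, not the actual Spark code:
   
   ```scala
   // Hypothetical stand-in for a Catalyst expression: a SQL string,
   // the column names it references, and a determinism flag.
   final case class Expr(sql: String, references: Set[String], deterministic: Boolean)
   
   object PartitionFilterSketch {
     // Assumed classification rule: a filter counts as a partition filter
     // when every column it references is a partition column.
     def isPartitionFilter(e: Expr, partitionCols: Set[String]): Boolean =
       e.references.subsetOf(partitionCols)
   
     val partitionCols: Set[String] = Set("year", "month")
   
     val onPartition = Expr("year = 2022", Set("year"), deterministic = true)
     val onData      = Expr("value > 1", Set("value"), deterministic = true)
     // References no columns, so its reference set is empty.
     val noRefs      = Expr("rand() > 0.5", Set.empty, deterministic = false)
   }
   ```
   
   Under this rule `noRefs` is classified as a partition filter even though it has nothing to do with partitioning, and that stays true for a table with no partition columns at all, since the empty set is a subset of itself.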
   
   The no-referenced-attributes behavior could probably be considered a separate issue/bug, but the problem here still exists. If someone wants to figure out how to make those expressions compatible with partition filters they could, but right now at least some of it is broken, so they should probably just be blanket filtered out until then?
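   
   As a rough sketch of what "blanket filtered out" could mean, the following standalone Scala splits filters on a determinism flag before anything is considered for pushdown. `SketchFilter` and `splitForPushdown` are hypothetical names for illustration only; this is not the code in the PR:
   
   ```scala
   // Hypothetical stand-in for a filter expression with a determinism flag.
   final case class SketchFilter(sql: String, deterministic: Boolean)
   
   object DeterministicPushdownSketch {
     // Returns (filters that may be considered for pushdown,
     //          filters that must be evaluated after the scan).
     // Non-deterministic filters are never handed to the scan, whether or
     // not they would otherwise look like partition filters.
     def splitForPushdown(
         filters: Seq[SketchFilter]): (Seq[SketchFilter], Seq[SketchFilter]) =
       filters.partition(_.deterministic)
   }
   ```
   
   With this split, `rand() > 0.5` and `length(input_file_name()) > 0` would both end up in the second group and stay in the post-scan filter, assuming they are flagged as non-deterministic.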



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

