Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/03/19 13:41:16 UTC

[GitHub] [spark] cloud-fan commented on a change in pull request #31848: [SPARK-33482][SPARK-34756][SQL] Fix FileScan equality check

cloud-fan commented on a change in pull request #31848:
URL: https://github.com/apache/spark/pull/31848#discussion_r597685965



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -84,11 +85,25 @@ trait FileScan extends Scan
 
   protected def seqToString(seq: Seq[Any]): String = seq.mkString("[", ", ", "]")
 
+  private lazy val (normalizedPartitionFilters, normalizedDataFilters) = {
+    val output = readSchema().toAttributes.map(a => a.withName(normalizeName(a.name)))

Review comment:
       Thinking about it again, the `FileScan` equality already considers `fileIndex` and `readSchema`, which means two file scans are only equal if they read the same set of files and the same set of columns.
   
   Given that, I think expr IDs do not matter for the filters; only the column names matter. Normal v2 sources use `Filter` rather than `Expression`, and `Filter` does not carry expr IDs either.
   
   The data/partition filters are created in `PruneFileSourcePartitions` (see https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L51), and the column names inside the filters are already normalized w.r.t. the actual file scan output schema, so we don't need to consider case sensitivity here.
   
   That said, I think the normalization logic here can be very simple: just rewrite all expr IDs to 0.
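   The idea of "turn expr IDs to 0" can be illustrated with a small, self-contained sketch. Note this uses hypothetical stand-in classes (`Attr`, `EqualTo`, `Literal`, a miniature `ExprId`), not the real Catalyst `Expression` hierarchy, purely to show why two structurally identical filters that differ only in expr IDs become equal after normalization:

```scala
// Self-contained sketch: hypothetical stand-ins for Catalyst's
// ExprId / AttributeReference, illustrating the normalization idea only.
case class ExprId(id: Long)

sealed trait Expr
case class Attr(name: String, exprId: ExprId) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class Literal(value: Any) extends Expr

// Rewrite every attribute's exprId to 0, as the review comment suggests.
def normalize(e: Expr): Expr = e match {
  case Attr(name, _) => Attr(name, ExprId(0))
  case EqualTo(l, r) => EqualTo(normalize(l), normalize(r))
  case other         => other
}

object Demo extends App {
  // The "same" filter produced by two different analysis runs: same
  // column name, different expr IDs.
  val f1 = EqualTo(Attr("p", ExprId(7)), Literal(1))
  val f2 = EqualTo(Attr("p", ExprId(42)), Literal(1))
  println(f1 == f2)                       // false: expr IDs differ
  println(normalize(f1) == normalize(f2)) // true: IDs zeroed out
}
```

   In the actual `FileScan` code the equivalent step would be a `transform` over the filter expressions that replaces each attribute's expr ID with a fixed value, since `fileIndex` and `readSchema` already guarantee the scans cover the same files and columns.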




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org