
Posted to issues@spark.apache.org by "Chungmin (Jira)" <ji...@apache.org> on 2021/10/31 13:47:00 UTC

[jira] [Created] (SPARK-37172) Push down filters having both partitioning and non-partitioning columns

Chungmin created SPARK-37172:
--------------------------------

             Summary: Push down filters having both partitioning and non-partitioning columns
                 Key: SPARK-37172
                 URL: https://issues.apache.org/jira/browse/SPARK-37172
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Chungmin


Currently, filters that reference both partitioning and non-partitioning columns are dropped when {{FileSourceScanExec}} is created, so they are never pushed down to the data source. However, there is no theoretical or practical reason to exclude such filters from {{dataFilters}}. Within any single file of a partitioned source, the values of the partitioning columns are the same for all rows. Those values can be stored physically (or reconstructed logically) alongside the statistics for non-partitioning columns to allow more powerful data skipping. A data source that doesn't know how to handle such filters can simply ignore them.
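To make the gap concrete, here is a minimal toy model in Python (not Spark code) of the classification described above. The {{Filter}} class, the {{split_filters}} helper, and the {{year}}/{{amount}} columns are hypothetical names chosen for illustration. It assumes the split works by looking only at which columns each filter references.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for a predicate: a label plus the columns it references."""
    description: str
    references: FrozenSet[str]


def split_filters(
    filters: List[Filter], partition_columns: FrozenSet[str]
) -> Tuple[List[Filter], List[Filter], List[Filter]]:
    # Partition filters reference only partitioning columns.
    partition_filters = [f for f in filters if f.references <= partition_columns]
    # Data filters reference no partitioning column at all.
    data_filters = [f for f in filters if not (f.references & partition_columns)]
    # Everything else touches both kinds of columns and lands in neither list.
    mixed = [
        f for f in filters
        if (f.references & partition_columns)
        and not (f.references <= partition_columns)
    ]
    return partition_filters, data_filters, mixed


filters = [
    Filter("year = 2021", frozenset({"year"})),
    Filter("amount > 1000", frozenset({"amount"})),
    Filter("year = 2020 OR amount > 1000", frozenset({"year", "amount"})),
]
partition_filters, data_filters, mixed = split_filters(filters, frozenset({"year"}))

print([f.description for f in partition_filters])
print([f.description for f in data_filters])
print([f.description for f in mixed])
```

In this model the third filter is the kind this issue is about: it is neither a pure partition filter nor a pure data filter, so a split based on these two lists alone never hands it to the data source.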

It's not obvious whether we can change the semantics of {{FileSourceScanExec.dataFilters}} without breaking existing code. It is passed to {{FileIndex.listFiles}} and {{FileFormat.buildReaderWithPartitionValues}}, and the contracts of those methods are not specified clearly enough to tell.

If we should not change {{dataFilters}}, we might have to add a new member variable to {{FileSourceScanExec}} (e.g. {{dataFiltersWithPartitionColumns}}) and add an overload of {{listFiles}} to the {{FileIndex}} trait, which by default delegates to the existing {{listFiles}} without using the new filters. Both {{dataFilters}} and {{dataFiltersWithPartitionColumns}} are optional; implementations can ignore the filters if they can't utilize them.
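The shape of that proposal can be sketched with a small Python toy model. This is an illustration only, not the Spark API: {{DataFile}}, {{list_files_with_mixed}}, {{StatsAwareFileIndex}}, and the per-file {{max_amount}} statistic are all hypothetical, and Python has no overloading, so the "overload" is modeled as a separately named method whose default falls back to the existing one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataFile:
    """Hypothetical file entry: partition values plus one column statistic."""
    path: str
    partition_values: Dict[str, int]
    max_amount: int  # assumed per-file maximum of the non-partitioning column


# A file-level predicate: True means the file may contain matching rows.
FilePredicate = Callable[[DataFile], bool]


class FileIndex:
    def __init__(self, files: List[DataFile]) -> None:
        self.files = list(files)

    def list_files(self, partition_filters: List[FilePredicate],
                   data_filters: List[FilePredicate]) -> List[DataFile]:
        # Existing behavior in this model: prune by partition filters only.
        return [f for f in self.files if all(p(f) for p in partition_filters)]

    def list_files_with_mixed(self, partition_filters: List[FilePredicate],
                              data_filters: List[FilePredicate],
                              mixed_filters: List[FilePredicate]) -> List[DataFile]:
        # Default: ignore the mixed filters and delegate to the existing method.
        return self.list_files(partition_filters, data_filters)


class StatsAwareFileIndex(FileIndex):
    def list_files_with_mixed(self, partition_filters, data_filters, mixed_filters):
        # An index that keeps statistics can also skip files using mixed filters.
        candidates = self.list_files(partition_filters, data_filters)
        return [f for f in candidates if all(m(f) for m in mixed_filters)]


def year_2020_or_large_amount(f: DataFile) -> bool:
    # File-level form of "year = 2020 OR amount > 1000".
    return f.partition_values["year"] == 2020 or f.max_amount > 1000


files = [
    DataFile("year=2020/a.parquet", {"year": 2020}, 500),
    DataFile("year=2021/b.parquet", {"year": 2021}, 500),
    DataFile("year=2021/c.parquet", {"year": 2021}, 5000),
]
mixed = [year_2020_or_large_amount]

default_paths = [f.path for f in FileIndex(files).list_files_with_mixed([], [], mixed)]
skipping_paths = [f.path for f in StatsAwareFileIndex(files).list_files_with_mixed([], [], mixed)]
print(default_paths)
print(skipping_paths)
```

The default implementation returns every file, so existing indexes keep their behavior, while the statistics-aware one can drop {{year=2021/b.parquet}} because neither its partition value nor its maximum {{amount}} can satisfy the filter.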



