Posted to issues@spark.apache.org by "Gengliang Wang (JIRA)" <ji...@apache.org> on 2019/05/02 23:51:00 UTC

[jira] [Created] (SPARK-27627) Make option "pathGlobFilter" as a general option for all file sources

Gengliang Wang created SPARK-27627:
--------------------------------------

             Summary: Make option "pathGlobFilter" as a general option for all file sources
                 Key: SPARK-27627
                 URL: https://issues.apache.org/jira/browse/SPARK-27627
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Gengliang Wang


Background:
The data source option "pathGlobFilter" was introduced for the binary file format in https://github.com/apache/spark/pull/24354 . It filters files by name, e.g. reading only "*.png" files when there are also "*.json" files in the same directory.

Proposal:
Make "pathGlobFilter" a general option for all file sources. The path filtering should happen during path globbing on the driver.
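
To illustrate the idea only, here is a minimal sketch in plain Python (not Spark code) of applying a glob to file names right after listing, before any scan tasks are planned. The helper name filter_paths_by_glob and the sample paths are hypothetical, and Python's fnmatch is used as a stand-in for whatever glob matcher Spark applies, so matching rules may differ in detail.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def filter_paths_by_glob(paths, path_glob_filter=None):
    """Keep paths whose file name matches the glob; keep everything if no filter is set.

    Hypothetical helper: it mimics driver-side filtering at listing time.
    """
    if path_glob_filter is None:
        return list(paths)
    # Match against the file name only, not the whole path.
    return [p for p in paths if fnmatchcase(basename(p), path_glob_filter)]


# Hypothetical directory listing with mixed file types.
listed = [
    "/data/images/cat.png",
    "/data/images/dog.png",
    "/data/images/labels.json",
]

png_only = filter_paths_by_glob(listed, "*.png")
print(png_only)  # ['/data/images/cat.png', '/data/images/dog.png']
```

With the filter applied at this stage, later planning steps only ever see the matching files.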

Motivation:
Filtering file path names inside file scan tasks on executors is inelegant, since files that do not match the filter still take part in partition planning.

Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
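
Items 1 and 2 can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's actual planner: plan_partitions is a hypothetical greedy packer, and the file names, sizes, and the 100-byte limit are made up for the example.

```python
from fnmatch import fnmatchcase


def plan_partitions(files, max_bytes):
    """Greedily pack (name, size) pairs into partitions of at most max_bytes.

    Hypothetical toy planner; Spark's real file partitioning is more involved.
    """
    partitions, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append((name, size))
        current_size += size
    if current:
        partitions.append(current)
    return partitions


def useful_bytes(partition, pattern):
    """Bytes in a partition that belong to files matching the glob."""
    return sum(size for name, size in partition if fnmatchcase(name, pattern))


# Made-up input: (file name, size in bytes).
files = [("a.png", 40), ("a.json", 60), ("b.json", 100), ("b.png", 50), ("c.png", 50)]
pattern = "*.png"

# Filtering on executors: partitions are planned over all files,
# and non-matching files are skipped only at scan time.
late = [useful_bytes(p, pattern) for p in plan_partitions(files, 100)]

# Filtering on the driver: non-matching files are dropped before planning.
matching = [(n, s) for n, s in files if fnmatchcase(n, pattern)]
early = [useful_bytes(p, pattern) for p in plan_partitions(matching, 100)]

print(late)   # [40, 0, 100]
print(early)  # [90, 50]
print(len(files), len(matching))  # 5 3
```

In this toy run, late filtering leaves one partition with no useful work at all, while early filtering yields fewer and more even partitions. The planned file count also drops from 5 to 3, which is the kind of difference that would show up in scan metrics.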



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)