Posted to issues@spark.apache.org by "Gengliang Wang (JIRA)" <ji...@apache.org> on 2019/05/02 23:51:00 UTC
[jira] [Created] (SPARK-27627) Make option "pathGlobFilter" as a
general option for all file sources
Gengliang Wang created SPARK-27627:
--------------------------------------
Summary: Make option "pathGlobFilter" as a general option for all file sources
Key: SPARK-27627
URL: https://issues.apache.org/jira/browse/SPARK-27627
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Background:
The data source option "pathGlobFilter" was introduced for the binary file format in https://github.com/apache/spark/pull/24354 . It can be used to filter file names, e.g. to read only "*.png" files when there are also "*.json" files in the same directory.
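As a rough illustration of the file-name filtering described above, here is a minimal Python sketch. It is not Spark code: it uses the standard library's fnmatch, whose pattern syntax only approximates the glob matching Spark uses, and it assumes the filter is applied to the file name (the last path component). The helper name filter_by_glob and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def filter_by_glob(paths, pattern):
    """Keep only the paths whose file name (last component) matches the glob."""
    return [p for p in paths if fnmatchcase(basename(p), pattern)]


# Hypothetical directory listing that mixes image and JSON files.
listing = [
    "/data/images/cat.png",
    "/data/images/dog.png",
    "/data/images/labels.json",
]

# Only the two .png paths survive the filter.
png_only = filter_by_glob(listing, "*.png")
```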
Proposal:
Make "pathGlobFilter" a general option for all file sources. The path filtering should happen during path globbing on the driver.
Motivation:
Filtering file path names inside the file scan tasks on executors is kind of ugly.
Impact:
1. The splitting of file partitions will be more balanced.
2. The metrics of file scan will be more accurate.
3. Users can use the option for reading other file sources.
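
The first two points can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: the greedy pack function is a hypothetical stand-in for partition planning and is not Spark's actual algorithm, and the file names and sizes are made up. It compares planning partitions over all files (with the filter applied later, inside the tasks) against planning over the already filtered files.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def pack(files, max_bytes):
    """Greedily pack (path, size) pairs into partitions of at most max_bytes."""
    partitions, current, used = [], [], 0
    for path, size in files:
        if current and used + size > max_bytes:
            partitions.append(current)
            current, used = [], 0
        current.append((path, size))
        used += size
    if current:
        partitions.append(current)
    return partitions


def matches(path, pattern):
    """Glob match on the file name only (an assumption of this sketch)."""
    return fnmatchcase(basename(path), pattern)


def useful_bytes(partition, pattern):
    """Bytes in a partition that belong to files matching the glob."""
    return sum(size for path, size in partition if matches(path, pattern))


# Hypothetical listing: (path, size in bytes).
files = [
    ("/data/a.png", 40), ("/data/b.png", 40),
    ("/data/c.json", 40), ("/data/d.json", 40),
    ("/data/e.png", 40), ("/data/f.json", 40),
]
pattern = "*.png"

# Filter inside the tasks: partitions are planned over all six files,
# so one task ends up with no matching data at all.
late = [useful_bytes(p, pattern) for p in pack(files, 80)]  # [80, 0, 40]
planned_late = sum(size for _, size in files)  # 240 bytes planned

# Filter before planning: only matching files are packed, so every task
# has real work and the planned size equals the data actually read.
matching = [f for f in files if matches(f[0], pattern)]
early = [useful_bytes(p, pattern) for p in pack(matching, 80)]  # [80, 40]
planned_early = sum(size for _, size in matching)  # 120 bytes planned
```

In this toy setup, late filtering plans three tasks for 240 bytes although only 120 bytes match, while early filtering plans two tasks whose sizes reflect the data that is actually read.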
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)