Posted to issues@spark.apache.org by "Christopher Highman (Jira)" <ji...@apache.org> on 2020/06/10 20:48:00 UTC
[jira] [Created] (SPARK-31962) Provide option to load files after a
specified date when reading from a folder path
Christopher Highman created SPARK-31962:
-------------------------------------------
Summary: Provide option to load files after a specified date when reading from a folder path
Key: SPARK-31962
URL: https://issues.apache.org/jira/browse/SPARK-31962
Project: Spark
Issue Type: Improvement
Components: SQL, Structured Streaming
Affects Versions: 3.1.0
Reporter: Christopher Highman
When using Structured Streaming with a FileDataSource, I have on a number of occasions wanted to stream from a folder containing any number of historical delta files in CSV format. When I start reading from the folder, however, I might only care about files that were created after a certain time.
{code:java}
spark.readStream
  .option("header", "true")
  .option("delimiter", "\t")
  .format("csv")
  .load("/mnt/Deltas/")
{code}
In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala], there is a method, _checkAndGlobPathIfNecessary_, which appears to create an in-memory index of files for a given path. There may be a rather clean opportunity to consider options here.
An option specifying a timestamp from which to begin globbing files would considerably reduce the complexity required of a consumer who streams from a folder path but has no interest in reading what could be thousands of irrelevant files.
One example could be "createdFileTime", accepting a UTC datetime as shown below.
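To illustrate the kind of work a consumer has to do today without such an option, here is a minimal sketch that lists a folder and keeps only files at or after a cutoff. It uses plain java.nio rather than Spark or Hadoop APIs, and the object and method names are hypothetical. Note that on file systems that do not record a creation time, the JDK may report the last-modified time instead.

```scala
import java.io.File
import java.nio.file.Files
import java.nio.file.attribute.BasicFileAttributes
import java.time.Instant

object FilesCreatedAfter {
  // Hypothetical helper, not part of Spark: returns the regular files
  // directly under `dir` whose creation time is at or after `threshold`.
  def filesCreatedAtOrAfter(dir: File, threshold: Instant): Seq[File] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
    entries.toSeq.filter { f =>
      f.isFile && {
        val attrs = Files.readAttributes(f.toPath, classOf[BasicFileAttributes])
        // "Not before" makes the comparison inclusive of the threshold.
        !attrs.creationTime().toInstant.isBefore(threshold)
      }
    }
  }
}
```

The resulting paths would then have to be passed to the reader explicitly, which is the extra bookkeeping a built-in option would avoid.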
{code:java}
spark.readStream
  .option("header", "true")
  .option("delimiter", "\t")
  .option("createdFileTime", "2020-05-01 00:00:00")
  .format("csv")
  .load("/mnt/Deltas/")
{code}
If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been created at or after the specified time in order to be consumed, whether for reading the files in general or for Structured Streaming.
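The intended semantics could be sketched roughly as follows. This is only an illustration under stated assumptions: the object and method names are hypothetical, the value is assumed to use the "yyyy-MM-dd HH:mm:ss" pattern from the example above, and it is interpreted as UTC.

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object CreatedFileTimeOption {
  // Assumed format of the proposed option value, e.g. "2020-05-01 00:00:00".
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Interprets the option value as a UTC datetime.
  def parseUtc(value: String): Instant =
    LocalDateTime.parse(value, fmt).toInstant(ZoneOffset.UTC)

  // A file is consumed only if it was created at or after the threshold.
  def shouldConsume(fileCreated: Instant, threshold: Instant): Boolean =
    !fileCreated.isBefore(threshold)
}
```

With this predicate, a file created exactly at the specified time is included, and anything earlier is skipped.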