
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/16 14:15:50 UTC

[GitHub] [spark] cchighman edited a comment on pull request #28841: [SPARK-31962][SQL] Provide option to load files after a specified date when reading from a folder path

cchighman edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-644793641

Thanks for your comments, @bart-samwel. I like your way of thinking; there are a lot of unique cases here. To provide more context, here is the scenario I'm looking to cover, which is a current issue for consumers:

- Imagine you have a massive, massive data lake with routine ETL operations.
- Every couple of hours or so, a CSV file containing perhaps 50 million events is dropped into a "Delta" folder for each dataset, and you have a lot of these datasets.
- Over time, going back a handful of years, the folder hierarchy has been rather deterministic, which seems to be a common practice, so the folder structure is _/dataset/delta/yyyy-mm-dd/dataset_guid_timestamp.csv_.
- A number of teams may need to begin consuming these files, but they are only interested in files from a particular date onward. They have no interest in anything before that date, and they hope to consume all the delta files from the specified modified date up to the current date without writing code to concatenate or embed those paths themselves.
- From this perspective, enterprise consumers would find value in being able to specify a modified timestamp to help _checkpoint_ which deltas they're interested in consuming.
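
To make the checkpoint idea concrete, here is a minimal sketch in plain Python of the selection a consumer would otherwise have to hand-write. It is not the Spark implementation and not the code in this PR; the function name `files_modified_after` and its parameters are hypothetical and chosen only for illustration. It walks a folder tree and keeps the files whose modification time is strictly later than a cutoff.

```python
import os
from datetime import datetime, timezone


def files_modified_after(root, cutoff):
    """Return sorted paths of files under `root` modified strictly after `cutoff`.

    Hypothetical helper for illustration only; `cutoff` is a datetime
    (timezone-aware is recommended so the comparison is unambiguous).
    """
    cutoff_ts = cutoff.timestamp()
    selected = []
    # Walk every date folder below the root, e.g. /dataset/delta/yyyy-mm-dd/
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Compare the file's last-modified time against the checkpoint
            if os.path.getmtime(path) > cutoff_ts:
                selected.append(path)
    return sorted(selected)
```

A consumer could call `files_modified_after("/dataset/delta", datetime(2020, 6, 1, tzinfo=timezone.utc))` and pass the resulting paths to a reader. The point of the proposed option is that this filtering would happen during file listing, so callers would not need a helper like this at all.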

Granted, this context is specific to non-streaming file data sources. I was hoping to find an equivalent in Structured Streaming, but the closest I found were _latestFirst_ and _maxFileAge_, which each have their respective use cases but do not solve this particular one. The connection to my change here is that Structured Streaming also uses InMemoryFileIndex and actively passes a parameter map to its constructor. I'll provide a PR to complete support there as well, but separately from this MVP piece.
