Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/02 06:18:11 UTC

[GitHub] [spark] cchighman commented on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652807186

@HeartSaVioR Thank you for your detailed comments. I've been digging into the PR you mentioned along with the associated Kafka Batch sources, etc. I'm leaning towards separating the PRs mainly to reduce complexity in any one PR. I have a few questions.

1.) By separating these PRs, the offset-based semantics would just apply to structured streaming, correct? Meaning, _modifiedDateFilter_ would just be used for the batch case? The Kafka batch example uses batch reading with offset-based semantics, but that seems unintuitive for the file data source use case.

2.) _startingOffsetbyTimestamp_ and the associated semantics refer to _the start point of timestamp when a query is started_. In the file stream source use case, there seems to be a distinct difference between the file _modified date_ and when the query itself is started. From what I'm gathering, because an offset represents a file itself, the language in this sense would actually relate to the modified timestamp on the file as opposed to when the query itself was started? In effect, the file stream is abstracted based on the modified time of the file itself?

3.) If a file already exists in SeenFilesMap and is subsequently modified, I'm guessing the entire file will be reconsumed, as we don't consider partial files, correct?

4.) Is there an ideal way to exclude the streaming use case from _PartitioningAwareFileIndex_?
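
To make the distinction in question 2 concrete, here is a minimal sketch of filtering on a file's modified time rather than on when a query starts. It assumes plain Python and the standard library only; it is not Spark's file source, and the names `files_modified_after` and `demo` are hypothetical helpers for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def files_modified_after(folder, threshold):
    """Return sorted paths of files in `folder` modified strictly after `threshold`.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    cutoff = threshold.timestamp()
    selected = []
    for entry in os.scandir(folder):
        # Each file is either included whole or skipped whole; there is
        # no notion of a partial file in this sketch.
        if entry.is_file() and entry.stat().st_mtime > cutoff:
            selected.append(entry.path)
    return sorted(selected)


def demo():
    """Build two files with fixed modified times and filter them by a cutoff."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, "old.csv")
        new_path = os.path.join(folder, "new.csv")
        for path in (old_path, new_path):
            with open(path, "w") as handle:
                handle.write("id\n1\n")
        # Force deterministic modified times on either side of the cutoff.
        old_ts = datetime(2020, 6, 1, tzinfo=timezone.utc).timestamp()
        new_ts = datetime(2020, 7, 1, tzinfo=timezone.utc).timestamp()
        os.utime(old_path, (old_ts, old_ts))
        os.utime(new_path, (new_ts, new_ts))
        cutoff = datetime(2020, 6, 15, tzinfo=timezone.utc)
        return [os.path.basename(p) for p in files_modified_after(folder, cutoff)]
```

In this sketch the selection depends only on each file's own modified timestamp, so the result is the same whenever `demo()` is run, which is the behavior I would expect from a date filter in the batch case.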

Thank you
