Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/02 07:48:06 UTC

[GitHub] [spark] cchighman edited a comment on pull request #28841: [SPARK-31962][SQL][SS] Provide option to load files after a specified date when reading from a folder path

cchighman edited a comment on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-652843996


   @HeartSaVioR I still think implementing this at the _PartitioningAwareFileIndex_ level makes a lot of sense and avoids the complexities you mentioned above.  Consider a stream that starts from a file source holding hundreds of thousands of files, many with the same timestamp, where you want processing to begin at a specified point.  _PartitioningAwareFileIndex_ is evaluated before any other structured streaming option is considered during _fetchMaxOffset_.  I believe _modifiedDateFilter_ is a good way to determine where you want to start streaming from, and it is limited to that use case.  I believe the offset semantics still apply completely, but they would apply to the files returned from _InMemoryFileIndex_ or _MetadataLogFileIndex_.
   
   This option is intuitive for the consumer because, for any given path, they can explicitly set the population of files considered for structured streaming.  `allFiles` in `fetchMaxOffset` would then return the starting point that represents the earliest/latest offsets.  Do you see the difference?
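
   As a rough sketch of what I mean (plain Scala, not Spark code: `FileEntry`, `filesModifiedAfter`, and `timestampBounds` are hypothetical names, and the strictly-after comparison is an assumption on my part), the filter fixes the population first, and the earliest/latest bounds are then derived only from what survives:

   ```scala
   // Hypothetical stand-in for one entry of a file listing; not Spark's FileStatus.
   final case class FileEntry(path: String, modifiedMs: Long)

   object ModifiedDateFilterSketch {
     // Keep only files modified strictly after the threshold (assumed semantics).
     def filesModifiedAfter(all: Seq[FileEntry], thresholdMs: Long): Seq[FileEntry] =
       all.filter(_.modifiedMs > thresholdMs)

     // Earliest and latest modification times of the surviving population,
     // or None when nothing passed the filter.
     def timestampBounds(files: Seq[FileEntry]): Option[(Long, Long)] =
       if (files.isEmpty) None
       else Some((files.map(_.modifiedMs).min, files.map(_.modifiedMs).max))
   }
   ```

   Anything downstream only ever sees the filtered sequence, so files older than the threshold never enter the offset bookkeeping at all.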
   
   Granted, I can see how this could be implemented in _FileStreamSource_.  Still, it seems the problems you're describing shouldn't affect how we would ultimately filter files based on parameters that seek to bound what is currently a more unbounded problem.  I'm asking only to understand whether it is as simple as adding an extra layer of filtering when the options are specified.
   
   It seems most of the consideration would go into this area, in relation to the _seenFilesMap_ and _metadataLogCurrentOffset_:
   `private def fetchMaxOffset(limit: ReadLimit): FileStreamSourceOffset = synchronized {`
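
   To make the "extra layer of filtering" concrete, here is a simplified model, not the actual _FileStreamSource_ code. `ListedFile`, `OffsetSketch`, `modifiedAfterMs`, and `maxFiles` are hypothetical; `seen` and `currentOffset` only stand in for the seen-files map and the metadata log offset, and purging and metadata log writes are left out entirely:

   ```scala
   import scala.collection.mutable

   // Hypothetical listing entry; not Spark's FileStatus.
   final case class ListedFile(path: String, modifiedMs: Long)

   // Simplified model of the offset bookkeeping; assumptions are noted inline.
   final class OffsetSketch(modifiedAfterMs: Option[Long]) {
     private val seen = mutable.Set.empty[String] // stands in for the seen-files map
     private var currentOffset = -1L              // stands in for the metadata log offset

     def fetchMaxOffset(listing: Seq[ListedFile], maxFiles: Int): Long = synchronized {
       // Extra layer: applied only when the option is specified.
       val eligible = modifiedAfterMs match {
         case Some(t) => listing.filter(_.modifiedMs > t)
         case None    => listing
       }
       // Existing-style bookkeeping: skip files already seen, oldest first, capped.
       val batch = eligible
         .filterNot(f => seen.contains(f.path))
         .sortBy(_.modifiedMs)
         .take(maxFiles)
       if (batch.nonEmpty) {
         batch.foreach(f => seen += f.path)
         currentOffset += 1 // one new batch, one new offset
       }
       currentOffset
     }
   }
   ```

   In this model the date filter runs before the seen-files check, so the two concerns stay separate. Whether the real implementation can keep them that cleanly separated is exactly what I'm asking.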
   

