
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/16 13:16:29 UTC

[GitHub] [spark] gaborgsomogyi edited a comment on pull request #28422: [SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files

gaborgsomogyi edited a comment on pull request #28422:
URL: https://github.com/apache/spark/pull/28422#issuecomment-644757113


   I agree, the confusion basically comes from `latestFirst`.
   > But then should we really open the possibility to trace back older files?
   
   I see a use case where it's useful: the query has fallen behind and files have piled up. The query must keep up with the incoming data but must also process the older files as a side job.
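
   To make that use case concrete, here is a toy model in plain Python. It is not Spark's implementation. It assumes a per-trigger cap in the spirit of `maxFilesPerTrigger`, and `run_triggers`, the integer timestamps, and the arrival pattern are all made up for illustration. Each trigger picks the newest pending files first, so fresh data is served right away while the old backlog drains in whatever capacity is left.

```python
def run_triggers(backlog, arrivals, max_files_per_trigger):
    """Toy model of newest-first file selection with a per-trigger cap.

    backlog: timestamps of files that piled up before the first trigger.
    arrivals: one list of new file timestamps per trigger.
    Returns (batches, remaining), where remaining is the unprocessed backlog.
    """
    pending = set(backlog)
    batches = []
    for new_files in arrivals:
        pending.update(new_files)
        # Newest files first, capped at the per-trigger limit (hypothetical
        # stand-in for a latestFirst + maxFilesPerTrigger style policy).
        batch = sorted(pending, reverse=True)[:max_files_per_trigger]
        pending.difference_update(batch)
        batches.append(batch)
    return batches, sorted(pending)


# Four old files are waiting; two new files arrive during the first two triggers.
batches, remaining = run_triggers(
    backlog=[1, 2, 3, 4],
    arrivals=[[5], [6], [], []],
    max_files_per_trigger=2,
)
for i, batch in enumerate(batches):
    print(f"trigger {i}: {batch}")
print("still pending:", remaining)
```

   In this sketch every new file is picked up in the trigger it arrives in, and the older files are consumed only with the leftover capacity, which is the "side job" behaviour described above.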
   
   > Would we just simply do the thing we do with Kafka's "latest" option, which only affects the first batch and no-op in further batches?
   
   I'm not sure how exactly `latestFirst` should behave then. Should it create a single gigantic micro-batch that processes all the data and then switch back to normal mode?




