Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/15 02:20:55 UTC

[GitHub] mikedias commented on issue #23782: [SPARK-26875][SS] Add an option on FileStreamSource to include modified files

URL: https://github.com/apache/spark/pull/23782#issuecomment-463881234
 
 
   I think the archive/delete race condition can be addressed by checking the file's timestamp before archiving or deleting it. If it matches the timestamp of the version that was processed, proceed; if not, skip. This extra check would only be needed when `includeModifiedFiles` is enabled, since that option signals that files can be overwritten.
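   
   As a rough sketch of the check I have in mind (not the actual `FileStreamSource` code): `ProcessedFile`, `SafeCleanup` and `deleteIfUnchanged` are hypothetical names, and the example uses local `java.nio.file` calls instead of the Hadoop `FileSystem` API just to stay self-contained.

```scala
import java.nio.file.{Files, Path}

// Hypothetical record of what the source saw when it processed the file.
final case class ProcessedFile(path: Path, modifiedAtMillis: Long)

object SafeCleanup {
  // Deletes the file only if its modification time still matches the one
  // recorded at processing time. Returns true if the file was deleted and
  // false if it was skipped (missing, or modified since it was processed).
  def deleteIfUnchanged(entry: ProcessedFile): Boolean = {
    if (!Files.exists(entry.path)) {
      false
    } else {
      val currentMillis = Files.getLastModifiedTime(entry.path).toMillis
      if (currentMillis == entry.modifiedAtMillis) {
        Files.delete(entry.path)
        true
      } else {
        // The file was overwritten after processing; leave it in place so
        // the newer version can still be picked up.
        false
      }
    }
  }
}
```

   The same comparison would apply before archiving. Note that a small window remains between reading the timestamp and deleting the file, so this narrows the race rather than eliminating it.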
   
   As for end users' expectations: if they upload a file and it gets deleted/archived, they probably expect a new file with the same name to be processed as well when it is uploaded again. Not processing the file is unintuitive, and it is hard to debug, because users would need to know which file names were processed in the past. "Why is my file not getting processed?" could become a frequently asked question.
   
   I totally understand the implications of files being unintentionally modified, as @HeartSaVioR rightly pointed out, and that's why the option is `false` by default. But I do think we need to provide an option to cover more use cases and give a solution to users who understand that their files can be overwritten.
   
