Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/14 12:16:11 UTC

[GitHub] gaborgsomogyi commented on issue #23782: [SPARK-26875][SQL] Add an option on FileStreamSource to include modified files

URL: https://github.com/apache/spark/pull/23782#issuecomment-463606075
 
 
   The question is: why would a producer generate the same file again?
   
   From the data source perspective I mainly see two approaches that are actually implemented:
   * Atomic move into a directory (several engines do that, but Spark does it differently because, for example, S3 implements a move as a copy)
   * Write the file in a non-atomic way, but record the filename in a metadata file once the file has been properly written. Here the available files come from the metadata and all others are considered junk.
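
A minimal sketch of those two producer-side approaches, assuming a local POSIX filesystem where the staging and watched directories sit on the same volume. All function, directory, and manifest names here are hypothetical and are not taken from Spark or from this patch.

```python
import json
import os
import tempfile


def publish_atomically(staging_dir: str, watched_dir: str, name: str, data: bytes) -> str:
    """Approach 1: write the file fully in a staging directory, then rename it."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(watched_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before publishing
    final_path = os.path.join(watched_dir, name)
    # os.replace is atomic on POSIX when source and target share a filesystem,
    # so a reader listing watched_dir sees either no file or the complete file.
    os.replace(tmp_path, final_path)
    return final_path


def publish_with_manifest(data_dir: str, manifest_name: str, name: str, data: bytes) -> str:
    """Approach 2: write non-atomically, then record the finished file in a manifest."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, name)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Only after the data file is complete is its name appended to the manifest.
    with open(os.path.join(data_dir, manifest_name), "a", encoding="utf-8") as m:
        m.write(json.dumps({"file": name}) + "\n")
    return path


def committed_files(data_dir: str, manifest_name: str) -> list:
    """Reader side of approach 2: trust only names listed in the manifest."""
    manifest_path = os.path.join(data_dir, manifest_name)
    if not os.path.exists(manifest_path):
        return []
    with open(manifest_path, "r", encoding="utf-8") as m:
        return [json.loads(line)["file"] for line in m if line.strip()]
```

In both cases a finished file is never rewritten under the same name, so the consumer has no need to look at modification times. The manifest append in this sketch is itself not atomic; a real producer would have to protect that write as well.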
   
   +1 @HeartSaVioR, I'm worried about this patch as well.
   * Let's take any filesystem, append to a file 10k times and then close it. Is it guaranteed that the timestamp is updated only after the last append and that no internal OS flush touches it? If there is no such guarantee, the SQL engine will throw random exceptions because, for instance, only half of a row may have been written out.
   * Let's take S3 as another example. Even with S3Guard, when a file is modified the metadata shows that the file is there, but because of its read-after-write consistency the file content can be:
     * The original one
     * The new one
     * Empty file
   This change may make such behaviour more likely.
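
To illustrate the first concern, here is a small sketch of a reader picking up a file in the middle of an append. It assumes a local filesystem and a newline-delimited JSON file; `try_parse_rows` and `demo` are hypothetical helpers, and the reader is a plain Python loop, not Spark's file source.

```python
import json
import os
import tempfile


def try_parse_rows(path):
    """Parse newline-delimited JSON; return (rows parsed so far, first error or None)."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError as exc:
                return rows, exc
    return rows, None


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "events.json")
        record = json.dumps({"id": 2, "value": "second"})
        half = len(record) // 2
        with open(path, "w", encoding="utf-8") as out:
            out.write(json.dumps({"id": 1, "value": "first"}) + "\n")
            out.write(record[:half])  # only half of the second row so far
            out.flush()               # a flush makes the partial row visible
            mid_rows, mid_error = try_parse_rows(path)
            out.write(record[half:] + "\n")
        final_rows, final_error = try_parse_rows(path)
    return mid_rows, mid_error, final_rows, final_error
```

The mid-write read returns one good row plus a parse error, while the read after close returns both rows. A consumer that reacts to every modification of the file has no way to tell which of the two states it is looking at.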
   
   All in all, with my current understanding I would change the producer instead.
   
