Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/20 02:55:31 UTC

[GitHub] HeartSaVioR opened a new pull request #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata

URL: https://github.com/apache/spark/pull/23840
 
 
   ## What changes were proposed in this pull request?
   
   This patch proposes adding an option to the file stream sink to retain only the last batch in the file log (metadata). This helps when a query outputs many files per batch, in which case compacting the metadata files into one can incur non-trivial overhead.
   
   Please refer to [the comment in the JIRA issue](https://issues.apache.org/jira/browse/SPARK-24295?focusedCommentId=16545577&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16545577) for more details on the overhead that the current file stream sink metadata and the file stream source metadata file index can impose on high-volume, long-running queries.
   
   Because this patch purges old batches and retains only the last batch in the metadata, the metadata file index can no longer construct the full list of files once this option is enabled, and as a result the file (stream) source cannot read the output directory. To make the output directory readable again, this patch also proposes adding an option to the file (stream) source that ignores metadata information when reading a directory. With this option, end users can also choose whichever is faster, the in-memory file index or the metadata file index, when the metadata file grows very large.
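
   The interaction between the two options can be illustrated with a small, self-contained toy model. This is only a sketch and not Spark code: `SinkFile`, `ToySinkLog`, `ToySinkLogDemo`, and the `retainOnlyLastBatch` flag are hypothetical names invented for illustration, and the real sink log and file index are considerably more involved. It shows why a log that keeps only the last batch cannot reproduce the full file list, while a plain directory listing still can.

   ```scala
   // Toy model only: NOT Spark's file stream sink log. All names are hypothetical.
   final case class SinkFile(path: String, batchId: Long)

   final class ToySinkLog(retainOnlyLastBatch: Boolean) {
     // batchId -> files recorded for that batch
     private var batches = Map.empty[Long, Seq[SinkFile]]

     def add(batchId: Long, paths: Seq[String]): Unit = {
       val entry = paths.map(p => SinkFile(p, batchId))
       batches =
         if (retainOnlyLastBatch) Map(batchId -> entry) // purge all older batches
         else batches + (batchId -> entry)              // keep the full history
     }

     // What a metadata-based file index could reconstruct from the log.
     def allFiles: Seq[String] =
       batches.toSeq.sortBy(_._1).flatMap(_._2.map(_.path))
   }

   object ToySinkLogDemo {
     // Returns (files known to the log, files found by listing the directory).
     def run(retainOnlyLastBatch: Boolean): (Seq[String], Seq[String]) = {
       val log = new ToySinkLog(retainOnlyLastBatch)
       var directory = Vector.empty[String] // stands in for the output directory
       for (batchId <- 0L to 2L) {
         val written = Seq(s"part-$batchId-a", s"part-$batchId-b")
         directory ++= written
         log.add(batchId, written)
       }
       (log.allFiles, directory)
     }

     def main(args: Array[String]): Unit = {
       val (fromLog, fromListing) = run(retainOnlyLastBatch = true)
       println(s"from metadata log:      $fromLog")
       println(s"from directory listing: $fromListing")
     }
   }
   ```

   In this model, with `retainOnlyLastBatch = true` the log knows only the two files of the final batch, whereas the directory listing contains all six. A reader that ignores the metadata and lists the directory still sees every file, which is the motivation for the source-side option described above.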
   
   ## How was this patch tested?
   
   Added unit tests.
