Posted to issues@spark.apache.org by "Shixiong Zhu (Jira)" <ji...@apache.org> on 2020/05/22 23:51:00 UTC

[jira] [Resolved] (SPARK-30915) FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID

     [ https://issues.apache.org/jira/browse/SPARK-30915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-30915.
----------------------------------
    Fix Version/s: 3.1.0
         Assignee: Jungtaek Lim
       Resolution: Fixed

> FileStreamSinkLog: Avoid reading the metadata log file when finding the latest batch ID
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-30915
>                 URL: https://issues.apache.org/jira/browse/SPARK-30915
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 3.1.0
>
>
> FileStreamSink.addBatch checks the latest batch ID before writing outputs, so that it can skip a batch that was already committed.
> Comparing the current batch ID with the latest batch ID is valid, but getLatest() is designed to return both the batch ID and the content, which means the latest metadata log file is read and deserialized. This introduces heavy latency when the latest batch is a compacted batch.
> We could instead just find the metadata log file for the latest batch ID and do only the minimal check, without reading its content.
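
The idea in the description can be illustrated with a minimal sketch. This is not the Spark implementation. It assumes a metadata log directory where each batch is a file named by its batch ID, with compacted batches carrying a ".compact" suffix. The function names (parse_batch_id, latest_batch_id, should_skip_batch) are hypothetical and exist only for this illustration. The latest batch ID is derived from file names alone, so no log file is opened or deserialized.

```python
import os
import tempfile
from typing import Optional

# Assumption for this sketch: compacted batch files end with this suffix.
COMPACT_SUFFIX = ".compact"


def parse_batch_id(file_name: str) -> Optional[int]:
    """Return the batch ID encoded in a file name, or None for non-batch files."""
    if file_name.endswith(COMPACT_SUFFIX):
        stem = file_name[: -len(COMPACT_SUFFIX)]
    else:
        stem = file_name
    # Only plain ASCII digit names count as batch files (skips temp files etc.).
    if stem.isascii() and stem.isdigit():
        return int(stem)
    return None


def latest_batch_id(log_dir: str) -> Optional[int]:
    """Find the highest batch ID by listing the directory; file content is never read."""
    ids = [b for b in map(parse_batch_id, os.listdir(log_dir)) if b is not None]
    return max(ids, default=None)


def should_skip_batch(log_dir: str, batch_id: int) -> bool:
    """A batch is skipped when it is not newer than the latest committed batch."""
    latest = latest_batch_id(log_dir)
    return latest is not None and batch_id <= latest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        # Create empty placeholder files; a real compacted file could be very large.
        for name in ["0", "1", "2.compact"]:
            open(os.path.join(log_dir, name), "w").close()
        print(latest_batch_id(log_dir))
        print(should_skip_batch(log_dir, 2), should_skip_batch(log_dir, 3))
```

The cost of the check in this sketch depends on the number of files in the directory rather than on the size of the latest (possibly compacted) log file, which is the latency concern raised in the description.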



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
