Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/05/30 09:06:00 UTC

[jira] [Assigned] (SPARK-35565) Add a config for ignoring metadata directory of file stream sink

     [ https://issues.apache.org/jira/browse/SPARK-35565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-35565:
------------------------------------

    Assignee: Apache Spark  (was: L. C. Hsieh)

> Add a config for ignoring metadata directory of file stream sink
> ----------------------------------------------------------------
>
>                 Key: SPARK-35565
>                 URL: https://issues.apache.org/jira/browse/SPARK-35565
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.2.0
>            Reporter: L. C. Hsieh
>            Assignee: Apache Spark
>            Priority: Major
>
> FileStreamSink produces a metadata directory that logs the output files of each micro-batch. When we read from the output path, Spark consults this metadata and ignores any files not listed in the log.
> Normally this works well. But for some use cases, we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot use the previous metadata. If we also create new metadata, then when we read the output path later, Spark reads only the files listed in the new metadata. The files written before we switched to the new checkpoint and metadata are ignored.
> Although we could write to a different output directory every time, that is a bad idea because it produces many directories unnecessarily.
> It seems we need a config for ignoring the metadata of FileStreamSink when reading the output path.
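
The behavior described above can be illustrated with a small, Spark-free model. This is only a sketch under stated assumptions: the one-JSON-file-per-batch log format, the helper names (write_batch, list_files, demo) and the ignore_metadata flag are hypothetical and are not Spark's implementation or its config name. Only the _spark_metadata directory name is taken from FileStreamSink itself.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Tuple

# Directory name FileStreamSink uses for its log of committed files.
METADATA_DIR = "_spark_metadata"


def write_batch(output: Path, batch_id: int, file_names: List[str]) -> None:
    """Write data files and record them in a per-batch log entry.

    The log format here (one JSON list per batch) is a simplification,
    not the format Spark actually writes.
    """
    log_dir = output / METADATA_DIR
    log_dir.mkdir(parents=True, exist_ok=True)
    for name in file_names:
        (output / name).write_text("data")
    (log_dir / str(batch_id)).write_text(json.dumps(file_names))


def list_files(output: Path, ignore_metadata: bool = False) -> List[str]:
    """Return the data files a reader would see in the output path.

    ignore_metadata is a hypothetical flag standing in for the config
    this issue proposes.
    """
    log_dir = output / METADATA_DIR
    if log_dir.is_dir() and not ignore_metadata:
        # Trust the log: only files listed in it are visible.
        listed: List[str] = []
        for entry in sorted(log_dir.iterdir(), key=lambda p: int(p.name)):
            listed.extend(json.loads(entry.read_text()))
        return sorted(listed)
    # No log, or log ignored: fall back to a plain directory listing.
    return sorted(p.name for p in output.iterdir() if p.is_file())


def demo() -> Tuple[List[str], List[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp)
        # Files left by the old query, whose metadata is no longer present.
        for name in ["old-0.parquet", "old-1.parquet"]:
            (output / name).write_text("data")
        # The restarted query starts a fresh log that lists only its own file.
        write_batch(output, 0, ["new-0.parquet"])
        return list_files(output), list_files(output, ignore_metadata=True)


if __name__ == "__main__":
    with_log, without_log = demo()
    print("using metadata:   ", with_log)
    print("ignoring metadata:", without_log)
```

In this model, a read that trusts the log sees only the file written after the restart, while a read that ignores the log falls back to a directory listing and also sees the older files. That is the gap the proposed config is meant to close.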


