Posted to issues@spark.apache.org by "Alfredo Gimenez (JIRA)" <ji...@apache.org> on 2019/02/25 18:58:00 UTC

[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

    [ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777197#comment-16777197 ] 

Alfredo Gimenez edited comment on SPARK-24295 at 2/25/19 6:57 PM:
------------------------------------------------------------------

We've run into exactly the same issue; I uploaded a minimal reproducible example showing the continuously growing metadata compaction files. This is especially a problem for streaming jobs that rely on checkpointing, as we cannot simply purge the metadata files and restart: the checkpointing mechanism depends on the metadata. Our current workaround is to manually grab the last checkpointed offsets, purge both the checkpoints and the metadata, and set "startingOffsets" to the offsets we grabbed. This is obviously not ideal, as it relies on the current serialized checkpoint format, which can change between Spark versions. It also risks losing checkpoint data if a Spark job fails before writing a new checkpoint file.

[~kabhwan] taking a look at your PR now, thanks!

Is there another reliable workaround for this setup?


was (Author: alfredo-gimenez-bv):
We've run into exactly the same issue; I uploaded a minimal reproducible example showing the continuously growing metadata compaction files. This is especially a problem for streaming jobs that rely on checkpointing, as we cannot simply purge the metadata files and restart: the checkpointing mechanism depends on the metadata. Our current workaround is to manually grab the last checkpointed offsets, purge both the checkpoints and the metadata, and set "startingOffsets" to the offsets we grabbed. This is obviously not ideal, as it relies on the current serialized checkpoint format, which can change between Spark versions. It also risks losing checkpoint data if a Spark job fails before writing a new checkpoint file.

[~kabhwan] can you point us to your PR? 

Is there another reliable workaround for this setup?

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24295
>                 URL: https://issues.apache.org/jira/browse/SPARK-24295
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Iqbal Singh
>            Priority: Major
>         Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file after each defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing slowness when reading the data back, since Spark defaults to reading from the "_spark_metadata" dir.
> We need functionality to purge old data from the compact file.
>  
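For context on why the compact file only ever grows: the compaction cadence lives in Spark's CompactibleFileStreamLog, where every compactInterval-th batch (controlled by spark.sql.streaming.fileSink.log.compactInterval, default 10) rewrites all prior entries plus the new batch into a ".compact" file, so each compact file strictly contains the previous one. A minimal sketch of the batch-selection rule as I understand it (my own function name, not Spark's code):

```python
def is_compaction_batch(batch_id, compact_interval=10):
    """True if this 0-indexed batch writes a ".compact" file.

    Mirrors the rule in Spark's CompactibleFileStreamLog: with the
    default interval of 10, batches 9, 19, 29, ... compact, and each
    compact file re-serializes every entry from all earlier batches,
    which is why the file grows without bound absent a purge mechanism.
    """
    return (batch_id + 1) % compact_interval == 0
```

So with the default interval, batch 9 writes "9.compact", batch 19 writes "19.compact" containing everything in "9.compact" plus ten more batches of entries, and so on for the life of the query.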



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org