Posted to user@spark.apache.org by "Ahn, Daniel" <da...@optum.com.INVALID> on 2020/04/16 00:19:24 UTC

[Structured Streaming] Checkpoint file compact file grows big

Are Spark Structured Streaming checkpoint files expected to grow over time indefinitely? Is there a recommended way to safely age off old checkpoint data?

Currently we have a Spark Structured Streaming process reading from Kafka and writing to an HDFS sink, with checkpointing enabled and writing to a location on HDFS. This streaming application has been running for 4 months, and over time we have noticed that on every 10th job within the application there is about a 5-minute delay between when one job finishes and the next job starts, which we have attributed to the checkpoint compaction process. At this point the .compact file that is written is about 2GB in size, and its contents show that it keeps track of files processed since the very origin of the streaming application.
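
For reference, the job is essentially the following sketch (the broker, topic, and paths are placeholder values rather than our real ones):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("kafka-to-hdfs")
  .getOrCreate()

// Read from Kafka (placeholder broker and topic).
val kafka = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

// Write to an HDFS file sink with checkpointing enabled (placeholder paths).
val query = kafka
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()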

This issue can be reproduced with any Spark Structured Streaming process that writes checkpoint files.

Is the best approach for handling the growth of these files to simply delete the latest .compact file within the checkpoint directory, and are there any associated risks with doing so?



Re: [Structured Streaming] Checkpoint file compact file grows big

Posted by Kelvin Qin <qy...@126.com>.


See: http://spark.apache.org/docs/2.3.1/streaming-programming-guide.html#checkpointing
"Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may cause an increase in the processing time of those batches where RDDs get checkpointed."


As far as I know, the official documentation states that the checkpoint data of a Spark Streaming application will continue to grow over time.
Meanwhile, data/RDD checkpointing is necessary even for basic functioning when stateful transformations are used.
So, for applications that require long-term aggregation, I choose to use third-party caches in production, such as Redis. Maybe you can also try Alluxio.
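
For reference, enabling checkpointing in the API covered by the guide linked above is just a matter of pointing the streaming context at a reliable directory; a minimal sketch, with placeholder values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal DStream checkpointing sketch; the application name, batch interval,
// and HDFS path are placeholders.
val conf = new SparkConf().setAppName("checkpoint-example")
val ssc = new StreamingContext(conf, Seconds(10))

// Required when stateful transformations are used (see the passage quoted above).
ssc.checkpoint("hdfs:///checkpoints/dstream-app")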




Wishes!

Re: [Structured Streaming] Checkpoint file compact file grows big

Posted by Jungtaek Lim <ka...@gmail.com>.
Deleting the latest .compact file would lose the exactly-once guarantee and
lead Spark to fail to read from the output directory. If you're reading the
output directory from something other than Spark, then the metadata on the
output directory doesn't matter, but there's also no exactly-once
(exactly-once is achieved by leveraging the metadata, which only Spark can
read).
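
To make that concrete, here is a rough sketch (the path is a placeholder,
and spark is an existing SparkSession):

// When Spark reads a file sink's output directory, it finds the
// <path>/_spark_metadata log and lists only the files the streaming query
// recorded as committed - that is where exactly-once comes from.
val committed = spark.read.parquet("hdfs:///data/events")

// A non-Spark reader (or anything that just lists files on HDFS) sees every
// file ever written, including leftovers from failed/retried batches, so it
// gets no exactly-once guarantee. Deleting the latest .compact file breaks
// the metadata log, so the Spark-side read above would start failing.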

Btw, what you've encountered is one of the known issues with the file stream
sink - there are two different JIRA issues filed for the same problem so far
(reported by end users):

https://issues.apache.org/jira/browse/SPARK-24295
https://issues.apache.org/jira/browse/SPARK-29995

I've proposed adding retention for output files in the file stream sink, but
it hasn't gotten much love. (That means it's not guaranteed to be addressed.)

https://issues.apache.org/jira/browse/SPARK-27188

Given that the patch is stale, I'm planning to rework it against the latest
master again soon.

Btw, I've also proposed other improvements to help address the latency
issues in the file stream source & file stream sink, but these haven't gotten
much love from committers either (no guarantee they'll be addressed):

https://issues.apache.org/jira/browse/SPARK-30804
https://issues.apache.org/jira/browse/SPARK-30866
https://issues.apache.org/jira/browse/SPARK-30900
https://issues.apache.org/jira/browse/SPARK-30915
https://issues.apache.org/jira/browse/SPARK-30946

SPARK-30946 is closely related to the issue - it makes the checkpoint file
much smaller and also greatly shortens the elapsed time to compact. The
benefit depends on the compression ratio, but it could make compaction about
5 times faster and the file about 80% smaller (1/5 of the original), which
would push back the point where you hit the bad state, even without a TTL.
Say, if you reached the bad state in 2 weeks, the patch would delay that by
about 8 more weeks (10 weeks to reach the bad state).

That said, it doesn't completely remove the need for a TTL, but it opens up
the chance to run with a longer TTL without encountering the bad state.

If you're adventurous, you can apply these patches to your version of Spark
and see whether they help.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

