Posted to user@spark.apache.org by Pappu Yadav <py...@gmail.com> on 2020/04/21 11:23:14 UTC

Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0

Hi Team,

While running Spark Structured Streaming, below are some findings.

   1. FileStreamSourceLog is responsible for maintaining the list of input
   source files.
   2. Spark Structured Streaming deletes expired log files on the basis of
   *spark.sql.streaming.fileSource.log.deletion* and
   *spark.sql.streaming.minBatchesToRetain* (see the sketch after this list).
   3. But while compacting logs, Spark writes the complete list of files the
   stream has seen so far into one single .compact file in HDFS.
   4. Over the course of time this compact file has grown to around 2GB-5GB
   in HDFS, which delays the creation of the compact file after every 10th
   batch and also increases job restart time (a listing sketch for checking
   this is included below).
   5. Why is Spark logging files which are already deleted from the system?
   There should be some configurable timeout so that Spark can skip writing
   the expired list of input files while creating the compact file.
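
For reference, here is a minimal sketch (Scala) of where the settings from
point 2 are applied; the app name, paths, and values below are hypothetical
examples, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("file-source-log-demo")               // hypothetical app name
  // Let Spark delete expired entries from the file source metadata log.
  .config("spark.sql.streaming.fileSource.log.deletion", "true")
  // Keep at least this many batches of metadata around (default 100).
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  // Compact the file source log every N batches (default 10).
  .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
  .getOrCreate()

val df = spark.readStream
  .format("text")
  .load("hdfs:///data/input")                    // hypothetical input path

val query = df.writeStream
  .format("parquet")
  // The file stream source log is kept under this checkpoint location.
  .option("checkpointLocation", "hdfs:///checkpoints/demo")
  .option("path", "hdfs:///data/output")         // hypothetical output path
  .start()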

Also, kindly let me know if I missed something and whether there is already
some configuration to handle this.
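
To watch the growth described in point 4, the .compact files can be listed
under the source's metadata directory, which lives at
<checkpoint>/sources/<source id>/. A minimal sketch, assuming a hypothetical
checkpoint path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// The file stream source keeps its log under <checkpoint>/sources/<id>/.
val sourceLogDir = new Path("hdfs:///checkpoints/demo/sources/0")
val fs = sourceLogDir.getFileSystem(new Configuration())

fs.listStatus(sourceLogDir)
  .filter(_.getPath.getName.endsWith(".compact"))
  .sortBy(_.getPath.getName)
  .foreach { s =>
    // Print each compact file and its size in MB.
    println(f"${s.getPath.getName}%-16s ${s.getLen / (1024.0 * 1024.0)}%10.2f MB")
  }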

Regards
Pappu Yadav

Re: Spark Structured Streaming | FileStreamSourceLog not deleting list of input files | Spark 2.4.0

Posted by Jungtaek Lim <ka...@gmail.com>.
You're hitting an existing issue:
https://issues.apache.org/jira/browse/SPARK-17604. While there's no active
PR to address it, I've been planning to take a look sooner rather than later.

Btw, you may also want to take a look at my previous mail - the topic of
that thread was file stream sink metadata growing bigger, but it's
basically the same issue, so you may get some information from there.
(tl;dr: I have a bunch of PRs addressing multiple issues in the file
stream source and sink; they're just lacking some love.)

https://lists.apache.org/thread.html/rb4ebf1d20d13db0a78694e8d301e51c326f803cb86fc1a1f66f2ae7e%40%3Cuser.spark.apache.org%3E
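
On the sink side, the analogous metadata lives under the _spark_metadata
directory inside the output path, so the listing sketch from your mail
applies there as well (e.g. with a hypothetical path like
hdfs:///data/output/_spark_metadata/).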

Thanks,
Jungtaek Lim (HeartSaVioR)
