Posted to user@flink.apache.org by "Papadopoulos, Konstantinos" <Ko...@IRIworldwide.com> on 2019/07/10 10:51:25 UTC

Disk full problem faced due to the Flink tmp directory contents

Hi all,

We are developing several batch processing applications using the DataSet API of Apache Flink.
We are currently facing an issue in one of our production environments: its disk usage has increased enormously. After a quick investigation, we found that the /tmp/flink-io-{} directory (under the parent directory of the Apache Flink deployment) contains files totaling more than 1 TB, and we need to delete them regularly to keep the system working properly. At first sight, deleting these temp files has no noticeable impact. So, I need your help to answer the following questions:

  *   What kind of data is stored in the aforementioned directory?
  *   Why do these files grow to such an enormous size?
  *   How can we limit the size of the data written to this directory?
  *   Is there any way to delete such files automatically once they are no longer needed?

Thanks in advance for your help,
Konstantinos


Re: Disk full problem faced due to the Flink tmp directory contents

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

AFAIK Flink should remove temporary files automatically once they are no
longer needed.
However, I'm not 100% sure there are no corner cases, e.g., when a TM
crashes.

In general, it is a good idea to properly configure the directories that
Flink uses for spilling, logging, blob storage, etc.
As Ken said, you can provide multiple directories so that temp data is
written to multiple disks (using their combined IO rate).
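
For example, in flink-conf.yaml (just a sketch: the exact key depends on
your Flink version, recent releases use io.tmp.dirs while older ones used
taskmanager.tmp.dirs, and the paths are placeholders):

    # Spread spill files across two physical disks (paths are examples)
    io.tmp.dirs: /disk1/flink-tmp,/disk2/flink-tmp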

Best, Fabian

On Wed, Jul 10, 2019 at 15:16, Ken Krugler <kkrugler_lists@transpac.com> wrote:

> [Ken's reply, quoted here in full, trimmed; see his message below.]

Re: Disk full problem faced due to the Flink tmp directory contents

Posted by Ken Krugler <kk...@transpac.com>.
Hi Konstantinos,

Typically the data that you are seeing is from records being spilled to disk during groupBy/join operations, where the size of one (or multiple, for the join case) data sets exceeds what will fit in memory.
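
As a rough illustration (the class name, paths, and field types below are
all made up), a plain DataSet join like this will start writing to the temp
directories once the inputs no longer fit in managed memory:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class JoinSpillSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Two large inputs; the paths are placeholders.
            DataSet<Tuple2<Long, String>> left = env
                .readCsvFile("hdfs:///data/left.csv")
                .types(Long.class, String.class);
            DataSet<Tuple2<Long, String>> right = env
                .readCsvFile("hdfs:///data/right.csv")
                .types(Long.class, String.class);

            // If the inputs exceed Flink's managed memory, the sort/hash
            // phases of this join spill runs/partitions into the temp dirs.
            left.join(right)
                .where(0)
                .equalTo(0)
                .projectFirst(0, 1)
                .projectSecond(1)
                .writeAsCsv("hdfs:///data/joined.csv");

            env.execute("join-spill-sketch");
        }
    }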

And yes, these files can get big, e.g. as big as the sum of your input data sizes.

If you split your data flow (one data set being consumed by multiple operators), then the summed temp size can be multiplied accordingly.

You can specify multiple disks to use as temp directories (comma-separated list in Flink config), so that’s one way to avoid a single disk becoming too full.

You can take a single workflow and break it into multiple pieces that you run sequentially, as that can reduce the high water mark for total spilled files.

You can write intermediate results to a file, versus relying on spills. Though if you use HDFS, and HDFS is using the same disks in your cluster, that obviously won’t help, and in fact can be worse due to replication of data.
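
Here's a sketch of that pattern, which also covers the sequential-pieces
idea above (class name, paths, and the toy data are placeholders): run
stage 1 first to materialize the intermediate result, then run the same
jar again for stage 2:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class TwoStageJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            if (args.length > 0 && "stage2".equals(args[0])) {
                // Stage 2: pick up the materialized intermediate result
                // instead of recomputing (and re-spilling) it.
                DataSet<Tuple2<Long, String>> intermediate = env
                    .readCsvFile("file:///data/intermediate.csv")
                    .types(Long.class, String.class);
                intermediate
                    .groupBy(0)
                    .first(1)
                    .writeAsCsv("file:///data/final.csv");
            } else {
                // Stage 1: write the expensive intermediate result to a file;
                // toy data stands in for the real upstream computation.
                env.fromElements(Tuple2.of(1L, "a"), Tuple2.of(2L, "b"))
                   .writeAsCsv("file:///data/intermediate.csv");
            }

            env.execute("two-stage-job");
        }
    }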

As far as auto-deletion goes, I don’t think Flink supports this. In our case, after a job has run, we run a shell script (via ssh) on the slaves to remove temp files.
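
Roughly like this (purely a sketch: the host names and temp path depend on
your deployment, and make sure no job is still running before you delete):

    #!/bin/sh
    # Clean leftover Flink spill directories on each worker
    # (hypothetical host names).
    for host in worker1 worker2 worker3; do
      ssh "$host" 'rm -rf /tmp/flink-io-*'
    done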

— Ken

PS - note that logging can also chew up a lot of space, if you set the log level to DEBUG, due to HTTP wire traffic. 
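
If you need DEBUG for your own classes, you can usually silence just the
wire logging; a sketch for a log4j 1.x style log4j.properties, assuming
the standard Apache HttpClient logger names:

    # Keep verbose HTTP wire/header dumps out of the logs even at DEBUG
    log4j.logger.org.apache.http.wire=INFO
    log4j.logger.org.apache.http.headers=INFO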

> On Jul 10, 2019, at 3:51 AM, Papadopoulos, Konstantinos <Ko...@IRIworldwide.com> wrote:
> [Original message quoted in full, trimmed; see the top of the thread.]

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra