Posted to user@flink.apache.org by Sayat Satybaldiyev <sa...@gmail.com> on 2018/10/09 12:33:24 UTC

Flink leaves a lot RocksDB sst files in tmp directory

Dear all,

While running Flink 1.6.1 with RocksDB as the state backend and HDFS as the
checkpoint file system, I noticed that after a job moves to a different host
it leaves a large amount of state in the temp folder (1.2 TB in total). The
files are no longer used, since the TM on this host is not running the job
anymore.

The job a5b223c7aee89845f9aed24012e46b7e had been running on this host, but
it was then moved to a different TM. Is this intended behavior or a possible
bug?

I've attached a screenshot of the files that were left behind and are no
longer used by the job.

Re: Flink leaves a lot RocksDB sst files in tmp directory

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

Can you show us what is inside one of the instance directories? Furthermore, your TM logs show multiple OutOfMemoryErrors, so that might also be a problem. Also, how was the job moved? If a TM is killed, it of course cannot clean up. That is why the data goes to the tmp dir, so that the OS can eventually take care of it; in container environments this dir should always be cleaned up anyway.
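
One way to answer the question about what is inside an instance directory is a small script that walks it and reports the total size and the file types found. This is only a rough sketch: the function name summarize_dir is made up for illustration, and it assumes nothing about Flink beyond the directories being ordinary folders on the local disk.

```python
import os
import sys
from collections import Counter


def summarize_dir(root):
    """Return (total_bytes, Counter of file extensions) for all files under root."""
    total = 0
    by_ext = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file disappeared or is unreadable; skip it.
                continue
            total += size
            # Files without an extension (e.g. RocksDB's LOG) are grouped together.
            ext = os.path.splitext(name)[1] or "(none)"
            by_ext[ext] += 1
    return total, by_ext


if __name__ == "__main__":
    # Usage: python summarize.py /tmp/flink-io-.../job_..._uuid_...
    for root in sys.argv[1:]:
        total, by_ext = summarize_dir(root)
        print(f"{root}: {total / 1e9:.2f} GB")
        for ext, count in by_ext.most_common():
            print(f"  {ext}: {count} files")
```

Running it against one of the leftover directories and pasting the output into the thread would show whether the space is taken mostly by .sst files or by something else.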

Best,
Stefan



Re: Flink leaves a lot RocksDB sst files in tmp directory

Posted by Sayat Satybaldiyev <sa...@gmail.com>.
Thank you, Piotr, for the reply! We didn't run this job on a previous
version of Flink. Unfortunately, I don't have the JM log file, only the TM
logs.

https://drive.google.com/file/d/14QSVeS4c0EETT6ibK3m_TMgdLUwD6H1m/view?usp=sharing


Re: Flink leaves a lot RocksDB sst files in tmp directory

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

Was this happening in older Flink versions? Could you describe the circumstances under which the job was moved to a new TM (full JobManager and TaskManager logs would be helpful)? I suspect that those leftover files might have something to do with local recovery.

Piotrek 



Re: Flink leaves a lot RocksDB sst files in tmp directory

Posted by Sayat Satybaldiyev <sa...@gmail.com>.
After digging further into the logs, I think this is more likely a bug. I
grepped the log by job ID and found that under normal circumstances the TM is
supposed to delete the flink-io files. For some reason, it doesn't delete the
files that were listed above.

2018-10-08 22:10:25,865 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_bf69685b-78d3-431c-88be-b3f26db05566.
2018-10-08 22:10:25,867 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_14630a50145935222dbee3f1bcfdc2a6__1_1__uuid_47cd6e95-144a-4c52-a905-52966a5e9381.
2018-10-08 22:10:25,874 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_7c539a96-a247-4299-b1a0-01df713c3c34.
2018-10-08 22:17:38,680 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close JobManager connection for job a5b223c7aee89845f9aed24012e46b7e.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,686 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_2e88c56a-2fc2-41f2-a1b9-3b0594f660fb.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,691 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_b44aecb7-ba16-4aa4-b709-31dae7f58de9.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
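
The grep by job ID can be turned into a small script that collects every instance directory the TM log reports as deleted and compares that set with a list of directories still present on disk. The following is a sketch under stated assumptions: the directory-name pattern is inferred only from the log lines in this message, not from Flink's source, reading the two numbers after the operator ID as subtask index and parallelism is a guess, and all function names are made up for illustration. Log lines that were wrapped by a mail client need to be rejoined before parsing.

```python
import re

# Matches the deletion message seen in the TM log and captures the path,
# dropping the trailing period that ends the log sentence.
DELETE_RE = re.compile(
    r"Deleting existing instance base directory (?P<path>/\S+?)\.?$"
)

# Directory-name layout as it appears in the log lines of this thread
# (an inferred pattern, not an official format).
INSTANCE_RE = re.compile(
    r"job_(?P<job>[0-9a-f]{32})_op_(?P<op>.+)_(?P<opid>[0-9a-f]{32})"
    r"__(?P<subtask>\d+)_(?P<parallelism>\d+)__uuid_(?P<uuid>[0-9a-f-]{36})$"
)


def deleted_instance_dirs(log_lines):
    """Collect the instance directory paths that the log says were deleted."""
    deleted = set()
    for line in log_lines:
        match = DELETE_RE.search(line.strip())
        if match:
            deleted.add(match.group("path"))
    return deleted


def parse_instance_dir(path):
    """Split an instance directory name into its parts, or return None."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    match = INSTANCE_RE.match(name)
    return match.groupdict() if match else None


def not_logged_as_deleted(paths_on_disk, deleted):
    """Return the on-disk paths that never appear in a deletion log line."""
    return sorted(p for p in paths_on_disk if p not in deleted)
```

Feeding it the TM log plus the output of a directory listing of /tmp/flink-io-* would give the instance directories for which no deletion was ever logged, grouped by job ID if parse_instance_dir is applied to each path.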

