Posted to user@flink.apache.org by Alexey Trenikhun <ye...@msn.com> on 2021/08/24 23:38:42 UTC

checkpoints/.../shared cleanup

Hello,
I use incremental checkpoints, not externalized. Should the content of checkpoint/.../shared be removed when I cancel the job (or cancel with savepoint)? It looks like in our case shared continues to grow...

Thanks,
Alexey

Re: checkpoints/.../shared cleanup

Posted by Alexey Trenikhun <ye...@msn.com>.

Hi Roman,
By stop-with-savepoint, did you mean POST /jobs/:jobid/stop? I've tried this one in the past, but I could not get the status of the async operation. I'm not sure which endpoint to use for status; I tried /jobs/:jobid/savepoints/:triggerid, but it didn't work.

Thanks,
Alexey
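
For readers following along, here is a minimal sketch of the two REST calls discussed above. It reflects the Flink REST API only as far as I understand the documentation: the JSON field names (targetDirectory, drain, status.id, operation.location, failure-cause) should be treated as assumptions and checked against your Flink version. The helper names are hypothetical, no HTTP request is sent, and the sketch does not explain why the status call failed in the case described above.

```python
import json


def stop_with_savepoint_request(job_id, target_directory, drain=False):
    """Build (method, path, body) for a stop-with-savepoint call.

    Field names in the body are assumptions based on the REST API docs.
    """
    body = {"targetDirectory": target_directory, "drain": drain}
    return "POST", f"/jobs/{job_id}/stop", json.dumps(body)


def savepoint_status_path(job_id, trigger_id):
    """Path polled for the status of the asynchronous operation."""
    return f"/jobs/{job_id}/savepoints/{trigger_id}"


def interpret_status(response_text):
    """Map a status response to ('in_progress' | 'completed' | 'failed', detail)."""
    payload = json.loads(response_text)
    if payload.get("status", {}).get("id") != "COMPLETED":
        return "in_progress", None
    operation = payload.get("operation", {})
    if "failure-cause" in operation:
        # The operation finished, but with an error instead of a location.
        return "failed", operation["failure-cause"]
    return "completed", operation.get("location")
```

A caller would send the first request, read the trigger id from the response, and then poll the status path until interpret_status stops returning "in_progress".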

Re: checkpoints/.../shared cleanup

Posted by Roman Khachatryan <ro...@apache.org>.

Hi Alexey,

Thanks for sharing this information.
I also don't see anything suspicious in the log.

Yes, Flink deletes files one-by-one and any untracked files won't be deleted.
During the cancellation, if there is an ongoing upload, that upload
can become untracked (though there will be an attempt to delete the
file right after the upload, but that's not guaranteed). TM logs would
probably shed more light here.

Have you tried stop-with-savepoint instead of cancellation after
savepoint? As it's a more graceful way of shutting down, state
artifacts should be deleted after stop-with-savepoint succeeds.

Regards,
Roman
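
The untracked-upload case described above can be illustrated with a toy model. This is only a sketch of the idea, not Flink's code: the class and method names are hypothetical, a Python set stands in for the shared/ folder, and the best-effort delete attempt after a late upload is deliberately left out to show the failure mode.

```python
class SharedStateTracker:
    """Toy model: only files registered here are deleted on cancellation."""

    def __init__(self, storage):
        self.storage = storage   # set of file names standing in for shared/
        self.tracked = set()     # files the tracker knows about
        self.cancelled = False

    def finish_upload(self, name):
        """Called when an upload lands; returns True if the file is tracked."""
        self.storage.add(name)   # the bytes are in shared/ either way
        if self.cancelled:
            # The result arrives after cancellation: nobody registers it.
            return False
        self.tracked.add(name)
        return True

    def cancel(self):
        """Delete tracked files one by one; untracked files are never seen."""
        self.cancelled = True
        for name in list(self.tracked):
            self.storage.discard(name)
        self.tracked.clear()


storage = set()
tracker = SharedStateTracker(storage)
tracker.finish_upload("sst-1")   # completes before cancellation, tracked
tracker.cancel()                 # removes sst-1
tracker.finish_upload("sst-2")   # in-flight upload completes late
leftover = sorted(storage)       # sst-2 stays behind in shared/
```

Because cleanup walks the tracked set rather than listing the folder, anything that lands after cancellation is invisible to it.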

Re: checkpoints/.../shared cleanup

Posted by Alexey Trenikhun <ye...@msn.com>.

Hi Roman

  *   It is a Kubernetes deployment. The JM starts with: standalone-job --job-classname <my-class> --host $(POD_IP) --job-id 00000000000000000000000000000000 <my-job-specific-args>. Stop is done via the API: POST /jobs/{jobId}/savepoints {cancel-job=true}, then we wait for completion; if it is not complete in 10 minutes, we use "Terminate job" (PATCH /jobs/{jobId}). I've also tried cancel from the UI, with the same result: shared is not empty
  *   Flink 1.13.2
  *   job-id is hardcoded to 00000000000000000000000000000000

I don't see anything suspicious in the log; there is an exception that the job is cancelled. I'm attaching an example where the job was canceled via the UI; after that there were 4 files left in the shared folder.

I suspect that Flink doesn't clean the whole folder (prefix) but instead deletes tracked files one by one, and maybe something bad happened during execution (e.g. a failed checkpoint), which led to losing track of some file(s); then during shutdown these files are not deleted, because Flink is no longer tracking them.

Thanks,
Alexey

Re: checkpoints/.../shared cleanup

Posted by Roman Khachatryan <ro...@apache.org>.

I tried to reproduce the issue and I see that the folder grows
(because of the underlying FS) but the files under shared/ are
removed. With large state, it takes quite some time though. Do you see
any errors/warnings in the logs while stopping the job?

Could you please share:
- the commands or API you use to start and stop the job
- Flink version
- the API to choose the job ID?


Regards,
Roman

Re: checkpoints/.../shared cleanup

Posted by Alexey Trenikhun <ye...@msn.com>.

I'm running Flink in Application Mode and set jobId explicitly

________________________________
From: Khachatryan Roman <kh...@gmail.com>
Sent: Monday, August 30, 2021 7:16 AM
To: Alexey Trenikhun <ye...@msn.com>
Cc: Matthias Pohl <ma...@ververica.com>; Flink User Mail List <us...@flink.apache.org>; sjwiesman@gmail.com <sj...@gmail.com>
Subject: Re: checkpoints/.../shared cleanup

Hi,

I think the documentation is correct. Once the job is stopped with
savepoint, any of its "regular" checkpoints are discarded, and as a
result any shared state gets unreferenced and is also discarded.
Savepoints currently do not have shared state.

Furthermore, the new job should have a new ID and therefore a new folder.
Are you referring to the old folders?

However, the removal process is asynchronous and the client doesn't
wait for all the artifacts to be removed.
Then the cluster will wait for removal to complete before termination.
Are you running Flink in session mode?

Regards,
Roman

On Fri, Aug 27, 2021 at 8:05 AM Alexey Trenikhun <ye...@msn.com> wrote:
>
> "the shared subfolder still grows" - while upgrading the job, we cancel the job with savepoint. My expectation is that Flink will clean up the checkpoint, including the shared directory, since checkpoints are not retained. Then we start the upgraded job from the savepoint; however, when I look into the shared folder I see older files from the previous version of the job. This upgrade process is repeated again, and as a result the shared subfolder grows and grows.
>
> Thanks,
> Alexey
> ________________________________
> From: Alexey Trenikhun <ye...@msn.com>
> Sent: Thursday, August 26, 2021 6:37:27 PM
> To: Matthias Pohl <ma...@ververica.com>
> Cc: Flink User Mail List <us...@flink.apache.org>; sjwiesman@gmail.com <sj...@gmail.com>
> Subject: Re: checkpoints/.../shared cleanup
>
> Hi Matthias,
>
> I don't use externalized checkpoints (the Flink UI shows "Persist Checkpoints Externally: Disabled"), so why do you think checkpoint(s) should be retained? It kind of contradicts the documentation [1]: "Checkpoints are by default not retained and are only used to resume a job from failures."
>
> [1] - https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints
>
> Thanks,
> Alexey
> ________________________________
> From: Matthias Pohl <ma...@ververica.com>
> Sent: Thursday, August 26, 2021 5:42 AM
> To: Alexey Trenikhun <ye...@msn.com>
> Cc: Flink User Mail List <us...@flink.apache.org>; sjwiesman@gmail.com <sj...@gmail.com>
> Subject: Re: checkpoints/.../shared cleanup
>
> Hi Alexey,
> thanks for reaching out to the community. I have a question: What do you mean by "the shared subfolder still grows"? As far as I understand, the shared folder contains the state of incremental checkpoints. If you cancel the corresponding job and start a new job from one of the retained incremental checkpoints, it is required for the shared folder of the previous job to be still around since it contains the state. The new job would then create its own shared subfolder. Any new incremental checkpoints will write their state into the new job's shared subfolder while still relying on shared state of the previous job for older data. The RocksDB Backend is in charge of consolidating the incremental state.
>
> Hence, you should be careful with removing the shared folder in case you're planning to restart the job later on.
>
> I'm adding Seth to this thread. He might have more insights and/or correct my limited knowledge of the incremental checkpoint process.
>
> Best,
> Matthias
>
> On Wed, Aug 25, 2021 at 1:39 AM Alexey Trenikhun <ye...@msn.com> wrote:
>
> Hello,
> I use incremental checkpoints, not externalized. Should the content of checkpoint/.../shared be removed when I cancel the job (or cancel with savepoint)? It looks like in our case shared continues to grow...
>
> Thanks,
> Alexey

Re: checkpoints/.../shared cleanup

Posted by Khachatryan Roman <kh...@gmail.com>.

Hi,

I think the documentation is correct. Once the job is stopped with
savepoint, any of its "regular" checkpoints are discarded, and as a
result any shared state gets unreferenced and is also discarded.
Savepoints currently do not have shared state.

Furthermore, the new job should have a new ID and therefore a new folder.
Are you referring to the old folders?

However, the removal process is asynchronous and the client doesn't
wait for all the artifacts to be removed.
Then the cluster will wait for removal to complete before termination.
Are you running Flink in session mode?

Regards,
Roman
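
The "unreferenced and therefore discarded" behaviour described above can be sketched as simple reference counting. This is a toy model under my own assumptions, not Flink's actual implementation; the class name, method names, and file names are hypothetical.

```python
from collections import Counter


class SharedRegistry:
    """Toy model of shared incremental state referenced by checkpoints."""

    def __init__(self):
        self.refs = Counter()    # shared file -> number of referencing checkpoints
        self.checkpoints = {}    # checkpoint id -> set of shared files it uses

    def register_checkpoint(self, cp_id, shared_files):
        """Record a completed checkpoint and count its shared files."""
        self.checkpoints[cp_id] = set(shared_files)
        for name in self.checkpoints[cp_id]:
            self.refs[name] += 1

    def discard_checkpoint(self, cp_id):
        """Drop a checkpoint; return the shared files nobody references anymore."""
        deletable = []
        for name in sorted(self.checkpoints.pop(cp_id)):
            self.refs[name] -= 1
            if self.refs[name] == 0:
                del self.refs[name]
                deletable.append(name)
        return deletable
```

In this model, a shared file survives as long as at least one checkpoint still references it. Once every regular checkpoint of the job is discarded, every shared file becomes deletable. The deletion itself would then run asynchronously, as noted above.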

On Fri, Aug 27, 2021 at 8:05 AM Alexey Trenikhun <ye...@msn.com> wrote:
>
> "the shared subfolder still grows" - while upgrading the job, we cancel it with a savepoint. My expectation is that Flink will clean up the checkpoint, including the shared directory, since checkpoints are not retained. Then we start the upgraded job from the savepoint. However, when I look into the shared folder, I see older files from the previous version of the job. This upgrade process is repeated again and again, and as a result the shared subfolder grows and grows.
>
> Thanks,
> Alexey

Re: checkpoints/.../shared cleanup

Posted by Alexey Trenikhun <ye...@msn.com>.

"the shared subfolder still grows" - while upgrading the job, we cancel it with a savepoint. My expectation is that Flink will clean up the checkpoint, including the shared directory, since checkpoints are not retained. Then we start the upgraded job from the savepoint. However, when I look into the shared folder, I see older files from the previous version of the job. This upgrade process is repeated again and again, and as a result the shared subfolder grows and grows.

Thanks,
Alexey
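
One way to put numbers on the growth described above is to measure each job's shared folder before and after an upgrade. The following is a minimal sketch, assuming the checkpoint directory is reachable as a local or mounted path and follows the `<root>/<job-id>/shared` layout. The function name `shared_usage` and the demo file names are made up for illustration; object stores such as S3 would need their own listing client.

```python
import os
import tempfile


def shared_usage(checkpoint_root):
    """Return {job_id: (file_count, total_bytes)} for each <job_id>/shared folder."""
    usage = {}
    for job_id in sorted(os.listdir(checkpoint_root)):
        shared = os.path.join(checkpoint_root, job_id, "shared")
        if not os.path.isdir(shared):
            continue  # skip entries that are not job folders with shared state
        count = total = 0
        for dirpath, _dirs, files in os.walk(shared):
            for name in files:
                count += 1
                total += os.path.getsize(os.path.join(dirpath, name))
        usage[job_id] = (count, total)
    return usage


# Demo on a throwaway directory that mimics <root>/<job-id>/shared/<file>.
with tempfile.TemporaryDirectory() as root:
    shared = os.path.join(root, "0" * 32, "shared")
    os.makedirs(shared)
    for name, size in [("file-1", 10), ("file-2", 20)]:
        with open(os.path.join(shared, name), "wb") as fh:
            fh.write(b"x" * size)
    demo_usage = shared_usage(root)
```

Running it once after cancelling the job and again after the upgraded job has taken a few checkpoints shows whether files from the previous run are still counted.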
________________________________
From: Alexey Trenikhun <ye...@msn.com>
Sent: Thursday, August 26, 2021 6:37:27 PM
To: Matthias Pohl <ma...@ververica.com>
Cc: Flink User Mail List <us...@flink.apache.org>; sjwiesman@gmail.com <sj...@gmail.com>
Subject: Re: checkpoints/.../shared cleanup

Hi Matthias,

I don't use externalized checkpoints (the Flink UI shows "Persist Checkpoints Externally: Disabled"), so why do you think checkpoint(s) should be retained? That seems to contradict the documentation [1]: "Checkpoints are by default not retained and are only used to resume a job from failures."

[1] - https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints

Thanks,
Alexey

Re: checkpoints/.../shared cleanup

Posted by Alexey Trenikhun <ye...@msn.com>.

Hi Matthias,

I don't use externalized checkpoints (the Flink UI shows "Persist Checkpoints Externally: Disabled"), so why do you think checkpoint(s) should be retained? That seems to contradict the documentation [1]: "Checkpoints are by default not retained and are only used to resume a job from failures."

[1] - https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints

Thanks,
Alexey
________________________________
From: Matthias Pohl <ma...@ververica.com>
Sent: Thursday, August 26, 2021 5:42 AM
To: Alexey Trenikhun <ye...@msn.com>
Cc: Flink User Mail List <us...@flink.apache.org>; sjwiesman@gmail.com <sj...@gmail.com>
Subject: Re: checkpoints/.../shared cleanup

Hi Alexey,
thanks for reaching out to the community. I have a question: What do you mean by "the shared subfolder still grows"? As far as I understand, the shared folder contains the state of incremental checkpoints. If you cancel the corresponding job and start a new job from one of the retained incremental checkpoints, it is required for the shared folder of the previous job to be still around since it contains the state. The new job would then create its own shared subfolder. Any new incremental checkpoints will write their state into the new job's shared subfolder while still relying on shared state of the previous job for older data. The RocksDB Backend is in charge of consolidating the incremental state.

Hence, you should be careful with removing the shared folder in case you're planning to restart the job later on.

I'm adding Seth to this thread. He might have more insights and/or correct my limited knowledge of the incremental checkpoint process.

Best,
Matthias


Re: checkpoints/.../shared cleanup

Posted by Matthias Pohl <ma...@ververica.com>.

Hi Alexey,
thanks for reaching out to the community. I have a question: What do you
mean by "the shared subfolder still grows"? As far as I understand, the
shared folder contains the state of incremental checkpoints. If you cancel
the corresponding job and start a new job from one of the retained
incremental checkpoints, it is required for the shared folder of the
previous job to be still around since it contains the state. The new job
would then create its own shared subfolder. Any new incremental checkpoints
will write their state into the new job's shared subfolder while still
relying on shared state of the previous job for older data. The RocksDB
Backend is in charge of consolidating the incremental state.

Hence, you should be careful with removing the shared folder in case you're
planning to restart the job later on.

I'm adding Seth to this thread. He might have more insights and/or correct
my limited knowledge of the incremental checkpoint process.

Best,
Matthias

On Wed, Aug 25, 2021 at 1:39 AM Alexey Trenikhun <ye...@msn.com> wrote:

> Hello,
> I use incremental checkpoints, not externalized. Should the content of
> checkpoint/.../shared be removed when I cancel the job (or cancel with
> savepoint)? It looks like in our case shared continues to grow...
>
> Thanks,
> Alexey
>