Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/01/20 21:36:12 UTC

[GitHub] [flink] rkhachatryan opened a new pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

rkhachatryan opened a new pull request #18431:
URL: https://github.com/apache/flink/pull/18431


   ## What is the purpose of the change
   
   Add documentation for State Changelog (FLIP-158):
    - state-backends
    - configuration
    - metrics
   
   The PR contains TODOs (and is not ready to merge as is):
   - Chinese versions aren't updated
   - testing results are desired
   - FLINK-25739 (include jars into dist/opt) desired
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): yes (internal: flink-docs -> flink-dstl-dfs)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? see above
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rkhachatryan commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1024229436


   Thanks for the review @curcur and @infoverload .
   I'll merge the PR once the build is green.





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * d321afe82c3aadf6b2071e056253b6cb422aa017 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * f58ab18ad9596e758779524e0cb4f967f3ba9a90 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915) 
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 5259a8092544f6288cb0b47dcb306f2e5846daab Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837) 
   * f58ab18ad9596e758779524e0cb4f967f3ba9a90 UNKNOWN
   





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r790794843



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
   1. I think explaining when compaction can be triggered is outside the scope of this section. There are several compaction algorithms, to start with.
   2. Agreed.







[GitHub] [flink] flinkbot commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb UNKNOWN
   





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r790412520



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs

Review comment:
   Usually, if we do not have a corresponding Chinese version, we copy the English version to the Chinese docs and open a ticket there?
   
   The ticket can be picked up by anyone.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
   1. Add some context here:
   Currently, incremental checkpoints depend on the implementation of the particular state backend. For example, for RocksDB, compaction happens when ...
   
   2. Explain a bit why compaction is bad:
   Compaction may cause more data to be uploaded and make the upload take longer....
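
To make point 2 concrete, here is a minimal sketch of why compaction can inflate an incremental checkpoint. It assumes a simplified model in which an incremental checkpoint uploads only the files that were not part of the previous checkpoint. The file names and sizes are hypothetical, and this is not Flink's or RocksDB's actual code.

```python
def incremental_upload(previous_files, current_files):
    """Return the files (name -> size in MB) a simplified incremental checkpoint must upload."""
    return {name: size for name, size in current_files.items()
            if name not in previous_files}


# Checkpoint 1: three files that have already been uploaded.
cp1 = {"001.sst": 64, "002.sst": 64, "003.sst": 64}

# Checkpoint 2 without compaction: only one new file appeared.
cp2 = {**cp1, "004.sst": 64}

# Checkpoint 2 after compaction: the old files were merged into one new file,
# so nothing from the previous checkpoint can be reused.
cp2_compacted = {"005.sst": 250}

no_compaction_mb = sum(incremental_upload(cp1, cp2).values())
after_compaction_mb = sum(incremental_upload(cp1, cp2_compacted).values())
print(no_compaction_mb, after_compaction_mb)
```

In this toy model the task that happened to compact uploads several times more data than its peers, which is the "at least one task in every checkpoint" effect the paragraph describes.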

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload

Review comment:
       upload state changes/changelogs

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).

Review comment:
       "as well as synchronous phase (in particular, long-tail)"
   
   Could you explain this a bit more? It cannot be inferred directly from what is stated above.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,

Review comment:
       "latter: denoting the second or second mentioned of two people or things."

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the

Review comment:
       "is checkpointed" => "is snapshotted"
   
   It may confuse people with the normal Flink checkpointing procedure.
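
The distinction between the periodic background snapshot and the regular checkpoint can be sketched with a toy model. This is an illustration only, assuming a single in-memory key/value state. The class and method names are hypothetical and do not correspond to Flink classes.

```python
class ToyChangelogState:
    """Toy model: a materialized snapshot plus a log of changes made since it was taken."""

    def __init__(self):
        self.state = {}         # live key/value state
        self.materialized = {}  # last background snapshot of the state
        self.changelog = []     # changes recorded since that snapshot

    def put(self, key, value):
        self.state[key] = value
        # In this model, every change is appended to the log as it happens.
        self.changelog.append((key, value))

    def materialize(self):
        # Periodic background snapshot; once it succeeds, the log is truncated.
        self.materialized = dict(self.state)
        self.changelog.clear()

    def checkpoint(self):
        # A checkpoint refers to the last snapshot plus the log written since then.
        return dict(self.materialized), list(self.changelog)

    @staticmethod
    def recover(checkpoint):
        snapshot, log = checkpoint
        restored = dict(snapshot)
        for key, value in log:  # replay; a longer log means a longer recovery
            restored[key] = value
        return restored


s = ToyChangelogState()
s.put("a", 1)
s.put("b", 2)
s.materialize()
s.put("a", 3)
cp = s.checkpoint()
restored = ToyChangelogState.recover(cp)
print(cp, restored)
```

Here `materialize()` plays the role of the background snapshot and `checkpoint()` the role of the normal checkpoint, which is why using separate terms for the two avoids confusion.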

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
+
+### Configuration
+
+An example configuration in yaml:
+```yaml
+state.backend.changelog.enabled: true
+state.backend.changelog.storage: filesystem # currently, only filesystem and memory (for tests) are supported
+dstl.dfs.base-path: s3://<bucket-name> # similar to state.checkpoints.dir
+```
+
+Please keep the following defaults (see [limitations](#limitations)):
+```yaml
+execution.checkpointing.max-concurrent-checkpoints: 1
+state.backend.local-recovery: false
+```
+
+Please refer to [configuration reference]({{< ref "docs/deployment/config#state-changelog-options" >}}) for other options.
+
+Changelog can also be enabled or disabled per-job programmatically:
+{{< tabs  >}}
+{{< tab "Java" >}}
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.enableChangelogStateBackend(true);
+```
+{{< /tab >}}
+{{< tab "Scala" >}}
+```scala
+val env = StreamExecutionEnvironment.getExecutionEnvironment()
+env.enableChangelogStateBackend(true)
+```
+{{< /tab >}}
+{{< tab "Python" >}}
+```python
+env = StreamExecutionEnvironment.get_execution_environment()
+env.enable_changelog_statebackend(True)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Monitoring
+
+Available metrics are listed [here]({{< ref "docs/ops/metrics#changelog" >}}).
+
+In the UI, if a task is back-pressured by writing state changes, it will be shown as busy (red).
+
+### Upgrading existing jobs
+
+**Enabling Changelog**
+
+Resuming from both savepoints and checkpoints is supported:
+- given an existing non-changelog job
+- take either a [savepoint]({{< ref "docs/ops/state/savepoints#resuming-from-savepoints" >}}) or a [checkpoint]({{< ref "docs/ops/state/checkpoints#resuming-from-a-retained-checkpoint" >}})
+- alter configuration (enable Changelog)
+- resume from the taken snapshot
+
+**Disabling Changelog**
+Resuming only from [savepoints]({{< ref "docs/ops/state/savepoints#resuming-from-savepoints" >}})
+is supported. Resuming from [checkpoints]({{<  ref "docs/ops/state/checkpoints#resuming-from-a-retained-checkpoint" >}})
+is planned in the future versions.
+
+**State migration** (including changing TTL) is currently not supported

Review comment:
       Include these in the limitation section?
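
As a rough illustration of how the constraints in the quoted section fit together, the sketch below checks a flat dictionary of configuration keys and encodes the resume rules. The configuration keys are the ones quoted above. The function names are hypothetical, and treating a missing `dstl.dfs.base-path` as a problem for filesystem storage is an assumption based on the example, not a documented rule.

```python
def check_changelog_config(conf):
    """Return a list of warnings for a flat dict of Flink-style configuration keys."""
    warnings = []
    if str(conf.get("state.backend.changelog.enabled", "false")).lower() != "true":
        return warnings  # Changelog is off, nothing to check
    # The quoted docs ask to keep these two options at their defaults.
    if str(conf.get("execution.checkpointing.max-concurrent-checkpoints", "1")) != "1":
        warnings.append("keep execution.checkpointing.max-concurrent-checkpoints at 1")
    if str(conf.get("state.backend.local-recovery", "false")).lower() != "false":
        warnings.append("keep state.backend.local-recovery at false")
    # Assumption: filesystem storage needs a base path, as in the example config.
    if (conf.get("state.backend.changelog.storage") == "filesystem"
            and not conf.get("dstl.dfs.base-path")):
        warnings.append("dstl.dfs.base-path is not set")
    return warnings


def can_resume(enabling_changelog, snapshot_type):
    """Enabling works from savepoints and checkpoints; disabling only from savepoints."""
    if enabling_changelog:
        return snapshot_type in ("savepoint", "checkpoint")
    return snapshot_type == "savepoint"


ok = {
    "state.backend.changelog.enabled": "true",
    "state.backend.changelog.storage": "filesystem",
    "dstl.dfs.base-path": "s3://<bucket-name>",
}
bad = {**ok, "state.backend.local-recovery": "true"}
print(check_changelog_config(ok), check_changelog_config(bad))
```

Collecting the resume rules and the two required defaults in one place like this is also what moving them into a single limitations section would achieve in the docs.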

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost

Review comment:
       Marking this as a reminder to remove the todo in the final version.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       The snapshot creation is still synchronous. Is this a typo? I am a bit confused here.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       `However, recovery time combined with checkpoint
   duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
   case.`
   
   I am not sure of that. 







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   





[GitHub] [flink] curcur commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1024121085


   I've accepted the PR. Thanks for the effort, Roman!








[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791311410



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       That's what I am not sure about.
   
   Snapshot creation time: isn't that just preparing the changelog to be uploaded?
   And previously, snapshot creation time was flush + preparing SSTs to be uploaded?
   
   I think most of the time saved comes from the async phase (data is uploaded ahead of time, so the async phase does not take very long).







[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791455950



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       > Snapshot creation time: (isn't that just preparing for changelogs to be uploaded)?
   > And previously snapshot creation time: flush + prepare SSTs to be uploaded?
   
   Yes.
   
   > Would these two parts be significantly different?
   > I think most of the time reduced is from the async phase
   
   With Changelog, it's usually under 100 ms; without it, it can take tens of seconds (above p99). That may not sound like much, but this phase blocks all processing. I describe it [here](https://github.com/apache/flink/pull/18431#discussion_r790436952).
   
   But this statement is actually about the reduction achieved by asynchronous checkpoints, which were implemented a long time ago and are not related to Changelog.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).

Review comment:
       Explained, PTAL.

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
       I've expanded this section to address (2); do you still have any concerns about (1)?

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       > I also saw a lot of cases that recovery time increased by tens of seconds but checkpoint duration does not decrease that much.
   
   If checkpoint duration doesn't decrease significantly, then it's probably not a suitable case for Changelog. So the question, I think, is what the ratio between these times is and how to put it in the docs.
   
   > I do not think we should mix recovery time with checkpoint duration here.
   
   Reducing **effective** recovery time is actually one of the goals of the project (please see the FLIP [Motivation](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-Motivation) and discussions); the means to achieve this is having less data to replay for the whole pipeline (while having data to replay from the Changelog at the task level).
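   
   As a rough illustration of the task-level replay mentioned above, here is a minimal Java sketch. It is a toy model, not Flink code: `ChangelogSketch`, `Change`, and all method names are hypothetical and only mirror the behaviour described in the docs text (changes are appended to a changelog, the backend is materialized periodically, and the changelog is truncated afterwards).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a changelog in front of a state backend. Hypothetical names; not Flink API. */
class ChangelogSketch {

    /** A single recorded state change. */
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> state = new HashMap<>();    // live state of the task
    private Map<String, Long> materialized = new HashMap<>();   // last background snapshot
    private final List<Change> changelog = new ArrayList<>();   // changes since that snapshot

    /** Applies a state change and appends it to the changelog. */
    void put(String key, long value) {
        state.put(key, value);
        changelog.add(new Change(key, value));
    }

    /** Periodic materialization: snapshot the state, then truncate the changelog. */
    void materialize() {
        materialized = new HashMap<>(state);
        changelog.clear();
    }

    /** Number of changes this task would have to replay on recovery. */
    int replayLength() {
        return changelog.size();
    }

    /** Recovery: start from the materialized snapshot and replay the remaining changes. */
    Map<String, Long> recover() {
        Map<String, Long> restored = new HashMap<>(materialized);
        for (Change change : changelog) {
            restored.put(change.key, change.value);
        }
        return restored;
    }

    /** Copy of the live state, for comparison with the recovered state. */
    Map<String, Long> currentState() {
        return new HashMap<>(state);
    }

    public static void main(String[] args) {
        ChangelogSketch task = new ChangelogSketch();
        task.put("a", 1);
        task.put("b", 2);
        task.materialize();   // changelog is truncated here
        task.put("a", 3);     // only this change needs replaying
        System.out.println(task.replayLength());                          // prints 1
        System.out.println(task.recover().equals(task.currentState()));   // prints true
    }
}
```

   In this model, calling `materialize()` less often leaves more entries to replay in `recover()`, which is the effect a longer `state.backend.changelog.periodic-materialize.interval` is described as having on recovery time.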

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).

Review comment:
       Sure, explained, PTAL.







[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r794474367



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} This feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots

Review comment:
       Sure, I'll link to the [RocksDB](https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/state_backends/#the-embeddedrocksdbstatebackend) section where they are mentioned.







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 62ef3ee01ff06ad4ab5b9bae5497268c52315409 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791456268



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).

Review comment:
       Explained, PTAL.







[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r790783796



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       I agree, it's confusing.
   
   What I mean here is minimizing the synchronous phase duration by using some form of persistent data structures (CoW for Heap and immutable SSTables in RocksDB). 
   This is [called](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/state_backends/#the-embeddedrocksdbstatebackend) asynchronous checkpoints/snapshots.
   
   
   WDYT about
   ```
   2. Snapshot creation time (so-called synchronous phase), reduced by immutable (or asynchronous) snapshots. 
   ```
   Or maybe just remove the clarification?







[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r794202114



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} This feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots

Review comment:
       **Without Changelog:**
   Sync phase: flush in-memory data to disk, determine which SSTs to upload
   
   **With Changelog:**
   Sync phase: prepare the state changes to upload
   
   From the differences above, I cannot infer what "asynchronous snapshots" refers to, nor can I see what problem they address. Or am I misunderstanding the process above?
   
   







[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791467158



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       > I also saw a lot of cases that recovery time increased by tens of seconds but checkpoint duration does not decrease that much.
   
   If checkpoint duration doesn't decrease significantly then it's probably not a suitable case to use Changelog. So the question I think is what's the ratio between these times and how to put it in the docs.
   
   > I do not think we should mix recovery time with checkpoint duration here.
   
   Reducing **effective** recovery time is actually one of the goals of the project (please see the FLIP [Motivation](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints#FLIP158:Generalizedincrementalcheckpoints-Motivation) and discussions); the means to achieve this is having less data to replay for the whole pipeline (while still having the Changelog to replay at the task level).
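
To make the task-level replay cost concrete, here is a toy sketch (not Flink code; the class and method names are hypothetical). It assumes the simplest possible model: the state is a map, every update is appended to a changelog, periodic materialization snapshots the full state and truncates the changelog in one step, and recovery loads the last materialized snapshot and replays whatever changelog is left. In this model, a longer materialization interval means a longer changelog to replay on recovery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: recoverable state = last materialized snapshot + changelog recorded since then. */
final class ChangelogReplaySketch {
    private final Map<String, Integer> liveState = new HashMap<>();
    private Map<String, Integer> materialized = new HashMap<>();
    private final List<Map.Entry<String, Integer>> changelog = new ArrayList<>();

    /** Applies an update to the live state and appends it to the changelog. */
    void put(String key, int value) {
        liveState.put(key, value);
        changelog.add(Map.entry(key, value));
    }

    /** Periodic materialization (simplified): snapshot the full state, then truncate the changelog. */
    void materialize() {
        materialized = new HashMap<>(liveState);
        changelog.clear();
    }

    /** Number of changes that would have to be replayed on recovery. */
    int changelogLength() {
        return changelog.size();
    }

    /** Recovery: start from the materialized snapshot and replay the changelog on top of it. */
    Map<String, Integer> recover() {
        Map<String, Integer> restored = new HashMap<>(materialized);
        for (Map.Entry<String, Integer> change : changelog) {
            restored.put(change.getKey(), change.getValue());
        }
        return restored;
    }

    public static void main(String[] args) {
        ChangelogReplaySketch state = new ChangelogReplaySketch();
        state.put("a", 1);
        state.put("b", 2);
        state.materialize();   // the changelog is empty again
        state.put("a", 3);     // only this change needs replaying after a failure
        System.out.println("changes to replay: " + state.changelogLength());
        System.out.println("recovered state:   " + state.recover());
    }
}
```

How the replay cost compares to the checkpoint-duration savings depends on the workload, which is exactly the ratio question above; the sketch only shows where the extra replay work comes from.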







[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791313073



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
       I am not asking you to explain all the different compaction algorithms, but since you mention "at least one task in every checkpoint that uploads a lot of data (e.g. after compaction)", you need to explain why.
   
   That's what I mean by providing some context: what causes a lot of data to be uploaded (compaction), then why compaction causes more data to be uploaded, etc.
   
   I do not think people can directly infer why compaction causes more data to be uploaded unless you explain at least a little about the different compaction levels, etc.
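
As context for the compaction point, here is a minimal sketch (a toy model, not Flink or RocksDB code; the class and file names are hypothetical). It assumes that an incremental checkpoint uploads only the SST files that were not part of the previous checkpoint, and that compaction merges existing files into new ones. Under those assumptions, a compacted file counts as new even though most of its data was already uploaded before.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of incremental checkpoints: only files unknown to the previous checkpoint are uploaded. */
final class IncrementalUploadSketch {

    /** Returns the files to upload: those not already contained in the previous checkpoint. */
    static Set<String> filesToUpload(Set<String> previousCheckpoint, Set<String> currentFiles) {
        Set<String> toUpload = new HashSet<>(currentFiles);
        toUpload.removeAll(previousCheckpoint);
        return toUpload;
    }

    public static void main(String[] args) {
        Set<String> previous = Set.of("001.sst", "002.sst", "003.sst");

        // Usual case: one small file was flushed since the last checkpoint.
        Set<String> afterFlush = Set.of("001.sst", "002.sst", "003.sst", "004.sst");
        System.out.println("after flush:      " + filesToUpload(previous, afterFlush));

        // After compaction: 001-003 were merged into one new, large file 005.sst,
        // which has to be uploaded in full although its contents are mostly old data.
        Set<String> afterCompaction = Set.of("004.sst", "005.sst");
        System.out.println("after compaction: " + filesToUpload(previous, afterCompaction));
    }
}
```

In the second case the upload contains the large merged file, which is the kind of spike that a single task can cause in an otherwise small checkpoint.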
   







[GitHub] [flink] curcur edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1022945920


   Hey @rkhachatryan, I've read through the doc again.
   
   1. Two things about the doc itself: one is about async snapshots; the other is about recovery time (details are inline).
   2. For the commits, is there any way to squash them into one? The PR now includes 13 commits (not counting fix-ups). That's too many.
   3. Should we mention leftovers in the limitations? (This is not a must, but I feel we should be honest about this part.)





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793495141



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} This feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots

Review comment:
       > What are asynchronous snapshots? and how it is different from asynchronous phase
   
   synchronous phase = create a snapshot
   asynchronous phase = upload a snapshot
   
   Implemented naively, the sync phase would duplicate **all** the state (copying all in-memory tables and/or on-disk files) so that the async phase and processing can proceed concurrently.
   
   With asynchronous snapshots (probably a misnomer), the sync phase only flushes or modifies small portions of data, as I explained [above](https://github.com/apache/flink/pull/18431#discussion_r790783796).
   
   > Do you mean the materialization is separated and independent with the checkpointing/snapshotting?
   
   No. As I said [above](https://github.com/apache/flink/pull/18431#discussion_r791455950), asynchronous snapshots are not related to Changelog.
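
The difference between the naive and the cheap sync phase can be sketched roughly as follows (a toy illustration, not Flink code; all names are hypothetical). It is deliberately coarse: it copies the whole table on the first write after a snapshot, whereas real backends are assumed to work at a finer granularity (copy-on-write entries on the heap, immutable SST files in RocksDB). The point is only that the sync phase hands out a reference instead of copying the state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy state table contrasting a naive (copying) sync phase with a copy-on-write one. */
final class SnapshotSketch {
    private Map<String, Integer> current = new HashMap<>();
    private boolean sharedWithSnapshot = false;

    /** Processing path: writes never touch a table version that a snapshot still references. */
    void put(String key, int value) {
        if (sharedWithSnapshot) {
            // First write after a snapshot: continue on a private copy (coarse copy-on-write).
            current = new HashMap<>(current);
            sharedWithSnapshot = false;
        }
        current.put(key, value);
    }

    /** Naive sync phase: duplicates all state, so its cost grows with the state size. */
    Map<String, Integer> naiveSnapshot() {
        return new HashMap<>(current);
    }

    /** Cheap sync phase: only hands out a read-only view of the current version. */
    Map<String, Integer> cheapSnapshot() {
        sharedWithSnapshot = true;
        return Collections.unmodifiableMap(current);
    }

    public static void main(String[] args) {
        SnapshotSketch table = new SnapshotSketch();
        table.put("a", 1);

        Map<String, Integer> snapshot = table.cheapSnapshot(); // sync phase ends here
        table.put("a", 2);                                     // processing continues

        // The async phase (upload) can still read the old, unchanged version.
        System.out.println("snapshot sees a=" + snapshot.get("a"));
        System.out.println("live state has a=" + table.naiveSnapshot().get("a"));
    }
}
```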







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 2512d7564bf165d426cc722a4bd1a83db1c5edb2 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * d321afe82c3aadf6b2071e056253b6cb422aa017 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 2512d7564bf165d426cc722a4bd1a83db1c5edb2 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314) 
   * 62ef3ee01ff06ad4ab5b9bae5497268c52315409 UNKNOWN
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793603478



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       I've added the latter statement in 2512d75, PTAL.
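
As a rough illustration of the mechanism described in the quoted section (continuous upload of state changes, a short checkpoint, background materialization followed by truncation, and replay on recovery), here is a toy Java model. It is a sketch under simplifying assumptions, not Flink's implementation: the class `ChangelogSketch` and all of its methods are hypothetical, "upload" is reduced to a counter, and materialization runs synchronously instead of in the background.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a changelog-backed state store; hypothetical names, not Flink APIs. */
class ChangelogSketch {
    private final Map<String, Integer> liveState = new HashMap<>();
    private final List<Map.Entry<String, Integer>> changelog = new ArrayList<>();
    private Map<String, Integer> materialized = new HashMap<>(); // last materialized snapshot
    private int uploadedEntries = 0; // changelog entries already "uploaded"

    /** Every state change is applied and also appended to the changelog. */
    void put(String key, int value) {
        liveState.put(key, value);
        changelog.add(Map.entry(key, value));
    }

    /** Continuous upload between checkpoints; returns how many entries were uploaded now. */
    int uploadPending() {
        int pending = changelog.size() - uploadedEntries;
        uploadedEntries = changelog.size();
        return pending;
    }

    /** On checkpoint, only the not-yet-uploaded tail of the changelog remains. */
    int checkpoint() {
        return uploadPending();
    }

    /** Periodic materialization: snapshot the state, then truncate the changelog. */
    void materialize() {
        materialized = new HashMap<>(liveState);
        changelog.clear();
        uploadedEntries = 0;
    }

    /** Recovery: restore the materialized snapshot, then replay the uploaded changes. */
    Map<String, Integer> recover() {
        Map<String, Integer> restored = new HashMap<>(materialized);
        for (Map.Entry<String, Integer> change : changelog.subList(0, uploadedEntries)) {
            restored.put(change.getKey(), change.getValue());
        }
        return restored;
    }

    /** Number of changes that a recovery would currently have to replay at most. */
    int changelogLength() {
        return changelog.size();
    }

    public static void main(String[] args) {
        ChangelogSketch backend = new ChangelogSketch();
        backend.put("a", 1);
        backend.put("b", 2);
        backend.put("a", 3);
        System.out.println("uploaded continuously: " + backend.uploadPending());
        backend.put("c", 4);
        System.out.println("uploaded at checkpoint: " + backend.checkpoint());
        System.out.println("entries to replay: " + backend.changelogLength());
        backend.materialize();
        System.out.println("entries to replay after materialization: " + backend.changelogLength());
        System.out.println("recovered: " + backend.recover());
    }
}
```

In this model the work left for `checkpoint()` depends only on the changes made since the last upload, while the replay cost in `recover()` grows with the number of changes since the last `materialize()`. That mirrors the trade-off discussed for `state.backend.changelog.periodic-materialize.interval`.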







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 5259a8092544f6288cb0b47dcb306f2e5846daab Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837) 
   * f58ab18ad9596e758779524e0cb4f967f3ba9a90 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * f58ab18ad9596e758779524e0cb4f967f3ba9a90 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab UNKNOWN
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r790762736



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs

Review comment:
       Yes, that's what I was going to do; I just wanted to get approval for the English version first (otherwise the translation would be inconsistent).







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * d321afe82c3aadf6b2071e056253b6cb422aa017 UNKNOWN
   



[GitHub] [flink] curcur edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1022945920


   Hey @rkhachatryan, I've read through the doc again.
   
   1. Two things about the doc itself: one is about async snapshots; the other is about recovery time (details are inline).
   2. For the commits, is there any way to squash them into one? The PR now includes 13 commits (not counting fix-ups). That's too many.
   3. Should we mention leftovers in the limitations section? (This is not a must, but I feel we should be honest about this part.)





[GitHub] [flink] rkhachatryan merged pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan merged pull request #18431:
URL: https://github.com/apache/flink/pull/18431


   





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * f58ab18ad9596e758779524e0cb4f967f3ba9a90 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915) 
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d UNKNOWN
   



[GitHub] [flink] rkhachatryan commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1018628157


   Updated `Installation` section after agreeing on including all jars into the main Flink jar (#18432).
   Added `Limitations` and `Upgrading` sections and a mention of recovery time.





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r794486545



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs

Review comment:
       FLINK-25867







[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791311410



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       That's what I am not sure about.
   
   Snapshot creation time with Changelog: isn't that just preparing the changelog to be uploaded?
   And snapshot creation time previously: flush + prepare SSTs to be uploaded?
   
   Would these two parts be significantly different?
   
   I think most of the time saved comes from the async phase (data is uploaded ahead of time, so the async phase does not take very long).

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
       I am not asking you to explain all the different compaction algorithms, but since you mention "at least one task in every checkpoint that uploads a lot of data (e.g. after compaction)", you need to explain why.
   
   That's what I mean by providing some context: what causes a lot of data to be uploaded (compaction), then why compaction causes more data to be uploaded, etc.
   
   I do not think people can infer directly why compaction causes more data to be uploaded unless you explain at least a little about the different levels of compaction, etc.
   

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       Recovery time really depends on how much data has to be restored, and on how fast data can be uploaded to DFS.
   
   I have also seen a lot of cases where recovery time increased by tens of seconds while checkpoint duration did not decrease that much.
   
   What we showed in the graph is a job instance with 196GB state size, and there are a lot of cases where the above statement does not hold.
   
   I do not think we should mix recovery time with checkpoint duration here.
   







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314",
       "triggerID" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409",
       "triggerID" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2512d7564bf165d426cc722a4bd1a83db1c5edb2 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314) 
   * 62ef3ee01ff06ad4ab5b9bae5497268c52315409 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314",
       "triggerID" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409",
       "triggerID" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "triggerType" : "PUSH"
     }, {
       "hash" : "480987ed4a563ed50095cba131e5de92f05fda63",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30433",
       "triggerID" : "480987ed4a563ed50095cba131e5de92f05fda63",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 62ef3ee01ff06ad4ab5b9bae5497268c52315409 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409) 
   * 480987ed4a563ed50095cba131e5de92f05fda63 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30433) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793506543



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       There is **no guarantee**, but there **is evidence** for some cases.
   And this is also true for the checkpoint duration (there is even a guarantee that it will increase in some cases, e.g. many updates of a single key).
   
   That's why I put it as "likely". Do you prefer some other wording?
   
   Or I could add something like "However, it's also possible that the effective recovery time will increase, depending on the actual ratio of the aforementioned times", WDYT?
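
To make the "ratio of the aforementioned times" concrete, here is a toy calculation. It is only a sketch: the class `ChangelogTradeoff`, its methods, and every number are hypothetical and not part of Flink, and it assumes that "effective recovery time" can be approximated as checkpoint duration plus restore time.

```java
// Hypothetical illustration only; not a Flink API. All figures are invented.
final class ChangelogTradeoff {

    /** Rough proxy (an assumption): checkpoint duration plus time to restore state. */
    static long effectiveRecoverySeconds(long checkpointSeconds, long restoreSeconds) {
        return checkpointSeconds + restoreSeconds;
    }

    /** True if the changelog setup has the lower combined time. */
    static boolean changelogWins(long cpPlain, long restorePlain,
                                 long cpChangelog, long restoreChangelog) {
        return effectiveRecoverySeconds(cpChangelog, restoreChangelog)
                < effectiveRecoverySeconds(cpPlain, restorePlain);
    }

    public static void main(String[] args) {
        // Case A: checkpoint 60 s -> 5 s, restore 30 s -> 50 s (55 s vs 90 s).
        System.out.println(changelogWins(60, 30, 5, 50));
        // Case B: checkpoint 10 s -> 8 s, restore 30 s -> 60 s (68 s vs 40 s).
        System.out.println(changelogWins(10, 30, 8, 60));
    }
}
```

Case A favours Changelog and case B does not, which is why the wording probably has to stay hedged either way.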
   







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d321afe82c3aadf6b2071e056253b6cb422aa017 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244) 
   * 2512d7564bf165d426cc722a4bd1a83db1c5edb2 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791458494



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).

Review comment:
       I've expanded this section to address (2). Do you still have any concerns about (1)?







[GitHub] [flink] rkhachatryan commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1022224983


   Thanks a lot @infoverload, I've accepted your suggestions.
   FYI @curcur, the PR was updated.





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791455950



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       > Snapshot creation time: (isn't that just preparing for changelogs to be uploaded)?
   > And previously snapshot creation time: flush + prepare SSTs to be uploaded?
   
   Yes.
   
   > Would these two parts be significantly different?
   > I think most of the time reduced is from the async phase
   
   With Changelog, it's usually under 100 ms; without it, it can take tens of seconds (above p99). That may not sound like much, but this phase blocks all processing. I describe it [here](https://github.com/apache/flink/pull/18431#discussion_r790436952).
   
   But actually, this statement is about the reduction achieved by asynchronous checkpoints, which were implemented a long time ago and are not related to the Changelog.







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314",
       "triggerID" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d321afe82c3aadf6b2071e056253b6cb422aa017 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244) 
   * 2512d7564bf165d426cc722a4bd1a83db1c5edb2 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793315064



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       What I want to say is that there is no evidence showing that recovery time is guaranteed to be reduced. 
   1. "less data replay" assumes that checkpoint duration is reduced (the reduction is not always significant, as I mentioned above).
   2. Replaying the changelog into the DB is an additional recovery cost with Changelog compared to normal recovery.
   
   Points 1 and 2 leave me unconvinced by:
   "However, recovery time combined with checkpoint duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover case."
   
   







[GitHub] [flink] curcur commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1022945920


   Hey @rkhachatryan, I've read through the doc again.
   
   1. Two things about the doc itself: one concerns async snapshots; the other concerns recovery time (details are inline).
   2. For the commits, is there any way to squash them into one? The PR now includes 13 commits (not counting fix-ups).
   3. Should we mention leftovers in the limitations? (This is not a must, but I feel we should be honest about this part.)





[GitHub] [flink] curcur commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1022921395


   > Thanks a lot @infoverload, I've accepted your suggestions. FYI @curcur, the PR was updated.
   
   Thanks a lot @rkhachatryan and @infoverload, I will take a look!





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793338475



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} This feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots

Review comment:
       `asynchronous snapshots`
   
   What are asynchronous snapshots, and how are they different from the `asynchronous phase`?
   
   Do you mean that materialization is separate from and independent of checkpointing/snapshotting?







[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r794274595



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} This feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots

Review comment:
       OK, I finally got what you said.
   
   Is there any reference you can use here, similar to the ones for Unaligned checkpoints and Buffer debloating?
   
   It would be too confusing otherwise.







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * d321afe82c3aadf6b2071e056253b6cb422aa017 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791314908



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       Recovery time really depends on how much data there is to restore and how fast data can be uploaded to DFS.
   
   I have also seen many cases where recovery time increased by tens of seconds while checkpoint duration did not decrease that much.
   
   What we showed in the graph is a job instance with a 196 GB state size, and there are many cases where the above statement does not hold.
   
   I do not think we should mix recovery time with checkpoint duration here. 
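
   To make the concern concrete, here is a back-of-envelope model. It is only a sketch under stated assumptions, not a measurement and not how Flink computes anything: the function name, the change rate, and the replay throughput are all hypothetical, and the model ignores the time to download the materialized snapshot. It assumes the changelog grows linearly between materializations, so the worst-case replay time scales with `state.backend.changelog.periodic-materialize.interval`.

```python
def worst_case_replay_seconds(change_rate_mb_s, materialize_interval_s, replay_throughput_mb_s):
    """Upper bound on changelog replay time under a toy linear-growth model.

    All three parameters are illustrative assumptions, not Flink settings.
    """
    if replay_throughput_mb_s <= 0:
        raise ValueError("replay throughput must be positive")
    # Worst case: the failure happens just before the next materialization
    # finishes, so a full interval's worth of state changes must be replayed.
    backlog_mb = change_rate_mb_s * materialize_interval_s
    return backlog_mb / replay_throughput_mb_s


if __name__ == "__main__":
    # Hypothetical job: 5 MB/s of state changes, 100 MB/s replay throughput.
    for interval_s in (60, 600, 1800):
        extra = worst_case_replay_seconds(5.0, interval_s, 100.0)
        print(f"materialize interval {interval_s}s: up to {extra:.0f}s of replay")
```

   Whether this extra replay time is offset by shorter checkpoints depends on the job, which is why the two numbers are better discussed separately.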
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5259a8092544f6288cb0b47dcb306f2e5846daab Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * d321afe82c3aadf6b2071e056253b6cb422aa017 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>








[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30314",
       "triggerID" : "2512d7564bf165d426cc722a4bd1a83db1c5edb2",
       "triggerType" : "PUSH"
     }, {
       "hash" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409",
       "triggerID" : "62ef3ee01ff06ad4ab5b9bae5497268c52315409",
       "triggerType" : "PUSH"
     }, {
       "hash" : "480987ed4a563ed50095cba131e5de92f05fda63",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "480987ed4a563ed50095cba131e5de92f05fda63",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 62ef3ee01ff06ad4ab5b9bae5497268c52315409 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30409) 
   * 480987ed4a563ed50095cba131e5de92f05fda63 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot commented on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952571


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit bd061da8f43a98196cdf4ef99e55ae6cda9317eb (Thu Jan 20 21:41:28 UTC 2022)
   
   **Warnings:**
    * **1 pom.xml files were touched**: Check for build and licensing issues.
    * Documentation files were touched, but no `docs/content.zh/` files: Update Chinese documentation or file Jira ticket.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [CANCELED](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   * 5259a8092544f6288cb0b47dcb306f2e5846daab Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd061da8f43a98196cdf4ef99e55ae6cda9317eb Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>








[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r790817737



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       Could you elaborate a bit?
   
   In the experiments, we saw checkpoint duration reduced by minutes and recovery time increased by tens of seconds. This is not guaranteed, of course, but I think it is likely.
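   
   For context, the recovery-time side of this trade-off is governed by the materialization interval named in the paragraph under review. A minimal `flink-conf.yaml` sketch follows. Only `state.backend.changelog.periodic-materialize.interval` appears in this thread; the other keys are assumed from the Flink 1.15 configuration reference, and the path and interval values are purely illustrative:
   
   ```yaml
   # Assumed keys (not quoted in this thread): enable the changelog and choose its storage.
   state.backend.changelog.enabled: true
   state.backend.changelog.storage: filesystem
   # Hypothetical location for the uploaded state changes; replace with your own DFS path.
   dstl.dfs.base-path: s3://<bucket-name>/changelog
   # Key named in the docs under review; the value here is only an example.
   state.backend.changelog.periodic-materialize.interval: 10min
   ```
   
   A shorter interval should keep the changelog short, so there is less to replay on recovery, at the cost of checkpointing the configured state backend in the background more often. A longer interval does the opposite.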







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29833",
       "triggerID" : "bd061da8f43a98196cdf4ef99e55ae6cda9317eb",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29837",
       "triggerID" : "5259a8092544f6288cb0b47dcb306f2e5846daab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=29915",
       "triggerID" : "f58ab18ad9596e758779524e0cb4f967f3ba9a90",
       "triggerType" : "PUSH"
     }, {
       "hash" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058",
       "triggerID" : "939c7760b4809ef0e2d1ea897aa8d1217913669d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240",
       "triggerID" : "13e788e57463ac850c6d55152d5f92501f269bdd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244",
       "triggerID" : "d321afe82c3aadf6b2071e056253b6cb422aa017",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d321afe82c3aadf6b2071e056253b6cb422aa017 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30244) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791314908



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       Recovery time really depends on how much data has to be restored.
   
   I also saw many cases where recovery time increased by tens of seconds but checkpoint duration did not decrease that much.
   
   What we showed in the graph is a job instance with a 196GB state size, and there are many cases where the above statement does not hold.
   
   I do not think we should mix recovery time with checkpoint duration here. 
   







[GitHub] [flink] infoverload commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
infoverload commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r792382240



##########
File path: docs/content/docs/ops/metrics.md
##########
@@ -1203,6 +1203,59 @@ Note that for failed checkpoints, metrics are updated on a best efforts basis an
 ### RocksDB
 Certain RocksDB native metrics are available but disabled by default, you can find full documentation [here]({{< ref "docs/deployment/config" >}}#rocksdb-native-metrics)
 
+### State changelog

Review comment:
       ```suggestion
   ### State Changelog
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.

Review comment:
       ```suggestion
   Upload time can be decreased by [incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
   However, most incremental state backends perform some form of compaction periodically, which results in re-uploading the
   old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
   data tends to be very high in every checkpoint.
   ```
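
To make the last sentence of the suggestion concrete, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each task independently hits a compaction-driven re-upload in a given checkpoint with probability `p`; the function name and the numbers are hypothetical and are not taken from Flink.

```python
def prob_any_heavy_upload(p: float, num_tasks: int) -> float:
    """Probability that at least one of num_tasks independent tasks
    performs a heavy (compaction-driven) upload in a single checkpoint."""
    return 1.0 - (1.0 - p) ** num_tasks


if __name__ == "__main__":
    # Assumed value: a 1% chance per task per checkpoint.
    for n in (10, 100, 1000):
        print(n, round(prob_any_heavy_upload(0.01, n), 3))
```

Under this independence assumption, even a small per-task probability adds up quickly as parallelism grows, which is the effect the paragraph describes.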

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
+
+### Configuration
+
+An example configuration in yaml:
+```yaml
+state.backend.changelog.enabled: true
+state.backend.changelog.storage: filesystem # currently, only filesystem and memory (for tests) are supported
+dstl.dfs.base-path: s3://<bucket-name> # similar to state.checkpoints.dir
+```
+
+Please keep the following defaults (see [limitations](#limitations)):
+```yaml
+execution.checkpointing.max-concurrent-checkpoints: 1
+state.backend.local-recovery: false
+```
+
+Please refer to [configuration reference]({{< ref "docs/deployment/config#state-changelog-options" >}}) for other options.
+
+Changelog can also be enabled or disabled per-job programmatically:

Review comment:
       ```suggestion
   Changelog can also be enabled or disabled per job programmatically:
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.

Review comment:
       ```suggestion
   Changelog JARs are included in the standard Flink distribution.
   
   Make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       ```suggestion
   2. Snapshot creation time (so-called synchronous phase), addressed by asynchronous snapshots
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       ```suggestion
   Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval` setting,
   the changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
   duration will likely still be lower than in non-changelog setups, providing lower end-to-end latency even in the
   failover case.
   ```
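
A rough way to reason about this trade-off is sketched below. It assumes a constant rate of state changes and that the changelog is truncated right after each materialization, so the part to replay is bounded by roughly one interval's worth of changes. Both helper names and all numbers are hypothetical; real behaviour depends on the workload and on how long materialization itself takes.

```python
def worst_case_replay_bytes(change_rate_bytes_per_s: float,
                            materialize_interval_s: float) -> float:
    """Upper bound on changelog size accumulated between two materializations."""
    return change_rate_bytes_per_s * materialize_interval_s


def replay_seconds(replay_bytes: float,
                   replay_throughput_bytes_per_s: float) -> float:
    """Time needed to re-apply the changelog at a given replay throughput."""
    return replay_bytes / replay_throughput_bytes_per_s


if __name__ == "__main__":
    # Assumed numbers: 1 MB/s of changes, 10 minute interval, 50 MB/s replay.
    size = worst_case_replay_bytes(1_000_000, 600)
    print(size, replay_seconds(size, 50_000_000))
```

Shortening the interval lowers the replay bound at the cost of more frequent background snapshots.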

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
+
+### Configuration
+
+An example configuration in yaml:
+```yaml
+state.backend.changelog.enabled: true
+state.backend.changelog.storage: filesystem # currently, only filesystem and memory (for tests) are supported
+dstl.dfs.base-path: s3://<bucket-name> # similar to state.checkpoints.dir
+```
+
+Please keep the following defaults (see [limitations](#limitations)):
+```yaml
+execution.checkpointing.max-concurrent-checkpoints: 1
+state.backend.local-recovery: false
+```
+
+Please refer to [configuration reference]({{< ref "docs/deployment/config#state-changelog-options" >}}) for other options.
+
+Changelog can also be enabled or disabled per-job programmatically:
+{{< tabs  >}}
+{{< tab "Java" >}}
+```java
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+env.enableChangelogStateBackend(true);
+```
+{{< /tab >}}
+{{< tab "Scala" >}}
+```scala
+val env = StreamExecutionEnvironment.getExecutionEnvironment()
+env.enableChangelogStateBackend(true)
+```
+{{< /tab >}}
+{{< tab "Python" >}}
+```python
+env = StreamExecutionEnvironment.get_execution_environment()
+env.enable_changelog_statebackend(True)
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+### Monitoring
+
+Available metrics are listed [here]({{< ref "docs/ops/metrics#changelog" >}}).
+
+In the UI, if a task is back-pressured by writing state changes, it will be shown as busy (red).

Review comment:
       ```suggestion
   If a task is backpressured by writing state changes, it will be shown as busy (red) in the UI.
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}

Review comment:
       ```suggestion
   {{< hint warning >}} This feature is in experimental status. {{< /hint >}}
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
+
+### Configuration
+
+An example configuration in yaml:

Review comment:
       ```suggestion
   Here is an example configuration in YAML:
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload state changes
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.
+
+For more details, see [FLIP-158](https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints).
+
+### Installation
+
+Changelog jars are included into the standard Flink distribution.
+
+Please make sure to [add]({{< ref "docs/deployment/filesystems/overview" >}}) the necessary filesystem plugins.
+
+### Configuration
+
+An example configuration in yaml:
+```yaml
+state.backend.changelog.enabled: true
+state.backend.changelog.storage: filesystem # currently, only filesystem and memory (for tests) are supported
+dstl.dfs.base-path: s3://<bucket-name> # similar to state.checkpoints.dir
+```
+
+Please keep the following defaults (see [limitations](#limitations)):
+```yaml
+execution.checkpointing.max-concurrent-checkpoints: 1
+state.backend.local-recovery: false
+```
+
+Please refer to [configuration reference]({{< ref "docs/deployment/config#state-changelog-options" >}}) for other options.

Review comment:
       ```suggestion
   Please refer to the [configuration section]({{< ref "docs/deployment/config#state-changelog-options" >}}) for other options.
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.

Review comment:
       ```suggestion
   Changelog is a feature that aims to decrease checkpointing time and, therefore, end-to-end latency in exactly-once mode.
   ```

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.

Review comment:
       ```suggestion
   With Changelog enabled, Flink uploads state changes continuously and forms a changelog. On checkpoint, only the relevant
   part of this changelog needs to be uploaded. The configured state backend is snapshotted in the
   background periodically. Upon successful upload, the changelog is truncated.
   ```
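
For readers of this thread, the mechanism in the suggested paragraph can be illustrated with a toy model. This is only a sketch under simplifying assumptions (a single task, in-memory "uploads", no deletions); `ToyChangelog` and its methods are hypothetical names and do not correspond to Flink classes.

```python
class ToyChangelog:
    """Toy model: an append-only log of state changes plus a periodic snapshot."""

    def __init__(self):
        self.state = {}      # the underlying (configured) state backend
        self.snapshot = {}   # last materialized snapshot of the backend
        self.log = []        # changes made since the last materialization
        self.uploaded = 0    # number of log entries already uploaded

    def put(self, key, value):
        """Apply a state change and record it in the changelog."""
        self.state[key] = value
        self.log.append((key, value))

    def upload_pending(self):
        """Continuous upload: ship entries not uploaded yet, return their count."""
        pending = len(self.log) - self.uploaded
        self.uploaded = len(self.log)
        return pending

    def checkpoint(self):
        """On checkpoint, only the not-yet-uploaded tail has to be shipped."""
        shipped = self.upload_pending()
        return shipped, (dict(self.snapshot), list(self.log))

    def materialize(self):
        """Background snapshot of the backend; afterwards the log is truncated."""
        self.snapshot = dict(self.state)
        self.log.clear()
        self.uploaded = 0

    @staticmethod
    def restore(snapshot, log):
        """Recovery: start from the snapshot and replay the changelog."""
        state = dict(snapshot)
        for key, value in log:
            state[key] = value
        return state


if __name__ == "__main__":
    cl = ToyChangelog()
    cl.put("a", 1)
    cl.put("b", 2)
    cl.upload_pending()          # changes are shipped before any checkpoint
    cl.put("a", 3)
    shipped, handle = cl.checkpoint()
    print(shipped, ToyChangelog.restore(*handle))
```

The point of the model is that the checkpoint only ships the small tail written since the last continuous upload, while the snapshot of the backend happens independently in the background.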

##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,129 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The last one (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}).
+However, most Incremental State Backends perform some form of compaction periodically, which results in re-uploading the
+old state in addition to the new changes. In large deployments, the probability of at least one task uploading lots of
+data tend to be very high in every checkpoint.
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is snapshotted in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase duration is reduced, as well as synchronous phase - because no data needs to be flushed
+to disk. In particular, long-tail latency is improved.
+
+On the flip side, resource usage is higher:

Review comment:
       ```suggestion
   However, resource usage is higher:
   ```







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   * 13e788e57463ac850c6d55152d5f92501f269bdd Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30240) 
   * c1f6b74d1dc0808adc8f7f1128adcd7a7477c858 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r793315064



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).
+
+On the flip side, resource usage is higher:
+
+- more files are created on DFS
+- more IO bandwidth is used to upload
+- more CPU used to serialize state changes
+- more memory used by Task Managers to buffer state changes
+- todo: more details after testing, maybe link to blogpost
+
+Recovery time is another thing to consider. Depending on the `state.backend.changelog.periodic-materialize.interval`,
+changelog can become lengthy and replaying it may take more time. However, recovery time combined with checkpoint
+duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover
+case.

Review comment:
       What I want to say is that there is no evidence that recovery time is guaranteed to be reduced.
   1. "Less data replay" assumes that checkpoint duration is reduced (as I mentioned above, the reduction is not always significant).
   2. Replaying the changelog into the DB is an additional recovery cost with changelog compared to normal recovery.
   
   Points 1 and 2 together leave me unconvinced by
   "However, recovery time combined with checkpoint duration will likely be still lower than in non-changelog setup, providing lower end-to-end latency even in failover case."
   
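To make points 1 and 2 concrete, here is a toy failover model, a sketch only. The cost formula, the function name `failover_cost_s`, and every number below are assumptions for illustration; none of them are measurements from Flink.

```python
def failover_cost_s(restore_s, changelog_replay_s,
                    checkpoint_interval_s, checkpoint_duration_s):
    """Hypothetical worst-case seconds from failure until the job has caught up.

    Assumes input is reprocessed at the same rate it originally arrived.
    """
    # Worst case: the failure hits just before a checkpoint completes, so
    # everything since the previous checkpoint was triggered is reprocessed.
    reprocess_s = checkpoint_interval_s + checkpoint_duration_s
    return restore_s + changelog_replay_s + reprocess_s


# Made-up numbers, in seconds.
baseline = failover_cost_s(restore_s=60, changelog_replay_s=0,
                           checkpoint_interval_s=60, checkpoint_duration_s=30)
# Changelog: small checkpoint-duration gain, long changelog to replay.
small_gain = failover_cost_s(restore_s=60, changelog_replay_s=40,
                             checkpoint_interval_s=60, checkpoint_duration_s=25)
# Changelog: large checkpoint-duration gain, short changelog to replay.
large_gain = failover_cost_s(restore_s=60, changelog_replay_s=10,
                             checkpoint_interval_s=60, checkpoint_duration_s=2)

print(baseline, small_gain, large_gain)
```

Under these assumptions the changelog setup is slower than the baseline when the duration gain is small and the replay is long, and faster in the opposite case, so the outcome depends on the workload.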
   







[GitHub] [flink] rkhachatryan commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791456268



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots
+3. Snapshot upload time (asynchronous phase)
+
+The latter (upload time) can be decreased by [Incremental checkpoints]({{< ref "#incremental-checkpoints" >}}). However,
+even with Incremental checkpoints, large deployments tend to have at least one task in every checkpoint that uploads a
+lot of data (e.g. after compaction).
+
+With Changelog enabled, Flink uploads state changes continuously, forming a changelog. On checkpoint, only the relevant
+part of this changelog needs to be uploaded. Independently, configured state backend is checkpointed in the
+background periodically. Upon successful upload, changelog is truncated.
+
+As a result, asynchronous phase is reduced, as well as synchronous phase (in particular, long-tail).

Review comment:
       Sure, explained, PTAL.







[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 939c7760b4809ef0e2d1ea897aa8d1217913669d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30058) 
   





[GitHub] [flink] curcur commented on a change in pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
curcur commented on a change in pull request #18431:
URL: https://github.com/apache/flink/pull/18431#discussion_r791311410



##########
File path: docs/content/docs/ops/state/state_backends.md
##########
@@ -325,6 +325,126 @@ public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {
 
 {{< top >}}
 
+## Enabling Changelog
+
+// todo: Chinese version of all changed docs
+
+// todo: mention in [large state tuning]({{< ref "docs/ops/state/large_state_tuning" >}})? or 1.16?
+
+{{< hint warning >}} The feature is in experimental status. {{< /hint >}}
+
+{{< hint warning >}} Enabling Changelog may have a negative performance impact on your application (see below). {{< /hint >}}
+
+### Introduction
+
+Changelog is a feature that aims to decrease checkpointing time, and therefore end-to-end latency in exactly-once mode.
+
+Most commonly, checkpoint duration is affected by:
+
+1. Barrier travel time and alignment, addressed by
+   [Unaligned checkpoints]({{< ref "docs/ops/state/checkpointing_under_backpressure#unaligned-checkpoints" >}})
+   and [Buffer debloating]({{< ref "docs/ops/state/checkpointing_under_backpressure#buffer-debloating" >}})
+2. Snapshot creation time (so-called synchronous phase), addressed by Asynchronous snapshots

Review comment:
       That's what I am not sure about.
   
   Snapshot creation time with changelog: isn't that just preparing the changelog to be uploaded?
   And snapshot creation time previously: flush + prepare the SSTs to be uploaded?
   
   Would these two be significantly different?
   
   I think most of the time saved comes from the async phase (changes are uploaded ahead of time, so the async phase does not take long).
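
A rough way to see why the saving would sit in the async phase, as a sketch only: the helper `async_phase_s`, the bandwidth, and the byte counts are hypothetical values chosen for illustration, not numbers from any Flink test.

```python
def async_phase_s(pending_upload_bytes, upload_bytes_per_s):
    """Seconds needed to upload whatever is not yet on DFS when the checkpoint starts."""
    return pending_upload_bytes / upload_bytes_per_s


BANDWIDTH = 50e6  # hypothetical upload bandwidth: 50 MB/s

# Incremental checkpoint right after a compaction: the rewritten SST files
# (assumed 2 GB here) must be uploaded during the async phase.
incremental_after_compaction = async_phase_s(2e9, BANDWIDTH)

# Changelog: changes were uploaded continuously before the checkpoint,
# so only a small tail (assumed 20 MB here) is still pending.
changelog_tail = async_phase_s(20e6, BANDWIDTH)

print(incremental_after_compaction, changelog_tail)
```

The sync phase does not appear in this model at all, which matches the question above: any difference there would have to come from flush and preparation work, not from upload volume.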




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #18431: [FLINK-25024][docs] Add Changelog backend docs

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #18431:
URL: https://github.com/apache/flink/pull/18431#issuecomment-1017952324


   ## CI report:
   
   * 480987ed4a563ed50095cba131e5de92f05fda63 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=30433) 
   

