You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by alpinegizmo <gi...@git.apache.org> on 2017/01/20 16:09:09 UTC

[GitHub] flink pull request #3184: [5456] [docs] state intro and new interfaces

GitHub user alpinegizmo opened a pull request:

    https://github.com/apache/flink/pull/3184

    [5456] [docs] state intro and new interfaces

    This is an update to the state intro for Flink 1.2, along with some introduction to the new state interfaces. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/alpinegizmo/flink 5456-docs-state-intro

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3184
    
----
commit f39bee9bdbfd678647b428986bcea6e0bcdc5029
Author: David Anderson <da...@alpinegizmo.com>
Date:   2017-01-18T14:56:18Z

    Resurrected and updated parts of the state intro.

commit ade63c1f575b126cbb2467313aa517603864d231
Author: David Anderson <da...@alpinegizmo.com>
Date:   2017-01-20T15:57:06Z

    fixed broken redirect and liquid syntax problem

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3184: [5456] [docs] state intro and new interfaces

Posted by alpinegizmo <gi...@git.apache.org>.
Github user alpinegizmo closed the pull request at:

    https://github.com/apache/flink/pull/3184


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3184: [5456] [docs] state intro and new interfaces

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/3184
  
    @alpinegizmo Could you please check out these changes: https://github.com/aljoscha/flink/tree/pr-3184-state-documentation? I took your PR, reshuffled the sections (keyed state is now mentioned before operator state) and added a section about operator state interfaces.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3184: [5456] [docs] state intro and new interfaces

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3184#discussion_r98240690
  
    --- Diff: docs/dev/stream/state.md ---
    @@ -39,40 +39,247 @@ if necessary) to allow applications to hold very large state.
     This document explains how to use Flink's state abstractions when developing an application.
     
     
    -## Keyed State and Operator state
    +## Operator State and Keyed State
     
    -There are two basic state backends: `Keyed State` and `Operator State`.
    +There are two basic kinds of state in Flink: `Operator State` and `Keyed State`.
     
    -#### Keyed State
    +### Operator State
     
    -*Keyed State* is always relative to keys and can only be used in functions and operators on a `KeyedStream`.
    -Examples of keyed state are the `ValueState` or `ListState` that one can create in a function on a `KeyedStream`, as
    -well as the state of a keyed window operator.
    -
    -Keyed State is organized in so called *Key Groups*. Key Groups are the unit by which keyed state can be redistributed and
    -there are as many key groups as the defined maximum parallelism.
    -During execution each parallel instance of an operator gets one or more key groups.
    +With *Operator State* (or *non-keyed state*), each operator state is
    +bound to one parallel operator instance.
    +The Kafka source connector is a good motivating example for the use of Operator State
    +in Flink. Each parallel instance of this Kafka consumer maintains a map
    +of topic partitions and offsets as its Operator State.
     
    -#### Operator State
    +New interfaces in Flink 1.2 subsume the `Checkpointed` interface in Flink 1.0 and
    +1.1, which has been deprecated. 
    +These new Operator State interfaces support redistributing state among
    +parallel operator instances when the parallelism is changed. 
     
    -*Operator State* is state per parallel subtask. It subsumes the `Checkpointed` interface in Flink 1.0 and Flink 1.1.
    -The new `CheckpointedFunction` interface is basically a shortcut (syntactic sugar) for the Operator State.
    -
    -Operator State needs special re-distribution schemes when parallelism is changed. There can be different variations of such
    -schemes; the following are currently defined:
    +There can be different schemes for doing this redistribution; the following are currently defined:
     
       - **List-style redistribution:** Each operator returns a List of state elements. The whole state is logically a concatenation of
         all lists. On restore/redistribution, the list is evenly divided into as many sublists as there are parallel operators.
         Each operator gets a sublist, which can be empty, or contain one or more elements.
     
     
    +### Keyed State
    +
    +*Keyed State* is always relative to keys and can only be used in functions and operators on a `KeyedStream`.
    +
    +You can think of Keyed State as Operator State that has been partitioned,
    +or sharded, with exactly one state-partition per key. 
    +Each keyed-state is logically bound to a unique
    +composite of <parallel-operator-instance, key>, and since each key
    +"belongs" to exactly one parallel instance of a keyed operator, we can
    +think of this simply as <operator, key>. 
    +
    +Keyed State is further organized into so-called *Key Groups*. Key Groups are the
    +atomic unit by which Flink can redistribute Keyed State; 
    +there are exactly as many Key Groups as the defined maximum parallelism.
    +During execution each parallel instance of a keyed operator works with the keys
    +for one or more Key Groups.
    +
    +
     ## Raw and Managed State
     
     *Keyed State* and *Operator State* exist in two forms: *managed* and *raw*.
     
     *Managed State* is represented in data structures controlled by the Flink runtime, such as internal hash tables, or RocksDB.
    -Examples are "ValueState", "ListState", etc. Flink's runtime encodes the states and writes them into the checkpoints.
    +Examples are "ValueState", "ListState", etc. Flink's runtime encodes
    +the states and writes them into the checkpoints. 
     
    -*Raw State* is state that users and operators keep in their own data structures. When checkpointed, they only write a sequence of bytes into
    +*Raw State* is state that operators keep in their own data structures. When checkpointed, they only write a sequence of bytes into
     the checkpoint. Flink knows nothing about the state's data structures and sees only the raw bytes.
     
    +All datastream functions can use managed state, but the raw state interfaces can only be used when implementing operators. 
    +Using managed state (rather than raw state) is recommended, since with
    +managed state Flink is able to automatically redistribute state when the parallelism is
    +changed, and also do better memory management.
    +
    +
    +## Using Managed Operator State
    +
    +A stateful function can implement either the more general `CheckpointedFunction` 
    +interface, or the `ListCheckpointed<T extends Serializable>` interface (which is semantically closer to the old 
    +`Checkpointed` one).
    +
    +[The Flink Function Migration documentation](../migration.html) has
    --- End diff --
    
    I think we shouldn't refer to the migration guide here because these interfaces are not only valid in a migration context but are valid for any Flink job.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #3184: [5456] [docs] state intro and new interfaces

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3184#discussion_r98240527
  
    --- Diff: docs/dev/stream/state.md ---
    @@ -39,40 +39,247 @@ if necessary) to allow applications to hold very large state.
     This document explains how to use Flink's state abstractions when developing an application.
     
     
    -## Keyed State and Operator state
    +## Operator State and Keyed State
     
    -There are two basic state backends: `Keyed State` and `Operator State`.
    +There are two basic kinds of state in Flink: `Operator State` and `Keyed State`.
     
    -#### Keyed State
    +### Operator State
     
    -*Keyed State* is always relative to keys and can only be used in functions and operators on a `KeyedStream`.
    -Examples of keyed state are the `ValueState` or `ListState` that one can create in a function on a `KeyedStream`, as
    -well as the state of a keyed window operator.
    -
    -Keyed State is organized in so called *Key Groups*. Key Groups are the unit by which keyed state can be redistributed and
    -there are as many key groups as the defined maximum parallelism.
    -During execution each parallel instance of an operator gets one or more key groups.
    +With *Operator State* (or *non-keyed state*), each operator state is
    +bound to one parallel operator instance.
    +The Kafka source connector is a good motivating example for the use of Operator State
    +in Flink. Each parallel instance of this Kafka consumer maintains a map
    +of topic partitions and offsets as its Operator State.
     
    -#### Operator State
    +New interfaces in Flink 1.2 subsume the `Checkpointed` interface in Flink 1.0 and
    --- End diff --
    
    I think we should document this without referring to old versions because that will quickly become outdated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3184: [5456] [docs] state intro and new interfaces

Posted by aljoscha <gi...@git.apache.org>.
Github user aljoscha commented on the issue:

    https://github.com/apache/flink/pull/3184
  
    Thanks for your work, @alpinegizmo. \U0001f44d 
    
    Could you please close this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---