Posted to dev@flink.apache.org by "Underwood (Jira)" <ji...@apache.org> on 2022/05/10 02:53:00 UTC

[jira] [Created] (FLINK-27559) Some question about flink operator state

Underwood created FLINK-27559:
---------------------------------

             Summary: Some question about flink operator state
                 Key: FLINK-27559
                 URL: https://issues.apache.org/jira/browse/FLINK-27559
             Project: Flink
          Issue Type: New Feature
         Environment: Flink 1.14.4
            Reporter: Underwood


I have two questions about Flink's state management:

 

1. Can a custom partitioner keep state? In my use case the partitioning strategy is adjusted dynamically, depending on the data in the DataStream: I want to apply different partitioning strategies under different data conditions.

 

As a simple example, I want the first 100 records in the DataStream to be range partitioned and the rest to be hash partitioned. For this I need a counter that tracks how many records have been processed. In a custom partitioner, however, such a counter is only a local variable, which causes two problems: it only works with a parallelism of 1, because it cannot count across slots, and its value is lost during failure recovery.
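To make the scenario concrete, here is a minimal plain-Java sketch of the partitioner I have in mind. It does not use the Flink API: SwitchingPartitioner is a hypothetical stand-in whose partition(key, numPartitions) method only mirrors the shape of Flink's Partitioner interface, and the maxKey bound used for the range split is an assumption made for illustration. The limit is set to 2 instead of 100 to keep the demo short.

```java
// Hypothetical stand-in, not a Flink class: same method shape as
// Partitioner#partition(key, numPartitions), but runnable without Flink.
class SwitchingPartitioner {
    private final int rangeLimit; // how many records are range partitioned first
    private final int maxKey;     // assumed exclusive upper bound of the key space
    private long seen = 0;        // plain field: per instance, never checkpointed

    SwitchingPartitioner(int rangeLimit, int maxKey) {
        this.rangeLimit = rangeLimit;
        this.maxKey = maxKey;
    }

    int partition(int key, int numPartitions) {
        if (seen++ < rangeLimit) {
            // Range partitioning: split [0, maxKey) into contiguous ranges.
            int clamped = Math.max(0, Math.min(key, maxKey - 1));
            return (int) ((long) clamped * numPartitions / maxKey);
        }
        // Hash partitioning for every record after the first rangeLimit ones.
        return Math.floorMod(Integer.hashCode(key), numPartitions);
    }

    long seen() {
        return seen;
    }
}

class PartitionerSketch {
    public static void main(String[] args) {
        SwitchingPartitioner first = new SwitchingPartitioner(2, 100);
        System.out.println(first.partition(10, 4)); // range partitioned
        System.out.println(first.partition(75, 4)); // range partitioned
        System.out.println(first.partition(10, 4)); // hash partitioned from here on

        // A second instance (another parallel subtask, or the same one after
        // a restart) starts counting from zero again.
        SwitchingPartitioner second = new SwitchingPartitioner(2, 100);
        System.out.println(second.seen());
    }
}
```

The second instance shows both problems: each copy of the partitioner counts only its own records, and nothing restores the counter after a failure.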

 

Flink already has operator state and keyed state, but a custom partitioner is not an operator, so I don't think state can solve this problem. I have considered introducing ZooKeeper to store the state, but adding a new component would bloat the system. Is there a better way to solve this?

 

2. How can multiple operators share the same state, or even all parallel subtasks of different operators?

 

As a simple example, my stream processing has four stages: source -> mapA -> mapB -> sink. I want a state value, count, that tracks the total number of records processed by all operators. For example, when the source receives one record, count is incremented by 1 when mapA processes it and by 1 again when mapB processes it, so once this record has been fully processed, count is 2.
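The counting semantics I am after can be sketched in plain Java as follows. This is only an illustration inside a single JVM, not Flink code: SharedCountSketch, counted and run are hypothetical names, and the trim and upper-case functions are placeholders for the real mapA and mapB logic. In a real job the subtasks may run in different processes, so a shared in-memory counter like this would neither be visible to all of them nor survive a failure, which is exactly the gap I am asking about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.UnaryOperator;

class SharedCountSketch {
    // Wraps a map function so that every processed record bumps the shared count.
    static <T> UnaryOperator<T> counted(UnaryOperator<T> fn, AtomicLong count) {
        return record -> {
            count.incrementAndGet();
            return fn.apply(record);
        };
    }

    // Runs source -> mapA -> mapB -> sink and returns the final value of count.
    static long run(List<String> source) {
        AtomicLong count = new AtomicLong();
        UnaryOperator<String> mapA = counted(String::trim, count);        // placeholder logic
        UnaryOperator<String> mapB = counted(String::toUpperCase, count); // placeholder logic
        List<String> sink = new ArrayList<>();
        for (String record : source) {
            sink.add(mapB.apply(mapA.apply(record)));
        }
        return count.get();
    }

    public static void main(String[] args) {
        // One record passes through mapA and mapB, so count ends at 2.
        System.out.println(run(List.of(" a "))); // prints 2
    }
}
```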

 

I don't know whether Flink has a state mechanism that fits this scenario and also recovers from failures. So far the only option we can think of is ZooKeeper. Is there a better way?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)