You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Fabian Hueske (JIRA)" <ji...@apache.org> on 2018/02/08 08:38:02 UTC

[jira] [Commented] (FLINK-8601) Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization

    [ https://issues.apache.org/jira/browse/FLINK-8601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356645#comment-16356645 ] 

Fabian Hueske commented on FLINK-8601:
--------------------------------------

Hi [~sihuazhou], thanks for opening this Jira and preparing a design document.

I see that the status of the issue is "In Progress" so I assume that you are already working on this feature.
Has there been a discussion about adding a new state primitive that I missed? I'm asking because adding a new state primitive means that it has to be supported by all state backends which is a major implementation effort. At the same time, the proposed state type seems to address only very specific use cases. Before starting to work on this, I would get the opinion of other developers whether this is a feature that we'd like to add to Flink or if there is another way to solve the problem.

[~stefanrichter83@gmail.com], [~aljoscha] What do you think?

Thanks, Fabian

> Introduce LinkedBloomFilterState for Approximate calculation and other situations of performance optimization
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8601
>                 URL: https://issues.apache.org/jira/browse/FLINK-8601
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, DataStream API
>    Affects Versions: 1.4.0
>            Reporter: Sihua Zhou
>            Assignee: Sihua Zhou
>            Priority: Major
>
> h3. Backgroud
> Bloom filter is useful in many situation, for example:
>  * 1. Approximate calculation: deduplication (eg: UV calculation)
>  * 2. Performance optimization: eg, [runtime filter join|https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]
> However, based on the current status provided by flink, it is hard to use the bloom filter for the following reasons:
>  * 1. Serialization problem: Bloom filter status can be large (for example: 100M), if implement it based on the RocksDB state, the state data will need to be serialized each time it is queried and updated, and the performance will be very poor.
>  * 2. Data skewed: Data in different key group can be skewed, and the information of data skewed can not be accurately predicted before the program is running. Therefore, it is impossible to determine how much resources bloom filter should allocate. One way to do this is to allocate space needed for the most skewed case, but this can lead to very serious waste of resources.
> h3. Requirement
> Therefore, I introduce the LinkedBloomFilterState for flink, which at least need to meet the following features:
>  * 1. Support for changing Parallelism
>  * 2. Only serialize when necessary: when performing checkpoint
>  * 3. Can deal with data skew problem: users only need to specify a LinkedBloomFilterState with the desired input, fpp, system will allocate resource dynamic.
>  * 4. Do not conflict with other state: user can use KeyedState and OperateState when using bloom filter state.
>  * 5. Support relax ttl (ie: the data survival time at least greater than the specified time)
> Design doc:  [design doc|https://docs.google.com/document/d/1yMCT2ogE0CtSjzRvldgi0ZPPxC791PpkVGkVeqaUUI8/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)