You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Gary Yao (JIRA)" <ji...@apache.org> on 2019/04/26 10:27:00 UTC

[jira] [Comment Edited] (FLINK-12294) kafka consumer, data locality

    [ https://issues.apache.org/jira/browse/FLINK-12294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826858#comment-16826858 ] 

Gary Yao edited comment on FLINK-12294 at 4/26/19 10:26 AM:
------------------------------------------------------------

Link to user mailing list: [http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/kafka-partitions-data-locality-td27355.html]

Imo this is not a problem that is unique to the Kafka connector. Maybe you can update the title of this issue.


was (Author: gjy):
Link to user mailing list: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/kafka-partitions-data-locality-td27355.html

> kafka consumer, data locality
> -----------------------------
>
>                 Key: FLINK-12294
>                 URL: https://issues.apache.org/jira/browse/FLINK-12294
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / Kafka, Runtime / Coordination
>            Reporter: Sergey
>            Priority: Major
>              Labels: performance
>
> Additional flag (with default false value) controlling whether topic partitions already grouped by the key. Exclude unnecessary shuffle/resorting operation when this parameter set to true. As an example, say we have client's payment transaction in a kafka topic. We grouping by clientId (transaction with the same clientId goes to one kafka topic partition) and the task is to find max transaction per client in sliding windows. In terms of map\reduce there is no needs to shuffle data between all topic consumers, may be it`s worth to do within each consumer to gain some speedup due to increasing number of executors within each partition data. With N messages (in partition) instead of N*ln(N) (current realization with shuffle/resorting) it will be just N operations. For windows with thousands events - the tenfold gain of execution speed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)