Posted to dev@kafka.apache.org by "Pedro Cardoso Silva (Jira)" <ji...@apache.org> on 2021/10/12 09:13:00 UTC

[jira] [Created] (KAFKA-13368) Support smart topic polling for consumer with multiple topic subscriptions

Pedro Cardoso Silva created KAFKA-13368:
-------------------------------------------

             Summary: Support smart topic polling for consumer with multiple topic subscriptions
                 Key: KAFKA-13368
                 URL: https://issues.apache.org/jira/browse/KAFKA-13368
             Project: Kafka
          Issue Type: Wish
          Components: consumer
            Reporter: Pedro Cardoso Silva


Currently there is no way to control how a Kafka consumer polls messages from the list of topics it has subscribed to. If I understand correctly, the current approach is round-robin polling across all topics the consumer has subscribed to.
This works reasonably well when the consumer's offsets are close to the latest offsets of the topics. However, if the consumer is configured to start from the earliest offset and the topics hold very different numbers of messages, there is no guarantee or control over how to selectively read from each topic.
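
For illustration, a minimal sketch of this setup (the topic names, broker address and group id below are placeholders, not part of the report): the consumer subscribes to two topics, and nothing in the configuration or in poll() itself lets the application choose which topic the next batch favours.

{code:java}
// Minimal sketch: a consumer subscribed to two topics, reading from the earliest offset.
// "topic-a", "topic-b", the broker address and the group id are placeholder values.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class MultiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "multi-topic-demo");
        props.put("auto.offset.reset", "earliest"); // start from the earliest available offset
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("topic-a", "topic-b"));
            while (true) {
                // The mix of topics in this batch is decided by the client internals,
                // not by the caller; there is no per-topic knob here.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s %d %s%n", record.topic(), record.timestamp(), record.value());
                }
            }
        }
    }
}
{code}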

Depending on the use case, it may be useful for the developer using the Kafka consumer to override this polling mechanism with a custom strategy that makes sense for the downstream application.

Suppose you have 2 or more topics that you want to merge into a single topic, but due to large differences between the topics' message rates you want to control which topics are polled at a given time.

As an example consider 2 topics with the following schemas:

{code:java}
Topic1 Schema: {
   timestamp: Long,
   key: String,
   col1: String,
   col2: String
}

Topic2 Schema: { 
   timestamp: Long,
   key: String,
   col3: String,
   col4: String 
}
{code}

Here, Topic1 has 1,000,000 events from timestamp 0 to 1,000 (1,000 ev/s) and Topic2 has 50,000 events from timestamp 0 to 50,000 (1 ev/s).

Next we define a Kafka consumer that subscribes to Topic1 and Topic2. With the current round-robin behaviour, and assuming a polling batch of 100 messages, by the time Topic2 is exhausted we would have read roughly 50,000 messages from each topic, which maps to 50 seconds' worth of messages on Topic1 and 50,000 seconds' worth of messages on Topic2.

If we then try to sort the messages by timestamp we get incorrect results: the remaining 950,000 messages from Topic1 are missing and should have been interleaved between messages 0 and 1,000 of Topic2.

The workaround is either to buffer the messages from Topic2 or to run one Kafka consumer per topic, which carries significant overhead: periodic heartbeats, consumer registration in consumer groups, re-balancing, etc.
For a couple of topics this approach may be OK, but it does not scale to tens, hundreds or more topics in a subscription.
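
For completeness, here is a rough sketch of the buffering workaround (assumptions: the topic names above, a local broker, and an arbitrary 1-second skew threshold). It leans on the existing pause()/resume() calls to hold back whichever topic has run ahead in event time, and releases buffered records in timestamp order only once every topic has caught up.

{code:java}
// Rough sketch of the buffering workaround, not a proposal for the consumer itself.
// Partitions of a topic that has run ahead in event time are paused until the other
// topic catches up; buffered records are emitted in timestamp order.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

public class TimestampAlignedMerge {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "merge-demo");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("Topic1", "Topic2"));

        // Highest event timestamp seen so far, per topic.
        Map<String, Long> watermark = new HashMap<>();
        // Records held back until every topic has reached their timestamp.
        PriorityQueue<ConsumerRecord<String, String>> buffer =
                new PriorityQueue<>(Comparator.comparingLong(ConsumerRecord::timestamp));
        long maxSkewMs = 1_000L; // how far ahead a topic may run before it is paused (assumption)

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                watermark.merge(record.topic(), record.timestamp(), Math::max);
                buffer.add(record);
            }
            if (watermark.size() < 2) {
                continue; // wait until both topics have produced at least one record
            }
            long minWatermark = Collections.min(watermark.values());

            // Pause partitions of topics running too far ahead of the slowest topic; resume the rest.
            Set<TopicPartition> toPause = new HashSet<>();
            Set<TopicPartition> toResume = new HashSet<>();
            for (TopicPartition tp : consumer.assignment()) {
                long topicWatermark = watermark.getOrDefault(tp.topic(), Long.MIN_VALUE);
                if (topicWatermark > minWatermark + maxSkewMs) {
                    toPause.add(tp);
                } else {
                    toResume.add(tp);
                }
            }
            consumer.pause(toPause);
            consumer.resume(toResume);

            // Emit buffered records that every topic has already passed in event time.
            while (!buffer.isEmpty() && buffer.peek().timestamp() <= minWatermark) {
                ConsumerRecord<String, String> next = buffer.poll();
                System.out.printf("%d %s %s%n", next.timestamp(), next.topic(), next.value());
            }
        }
    }
}
{code}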

The ideal solution would be to extend the Kafka consumer API to allow a user to define how to selectively poll messages from a subscription.
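
Purely to illustrate the wish (no such interface exists in the Kafka clients today), something along the following lines would cover the use case above: a user-supplied strategy that, before each fetch, picks which of the assigned partitions to read from.

{code:java}
// Hypothetical sketch only: neither this interface nor the wiring below exists today.
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

public interface TopicPollingStrategy {

    /**
     * Called before each fetch. Given the consumer's current position in every
     * assigned partition, return the subset of partitions the next poll() should
     * fetch from; the remaining partitions are skipped for this round.
     */
    Set<TopicPartition> selectPartitions(Map<TopicPartition, Long> currentPositions);
}

// Hypothetical wiring, for illustration only:
//   consumer.subscribe(Arrays.asList("Topic1", "Topic2"));
//   consumer.setPollingStrategy(new TimestampAlignedStrategy());
{code}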



--
This message was sent by Atlassian Jira
(v8.3.4#803005)