You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Jamie Park <ja...@gmail.com> on 2018/03/25 04:27:48 UTC
consistent consumption rate

Hi,

I am developing Kafka stream-application using low level processor API.

I am consuming records from multiple topics, applying some joining logics,
and then  sending them off another topics.

1) How can I make sure my application consume records at consistent rate
across different topics?  If I were to distribute data across multiple
topics, would that ensure that data will be consumed consistently?  For
example,  let's say topicA, topicB,topicC has ~1000 records in 5 minute
window.  Can I assume these records will be consumed in the stream app at
the same time?  - meaning all 3000 records will land in the streaming
application consistently.

2) Is there a way to consume some records faster than others in streaming
application?  Let's say topicA has 12X-15X times records being produced at
a rate topicC produces records.  Can I consume topicA somehow faster than I
do for topicC? I am trying to overcome this topic volume imbalance by
breaking down topicA into 12 topics ( so each sub-topicA has similar volume
with topicC).

3)  My join logic depends on the same event-time window of records from
multiple topics I consume from.  How can I make sure the same event-time
window gets processed for each topic?

4) Is there a way to "hold" or "pause" on consuming records in Processor
temporarily?  I thought that if somehow one Processor consuming certain
records takes too long, or "joinable" records arrive late,  I was hoping I
could delay another Processor that is "waiting" for it's joinable records.
 -- I am achieving this by hold the records in StateStore but this wait
seems arbitrary.  I am wondering if there is better solution than this.

5)  Let's say I have 40 partitions on 10 topics I consume from and I have
40 instances of stream-application consuming from these.  In this case how
does setting "num.stream.threads" work?   I see that is still uses one
thread for "fetching" which doesn't really help the cause on "fetching"
stuff but some of these threads are being used for "consuming" from
monitoring JMX.  Since each application handles 10 partitions ( one
partition for each of 10 topics), I was setting num.stream.threads=10 but I
don't have a very clear understanding of this value.  Would you recommend
increasing num of threads here?


I'd very much appreciate any help on any of these questions.

Thank you