Posted to users@kafka.apache.org by GURU PRAVEEN <co...@gmail.com> on 2017/06/14 06:21:45 UTC

Fat partition - Kafka Spark streaming

Hi,

We have an integrated Kafka and Spark Streaming app that listens to Twitter and pushes the tweets to Kafka, where they are later consumed by a Spark app.

We constantly see one of the Kafka partitions holding more data than the others, and we have not been able to zero in on the root cause.

We use the tweet id as the key and partition on it. We established that tweet ids (snowflake) are very evenly distributed, and we don't see any issues with the distribution (% even, % prime, % odd number of partitions). Still, partition 3 has more data, and its offset range is always larger than that of the other partitions.
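
The distribution check described above could be sketched roughly as follows. This is only an illustration under assumptions: it treats the partition as tweet id modulo the partition count, and the names `partition_counts`, `skew_ratio`, and `partition_fn` are hypothetical, not part of Kafka or Spark. If the producer relies on Kafka's default partitioner, the partition comes from a hash of the serialized key bytes rather than from the numeric id, so the partitioner actually in use would have to be passed in as `partition_fn` for the result to be meaningful.

```python
from collections import Counter


def partition_counts(keys, num_partitions, partition_fn=None):
    """Count how many keys land in each partition.

    partition_fn(key, num_partitions) -> partition index.
    Assumption: defaults to plain modulo on an integer key.
    """
    if partition_fn is None:
        partition_fn = lambda key, n: key % n
    counts = Counter(partition_fn(key, num_partitions) for key in keys)
    # Return a dense list so empty partitions show up as 0.
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(counts):
    """Largest partition count divided by the mean count.

    1.0 means perfectly even; larger values mean a fatter partition.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    mean = total / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    # Hypothetical stand-in for a sample of real tweet ids.
    sample_ids = range(1000)
    for n in (4, 5, 7):
        counts = partition_counts(sample_ids, n)
        print(n, counts, round(skew_ratio(counts), 3))
```

Running the same count over keys sampled from the topic itself, with the real partitioner plugged in, would show whether the skew toward partition 3 comes from the key-to-partition mapping or from somewhere else in the pipeline.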

Any suggestions or directions to debug this further would be much appreciated.

Thank you.
Gurupraveen