Posted to dev@kafka.apache.org by Vikas Talegaonkar <vi...@bamlabs.com> on 2018/10/16 18:05:56 UTC

Help Needed

Hello Kafka Dev,
 We need help with a consumer lag issue that we are seeing in one of our environments, which does not have much load. We run Kafka in multiple environments, and in one of them events take a very long time (sometimes more than a day) to be processed from Kafka. The topic has two partitions and 3 replicas, and two consumers are running on it (so there is a one-to-one mapping between partitions and consumers). When I run kafka-consumer-groups.sh to check the stats, I can see lag on one of the consumers; after some time the lag moves to the other consumer, and it keeps switching back and forth, which increases the time taken to process events. It looks to me as if rebalancing is happening, but the consumer-id stays the same, so the consumers are not being restarted in between. We also tried restarting Kafka and ZooKeeper, but the end result is the same. Here are the details.


[2018-10-12 03:52:21,676] WARN Removing server circle2-kafka2:909 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
group-es
group-rds

[vikas@circle1-kafka1 kafka]$ ./bin/kafka-consumer-groups.sh --bootstrap-server circle1-kafka1:9092,circle2-kafka2:9092, circle1-kafka3 -describe -group group-rds
Note: This will not show information about old Zookeeper-based consumers.
[2018-10-12 03:53:06,226] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
[2018-10-12 03:53:06,436] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                                                                   HOST            CLIENT-ID
topic.events    1          45471           45471           0               data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-dc1cb0e1-48fb-40c5-bd96-0e9980e1083d /172.27.4.133   data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds
topic.events    0          344987          346323          1336            data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-3a13af04-048f-40b4-9b09-b74a9600dfd8 /172.27.4.133   data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds



[vikas@circle1-kafka1 kafka]$ ./bin/kafka-consumer-groups.sh --bootstrap-server circle1-kafka1:9092,circle2-kafka2:9092,circle1-kafka3 -describe -group group-rds
Note: This will not show information about old Zookeeper-based consumers.
[2018-10-12 04:04:29,725] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)
[2018-10-12 04:04:29,926] WARN Removing server circle2-kafka2:9092 from bootstrap.servers as DNS resolution failed for circle2-kafka2 (org.apache.kafka.clients.ClientUtils)

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                                                                   HOST            CLIENT-ID
topic.events    1          44873           45471           598             data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-dc1cb0e1-48fb-40c5-bd96-0e9980e1083d /172.27.4.133   data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds
topic.events    0          346324          346324          0               data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds-3a13af04-048f-40b4-9b09-b74a9600dfd8 /172.27.4.133   data-consumer-i-00404a50d7551ef37-circle1-ecs2-group-rds
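
To make the two snapshots above easier to compare, here is a minimal sketch that recomputes the LAG column (LOG-END-OFFSET minus CURRENT-OFFSET) and the change in committed offset per partition between the 03:53 and 04:04 runs. The offsets are copied from the output above; the variable and function names are only illustrative and are not part of any Kafka tooling.

```python
# Offsets copied from the two kafka-consumer-groups.sh snapshots above.
# Layout (illustrative): partition -> (CURRENT-OFFSET, LOG-END-OFFSET)
snapshot_0353 = {1: (45471, 45471), 0: (344987, 346323)}
snapshot_0404 = {1: (44873, 45471), 0: (346324, 346324)}


def lag(snapshot):
    """Return LAG per partition, i.e. LOG-END-OFFSET - CURRENT-OFFSET."""
    return {p: end - cur for p, (cur, end) in snapshot.items()}


def committed_delta(before, after):
    """Return the change in committed (current) offset per partition."""
    return {p: after[p][0] - before[p][0] for p in before}


if __name__ == "__main__":
    print("lag at 03:53:", lag(snapshot_0353))
    print("lag at 04:04:", lag(snapshot_0404))
    print("committed offset change:", committed_delta(snapshot_0353, snapshot_0404))
```

A negative change for a partition means its committed offset moved backwards between the two runs, which is what partition 1 shows here (45471 down to 44873) while its log end offset stayed at 45471.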



Here is the information about our Kafka environment:
1) Version -> kafka_2.11-1.1.0

2) ZooKeeper settings -> Default

3) Kafka settings -> Most of the settings are default; here are the few specific changes we have made:
zookeeper.connection.timeout.ms=6000
#Setting the replication for nodes under the default of 3
default.replication.factor=3
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.retention.hours=24

Please let me know if you need more details from my end.

Your quick help would be much appreciated. If you are not able to help, or if I have posted to the wrong group, please point me to the right one.

Regards,
Vikas