Posted to jira@kafka.apache.org by "Youssef BOUZAIENNE (Jira)" <ji...@apache.org> on 2020/06/09 09:12:00 UTC

[jira] [Commented] (KAFKA-7888) kafka cluster not recovering - Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition) continuously

    [ https://issues.apache.org/jira/browse/KAFKA-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129031#comment-17129031 ] 

Youssef BOUZAIENNE commented on KAFKA-7888:
-------------------------------------------

I'm facing the same issue on 2.4.1.

> kafka cluster not recovering - Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition) continuously
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-7888
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7888
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, replication, zkclient
>    Affects Versions: 2.1.0
>         Environment: using kafka_2.12-2.1.0
> 3 ZKs, 3-broker cluster on 3 boxes (1 ZK and 1 broker on each box), default.replication.factor: 2,
> offsets topic replication factor was 1 when the error happened; increased to 2 after seeing this error by reassigning partitions.
> compression: default (producer) on the brokers, but the producers send gzip.
> Linux (Red Hat), ext4; Kafka logs on a single local disk
>            Reporter: Kemal ERDEN
>            Priority: Major
>         Attachments: combined.log, producer.log
>
>
> We're seeing the following repeating logs on our Kafka cluster from time to time, which seem to cause messages expiring on the producers and the cluster going into a non-recoverable state. The only fix seems to be to restart the brokers.
> {{Shrinking ISR from 14,13 to 13 (kafka.cluster.Partition)}}
>  {{Cached zkVersion [21] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)}}
>  and later on the following log is repeated:
> {{Got user-level KeeperException when processing sessionid:0xe046aa4f8e60000 type:setData cxid:0x2df zxid:0xa000001fd txntype:-1 reqpath:n/a Error Path:/brokers/topics/ucTrade/partitions/6/state Error:KeeperErrorCode = BadVersion for /brokers/topics/ucTrade/partitions/6/state}}
> We haven't interfered with any of the brokers/zookeepers whilst this happened.
> I've attached a combined log that merges the controller, server and state-change logs from each broker (ids 13, 14 and 15; the log files have the suffixes b13, b14 and b15 respectively).
> We have increased the heaps from 1g to 6g for the brokers and from 512m to 4g for the ZooKeepers since this happened, but we're not sure if that is relevant. The ZK logs have unfortunately been overwritten, so we can't provide those.
> We produce varying message sizes; some messages are relatively large (6 MB), but we use compression on the producers (set to gzip).
> I've attached some logs from one of our producers as well.
> producer.properties that we've changed:
> spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
> spring.kafka.producer.compression-type=gzip
> spring.kafka.producer.retries=5
> spring.kafka.producer.acks=-1
> spring.kafka.producer.batch-size=1048576
> spring.kafka.producer.properties.linger.ms=200
> spring.kafka.producer.properties.request.timeout.ms=600000
> spring.kafka.producer.properties.max.block.ms=240000
> spring.kafka.producer.properties.max.request.size=104857600
>  
>  
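For context, here is a minimal sketch of how the spring.kafka.producer.* properties listed above would translate to the underlying Kafka ProducerConfig keys. The broker addresses and the value serializer are assumptions (they aren't given in the report), and the exact Spring Boot property mapping may vary by Spring Kafka version.

{code:java}
// Minimal sketch: raw producer configuration roughly equivalent to the
// spring.kafka.producer.* properties quoted above.
// Broker addresses and the value serializer are placeholders, not from the report.
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker13:9092,broker14:9092,broker15:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());       // assumed
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        props.put(ProducerConfig.RETRIES_CONFIG, 5);
        props.put(ProducerConfig.ACKS_CONFIG, "all");               // acks=-1 is equivalent to acks=all
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1048576);       // 1 MB batches
        props.put(ProducerConfig.LINGER_MS_CONFIG, 200);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 600000);
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 240000L);
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 104857600);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send records here; the 6 MB gzip-compressed messages mentioned above
            // fit well under the max.request.size of 100 MB configured here
        }
    }
}
{code}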



--
This message was sent by Atlassian Jira
(v8.3.4#803005)