Posted to dev@kafka.apache.org by "kaushik srinivas (Jira)" <ji...@apache.org> on 2021/08/08 08:05:00 UTC

[jira] [Created] (KAFKA-13177) partition failures and few ISR shrinks but many ISR expansions with increased num.replica.fetchers in Kafka brokers

kaushik srinivas created KAFKA-13177:
----------------------------------------

             Summary: partition failures and few ISR shrinks but many ISR expansions with increased num.replica.fetchers in Kafka brokers
                 Key: KAFKA-13177
                 URL: https://issues.apache.org/jira/browse/KAFKA-13177
             Project: Kafka
          Issue Type: Bug
            Reporter: kaushik srinivas


Setup: a 3-node Kafka broker cluster on Kubernetes (4 CPU cores and 4Gi memory per broker).

Topics: 15, partitions per topic: 15, replication factor: 3, min.insync.replicas: 2

Producers run with acks=all.

Initially num.replica.fetchers was set to 1 (the default) and we observed very frequent ISR shrinks and expansions, so we tuned it up to 4.
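
For context, a minimal sketch of how the producers described above are configured. The bootstrap address and serializers are placeholders, not our exact application code; only acks=all and the topic-level min.insync.replicas=2 are from the setup above.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerSetup {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder bootstrap address; the real one points at the k8s service.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // acks=all as described above
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // With min.insync.replicas=2 on the topics, acks=all means each send is only
            // acknowledged once at least 2 in-sync replicas have the record, which is why
            // ISR shrinks directly affect produce availability for us.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // producer.send(...) calls go here in the real application.
            }
        }
    }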

After this change, we see the following behavior and warning messages in the broker logs:
 # Over a period of 2 days, there are around 10 ISR shrinks affecting 10 partitions, but around 700 ISR expansions affecting almost all partitions in the cluster (approx. 50 to 60 partitions).
 # We see frequent WARN messages in the same time span about partitions being marked as failed (a way of cross-checking the affected topic's current ISR state is sketched after this list). Below is the trace: {"type":"log", "host":"wwwwww", "level":"WARN", "neid":"kafka-wwwwww", "system":"kafka", "time":"2021-08-03T20:09:15.340", "timezone":"UTC", "log":{"message":"ReplicaFetcherThread-2-1003 - kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1001, leaderId=1003, fetcherId=2] Partition test-16 marked as failed"}}
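
To cross-check whether the expansions actually restore full ISR membership, a minimal sketch using the standard Kafka Admin client. The topic name "test" is read off the trace above (the TopicPartition string is ambiguous about where the topic name ends) and the bootstrap address is a placeholder.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class IsrCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
            try (Admin admin = Admin.create(props)) {
                Map<String, TopicDescription> topics =
                        admin.describeTopics(Collections.singletonList("test")).all().get();
                for (TopicDescription td : topics.values()) {
                    for (TopicPartitionInfo p : td.partitions()) {
                        // With replication factor 3, any partition reporting fewer than
                        // 3 ISR members still has a follower catching up.
                        System.out.printf("%s-%d leader=%s isr=%d/%d%n",
                                td.name(), p.partition(),
                                p.leader() == null ? "none" : p.leader().idString(),
                                p.isr().size(), p.replicas().size());
                    }
                }
            }
        }
    }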

 

We have seen this behavior continuously since increasing num.replica.fetchers from 1 to 4. We increased it to improve replication performance and thereby reduce the ISR shrinks.

Instead we see this strange behavior after the change. What does the above trace indicate? Is marking partitions as failed just a WARN that Kafka recovers from on its own, or is it something to worry about?
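For reference, a minimal sketch of one way to watch the ISR churn and under-replication counters over JMX while we investigate. The host and port are placeholders and JMX being enabled on the brokers is an assumption about the deployment; the metric names are the standard kafka.server ReplicaManager metrics.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class IsrChurnCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; assumes the broker was started with JMX enabled.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://kafka-0:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Meters tracking ISR churn on this broker.
                for (String name : new String[] {"IsrShrinksPerSec", "IsrExpandsPerSec"}) {
                    ObjectName bean = new ObjectName("kafka.server:type=ReplicaManager,name=" + name);
                    System.out.printf("%s: count=%s oneMinuteRate=%s%n",
                            name,
                            mbs.getAttribute(bean, "Count"),
                            mbs.getAttribute(bean, "OneMinuteRate"));
                }
                // Gauge: should stay at 0 when every ISR is fully populated.
                ObjectName urp = new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
                System.out.println("UnderReplicatedPartitions: " + mbs.getAttribute(urp, "Value"));
            } finally {
                connector.close();
            }
        }
    }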



--
This message was sent by Atlassian Jira
(v8.3.4#803005)