Posted to users@kafka.apache.org by Calvin Lewis <Ca...@inrix.com> on 2017/08/28 19:46:40 UTC

Kafka behavior not consistent between environments

Hey Kafka Team,


We have Kafka deployed in two data centers (DC X and DC Y). DC X seems to be functioning perfectly fine, while DC Y consistently produces errors from the brokers as well as the Consumer/Producer/Stream applications. The configuration is essentially the same between the DCs; the largest difference is the Zookeeper servers the brokers are utilizing.


In DC Y the broker errors we see are (consistent across all topics; a leadership/ISR cross-check sketch follows the list):


  *   [2017-08-24 20:11:31,282] ERROR [ReplicaFetcherThread-0-5], Error for partition [some-topic,183] to broker 5:org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition. (kafka.server.ReplicaFetcherThread)

  *   [2017-08-24 16:58:48,180] ERROR [ReplicaFetcherThread-0-5], Error for partition [some-topic,192] to broker 5:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
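
For triage, here is a minimal sketch of how partition leadership and ISR can be cross-checked from a client, assuming Kafka 0.11.0+ where the Java AdminClient is available (the bootstrap address is a placeholder, not our actual broker name):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class LeadershipCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address - point this at a broker in the affected DC.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.dc-y.example:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("some-topic")).all().get();

                // Print leader, replicas, and ISR for every partition so it can be
                // compared against what the ReplicaFetcherThread errors claim.
                for (TopicDescription description : topics.values()) {
                    for (TopicPartitionInfo partition : description.partitions()) {
                        System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            partition.partition(),
                            partition.leader(),
                            partition.replicas(),
                            partition.isr());
                    }
                }
            }
        }
    }

If the leadership reported here disagrees with what the fetcher threads believe, that would point at stale metadata propagation rather than a missing topic.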


The Consumer/Producer/Stream errors we see when topic replication is set to one (consistent across all topics and all application groups; a sketch of the producer settings named in the first error follows the list):

  *   2017-08-28 18:36:50 [kafka-producer-network-thread | some-application-group-1-StreamThread-1-producer][RecordCollectorImpl$1][ERROR]: task [0_89] Error sending record to topic some-topic
      org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for some-topic-40 due to 30099 ms has passed since batch creation plus linger time

  *   [kafka-producer-network-thread | some-application-group-1-StreamThread-1-producer][RecordCollectorImpl$1][ERROR]: task [0_89] Error sending record to topic some-topic
      org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
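
For reference, here is a minimal sketch of where the settings named in that TimeoutException live when overriding the Streams-internal producer (the application id, bootstrap address, and values below are placeholders, not our actual configuration or a recommendation):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsProducerTimeouts {
        public static Properties buildConfig() {
            Properties props = new Properties();
            // Placeholder application id and bootstrap address.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "some-application-group-1");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.dc-y.example:9092");

            // The settings referenced in the error message: the producer expires a
            // batch once more than request.timeout.ms has passed since batch
            // creation plus linger time.
            props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 30000);
            props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 100);

            // How long send() may block waiting for metadata or buffer space.
            props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 60000);

            return props;
        }
    }
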

When we set replication to two or more, we see this additional Consumer/Producer/Stream error (consistent across all topics and all application groups; a retry-settings sketch follows the list):

  *   [kafka-producer-network-thread | some-application-group-1-StreamThread-1-producer][RecordCollectorImpl$1][ERROR]: task [0_32] Error sending record to topic pcp-out-eu1
      org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
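
Since NotLeaderForPartitionException is retriable once the producer refreshes its metadata, here is a minimal sketch of the retry-related settings for one of the classic producer applications (broker address, topic name, and values are placeholders):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RetryingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.dc-y.example:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // NotLeaderForPartitionException is retriable: with retries > 0 the producer
            // refreshes metadata and resends instead of surfacing the error to the caller.
            props.put(ProducerConfig.RETRIES_CONFIG, 10);
            props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);
            // acks=all so a retried send still waits for the full ISR.
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("some-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Only non-retriable errors, or errors that exhausted retries, land here.
                            exception.printStackTrace();
                        }
                    });
                producer.flush();
            }
        }
    }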


We very rarely see any of these errors in DC X - a few times every 2-4 days. In DC Y we see these errors 3 million+ times a day. The errors seem to indicate networking issues, but so far we haven't been able to identify any concrete problems: packet loss is non-existent, ping times are short, and bandwidth is available.

Currently each broker, in both DC X and DC Y, is configured with 2 disks for storing logs. In general the disk usage is roughly the same, but sometimes, in both DCs, we see a fairly large imbalance in disk usage on a broker; for example, one disk is 45% full while the other is 85% full. The brokers recover and eventually even things out. So far we have not been able to correlate this with the occurrence of the errors, but it is something we're investigating. When we noticed this problem we looked into potential disk and I/O issues but haven't identified any that seem to cause the errors.
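
To make it easier to correlate the imbalance with the error spikes, here is a minimal sketch that reports per-log-directory size and partition count (the directory paths stand in for the two Kafka data disks and are placeholders):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LogDirUsage {
        public static void main(String[] args) throws IOException {
            // Placeholder paths for the two Kafka log directories (log.dirs).
            String[] logDirs = {"/data/kafka-1", "/data/kafka-2"};

            for (String dir : logDirs) {
                Path path = Paths.get(dir);
                long partitionDirs = 0;
                long usedBytes = 0;
                try (Stream<Path> children = Files.list(path)) {
                    // Each topic-partition gets its own sub-directory under log.dirs.
                    partitionDirs = children.filter(Files::isDirectory).count();
                }
                try (Stream<Path> all = Files.walk(path)) {
                    usedBytes = all.filter(Files::isRegularFile)
                                   .mapToLong(p -> p.toFile().length())
                                   .sum();
                }
                long freeBytes = new File(dir).getUsableSpace();
                System.out.printf("%s: partitions=%d used=%dMB free=%dMB%n",
                    dir, partitionDirs, usedBytes / (1024 * 1024), freeBytes / (1024 * 1024));
            }
        }
    }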

Here is how our environments are configured -

DC X (The DC that is working) -

  *   All Servers are CentOS 7 VMs hosted on Windows Hyper-V. We are using Nimble Storage.
  *   5 Kafka Servers
     *   8 cores at 2 GHz
     *   32 GB memory
     *   1 x 50GB HD (used by OS)
     *   2 x 250GB HD (used by Kafka only)
  *   3 Zookeeper Servers
     *   4 cores at 2 GHz
     *   32 GB memory
     *   60GB HD (OS and Zookeeper)
  *   8 Unique Topics
     *   All topics are configured the same
     *   We have tried 3 different configurations to try to reproduce the DC Y problems
        *   50 Partitions with replication 2
        *   200 partitions with replication 2
        *   200 partitions with replication 1
     *   Each topic is consumed by at most 2 consumer groups
     *   Each consumer group has between 30 and 60 applications within it
        *   We've made sure to have the appropriate number of topic partitions given the number of consumers
  *   Most of the applications use the Kafka Streams library. We do have a number of classic Producer and Consumer applications deployed on either end of the pipeline.

DC Y (The Broken DC) -

  *   All Servers are CentOS 7 VMs hosted on Windows Hyper-V. We are using HPE 3PAR Storage.
  *   5 Kafka Servers
     *   8 cores at 2 GHz
     *   32 GB memory
     *   1 x 50GB SSD (used by OS)
     *   2 x 250GB SSD (used by Kafka only)
  *   3-5 Zookeeper Servers (We've tried with both 3 and 5 Zookeeper nodes. The problem persists with either configuration)
     *   4 cores at 2 GHz
     *   16 GB memory
     *   50GB SSD (used by OS)
     *   100GB SSD (used by Zookeeper)
  *   8 Unique Topics
     *   All topics are configured the same
     *   We have tried 2 different configurations
        *   200 partitions with replication 2
        *   200 partitions with replication 1
     *   Each topic is consumed by at most 2 consumer groups
     *   Each consumer group has between 30 and 60 applications within it
        *   We've made sure to have the appropriate number of topic partitions given the number of consumers
  *   Most of the applications use the Kafka Streams library. We do have a number of classic Producer and Consumer applications deployed on either end of the pipeline.

I've attached examples of our Zookeeper and Kafka configuration.

Thanks,
Calvin