Posted to users@kafka.apache.org by Tom Dearman <to...@gmail.com> on 2017/06/27 09:17:43 UTC

kafka producer latency

Hi,
I have a problem with the latency of my kafka producer under some circumstances.  We are running three kafka brokers on version 0.10.2.0 and 3 zookeepers on version 3.4.8.  The server properties are below; the main producer property that we change is acks=all, and since we set min.insync.replicas=2 at least 2 replicas must acknowledge our producer requests.  It all runs on our own servers, but in an OpenShift environment.  The zookeeper pods write to local storage, but the kafka broker pods write to Ceph storage in such a way that a kafka broker's data is kept and re-assigned to the same broker on restart.  I am including a link to kafka producer metrics that highlights the problem (the link is only valid for the next 7 days):
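For reference, the producer side boils down to something like the sketch below (assuming the Java client; the bootstrap list, topic name and serializers are placeholders rather than our exact values):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksAllProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap list; our brokers are the kafka-* pods on 9092.
        props.put("bootstrap.servers", "kafka-0:9092,kafka-1:9092,kafka-2:9092");
        // The one producer setting we change: wait for all in-sync replicas.
        // Combined with min.insync.replicas=2 on the brokers, at least 2 must ack.
        props.put("acks", "all");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "test-topic" is a placeholder topic name.
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}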


https://snapshot.raintank.io/dashboard/snapshot/fjfBMC09aBQcWWzj54uCunqMYhNt4ggO

This link has quite a lot of metrics, but the top two panels are about latency: request latency and queue time (I assume the request latency does not include the time spent in the queue).
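Those two numbers come straight from the producer's own metrics; a rough sketch of how we pull them out, assuming the Java client (they live in the producer-metrics group):

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class LatencyMetricsDump {
    // Prints the request-latency and record-queue-time metrics the two top panels show.
    public static void dump(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if ("producer-metrics".equals(name.group())
                    && (name.name().startsWith("request-latency")
                        || name.name().startsWith("record-queue-time"))) {
                // request-latency-avg/-max: time a produce request is in flight (ms).
                // record-queue-time-avg/-max: time a batch sits in the send queue (ms).
                System.out.println(name.name() + " = " + entry.getValue().value());
            }
        }
    }
}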
@09:29:20 a kafka pod was restarted; it was the one acting as the overall zookeeper leader elector.  This caused very large latency for our messages: the average is high, but we are particularly interested in the max latency.  There was also very high queue time, which is just as important to us.
@09:31:00 I had to restart our test client (which generates the load sent to the producer), as all 14 threads had stopped after waiting more than 5 seconds for a producer send (roughly as in the sketch after this timeline).
@09:34:40 I ran a manual rebalance; this hardly caused a blip in the latency.
@09:38:20 a kafka pod was restarted, but this time not the one acting as the overall zookeeper leader elector.  This still caused large request latency and queue time.
@09:40:30 I ran another manual rebalance; again it hardly caused a blip.
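For context, each of the 14 test-client threads waits on its sends roughly like this (a sketch only; the topic name is a placeholder and the real client is more involved, but the 5-second bound is the one mentioned at 09:31:00):

import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TimedSendWorker implements Runnable {
    private final KafkaProducer<String, String> producer;

    public TimedSendWorker(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Future<RecordMetadata> ack =
                        producer.send(new ProducerRecord<>("test-topic", "key", "value"));
                // With acks=all this completes once the in-sync replicas have the record.
                // If the end-to-end latency exceeds 5 seconds the thread gives up,
                // which is why all 14 threads had stopped by 09:31:00.
                ack.get(5, TimeUnit.SECONDS);
            }
        } catch (TimeoutException e) {
            // Send took longer than 5 seconds; stop this worker.
        } catch (Exception e) {
            // Interrupted or the send itself failed; stop this worker.
        }
    }
}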


What I find strange about this is that the manual rebalance itself seems fine.  With a controlled shutdown, the broker is supposed to move leadership off itself before shutting down, so I would have thought everything would already be off the closing broker and the latency of a controlled shutdown would be no worse than when I do a manual rebalance.


Please can someone help.

Tom


Our server.properties is:

broker.id=-1
listeners=PLAINTEXT://:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/mnt/data/logs
num.partitions=20
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181
zookeeper.connection.timeout.ms=6000
advertised.listeners=PLAINTEXT://kafka-0:9092
default.replication.factor=3
compression.type=gzip
delete.topic.enable=true
offsets.retention.minutes=10080
unclean.leader.election.enable=false
min.insync.replicas=2
auto.leader.rebalance.enable=false
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10

inter.broker.protocol.version=0.10.2

log.message.format.version=0.10.2

Re: kafka producer latency

Posted by Tom Dearman <to...@gmail.com>.
I have now moved my brokers to local disks (though not dedicated disks).  Running the tests again, although there is a significant improvement in max latency during steady-state operation, I still get 10-15 second max latencies when I take down the kafka node that has the zookeeper leadership.  I noticed that when this happens the producer has to refresh its metadata twice in quick succession, and looking at the metrics link this happened there too: at 09:29:20 there is a drop in the metadata age, indicating that a partition leadership change must have happened, but then another one happens at 09:29:40.  Why would the reassignment happen twice?  It looks like this is what is causing the problem.
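To watch for the double refresh I am reading the producer's metadata-age metric; a minimal sketch, assuming the Java client (routine refreshes are governed by the producer's metadata.max.age.ms, which defaults to 5 minutes, so two drops only 20 seconds apart presumably come from leadership changes rather than the periodic refresh):

import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class MetadataAgeWatcher {
    // Returns the age, in seconds, of the metadata the producer is currently using.
    // A drop back towards 0 means the producer has just refreshed its metadata,
    // e.g. after a partition leadership change; two drops in quick succession is
    // the double refresh described above.
    public static double metadataAgeSeconds(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            MetricName name = entry.getKey();
            if ("producer-metrics".equals(name.group()) && "metadata-age".equals(name.name())) {
                return entry.getValue().value();
            }
        }
        return Double.NaN; // metric not found
    }
}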

Tom