Posted to dev@kafka.apache.org by "Bhavesh Mistry (JIRA)" <ji...@apache.org> on 2014/12/02 21:05:17 UTC
[jira] [Comment Edited] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

    [ https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232061#comment-14232061 ] 

Bhavesh Mistry edited comment on KAFKA-1642 at 12/2/14 8:04 PM:
----------------------------------------------------------------

Hi  [~ewencp],

I will not have time to validate this patch till next week.  

Here are my comments:

1) The patch does not address the Producer.close() issue. If the network connection is lost or a similar failure occurs, the IO thread is not killed and close() hangs. In the patch I provided, I added a timeout to the join call and interrupted the IO thread. I think we need a similar solution here.
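
The bounded close described in point 1 could look roughly like the following. This is a minimal sketch, not the code from either patch: `TimedCloser` and `closeWithTimeout` are hypothetical names, and it assumes the IO thread reacts to interruption (a thread that ignores the interrupt will still be alive when the method returns).

```java
// Hypothetical helper for illustration; not part of the Kafka producer.
final class TimedCloser {
    /**
     * Waits up to timeoutMs for ioThread to exit; if it is still alive,
     * interrupts it and waits up to timeoutMs once more.
     * Returns true if the thread has terminated when this method returns.
     */
    static boolean closeWithTimeout(Thread ioThread, long timeoutMs) {
        if (timeoutMs <= 0) {
            // join(0) means "wait forever", which is exactly the hang to avoid.
            throw new IllegalArgumentException("timeoutMs must be positive");
        }
        try {
            ioThread.join(timeoutMs);
            if (ioThread.isAlive()) {
                ioThread.interrupt();      // wakes the thread from sleep or a blocking select
                ioThread.join(timeoutMs);  // second bounded wait after the interrupt
            }
        } catch (InterruptedException e) {
            // The caller itself was interrupted; preserve that status and report the current state.
            Thread.currentThread().interrupt();
        }
        return !ioThread.isAlive();
    }
}
```

The caller can log or escalate when the method returns false instead of blocking indefinitely.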

2) Also, can we please add JMX monitoring for the IO thread so we know how fast it is running? It would be great to add this and have the run() method report its duration to a metric in nanoseconds.
{code}
            try {
                ThreadMXBean bean = ManagementFactory.getThreadMXBean();
                if (bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()) {
                    this.ioTheadCPUTime = metrics.sensor("iothread-cpu");
                    this.ioTheadCPUTime.add("iothread-cpu-ms", "The rate of CPU time used by the IO thread, in nanoseconds", new Rate(TimeUnit.NANOSECONDS) {
                        public double measure(MetricConfig config, long now) {
                            // Placeholder: elapsed time since the last metadata update, not actual thread CPU time.
                            return (now - metadata.lastUpdate()) / 1000.0;
                        }
                    });
                }
            } catch (Throwable th) {
                log.warn("Not able to set up the IO thread CPU time metric", th);
            }
{code}
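
For reference, actual per-thread CPU time can be read through the standard `ThreadMXBean.getThreadCpuTime(long)` call. The sketch below is an illustration under assumptions: `ThreadCpuProbe` is a hypothetical class, and wiring its value into a Kafka metrics sensor is left out.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for illustration; not part of Kafka's metrics package.
final class ThreadCpuProbe {
    private final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    private final long threadId;
    private long lastCpuNanos;

    ThreadCpuProbe(Thread thread) {
        this.threadId = thread.getId();
        this.lastCpuNanos = read();
    }

    /** Returns CPU nanoseconds used by the thread since the previous call, or -1 if unavailable. */
    long deltaNanos() {
        long now = read();
        long previous = lastCpuNanos;
        lastCpuNanos = now;
        if (now < 0 || previous < 0) {
            return -1;
        }
        return now - previous;
    }

    private long read() {
        if (!bean.isThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return -1;
        }
        // getThreadCpuTime returns -1 when the thread is not alive.
        return bean.getThreadCpuTime(threadId);
    }
}
```

Sampling `deltaNanos()` once per run() iteration would give the CPU cost of that iteration, which is the number the proposed metric is meant to expose.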

3)  Please check the final timeout value in *pollTimeout*. If it is constantly zero, we need to slow the IO thread down.

4)  A defensive back-off check is needed in the run() method for when the IO thread is spinning aggressively.

{code}

        while (running) {
            long start = time.milliseconds();
            try {
                run(time.milliseconds());
            } catch (Exception e) {
                log.error("Uncaught error in kafka producer I/O thread: ", e);
            } finally {
                long durationInMs = time.milliseconds() - start;
                // TODO: use exponential back-off here so this edge case does not burn CPU cycles; how much is still open.
                if (durationInMs < 200) {
                    if (client.isAllRegistredNodesAreDown()) {
                        countinuousRetry++;
                        // TODO: make this a configurable constant; also decide when to reset the interval so we can retry aggressively again.
                        sleepInMs = ((long) Math.pow(2, countinuousRetry) * 500);
                    } else {
                        sleepInMs = 500;
                        countinuousRetry = 0;
                    }

                    // Pause before the next iteration so a fast-failing run() does not spin.
                    try {
                        // TODO: sleep is not a good solution.
                        Thread.sleep(sleepInMs);
                    } catch (InterruptedException e) {
                        log.error("Interrupted while sleeping, probably by the producer's close() method");
                    }
                }
            }
        }
{code}
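
The delay computed above doubles without bound while all nodes stay down. One way to address the TODOs is a capped back-off that resets as soon as any node is reachable. This is a sketch under assumptions: the `Backoff` class, its method name, and the 500 ms / 30 s values used below are illustrative, not taken from any patch.

```java
// Hypothetical helper for illustration; the base and cap would come from configuration.
final class Backoff {
    private final long baseMs;
    private final long maxMs;
    private int failures;

    Backoff(long baseMs, long maxMs) {
        if (baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("need 0 < baseMs <= maxMs");
        }
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    /** Returns the next sleep in ms: base * 2^failures while all nodes are down, capped at maxMs. */
    long nextDelayMs(boolean allNodesDown) {
        if (!allNodesDown) {
            failures = 0;              // reset so the next outage starts from the base again
            return baseMs;
        }
        failures = Math.min(failures + 1, 30);   // bound the shift to limit overflow
        long delay = baseMs << failures;
        if (delay <= 0 || delay > maxMs) {       // overflow or past the cap
            delay = maxMs;
        }
        return delay;
    }
}
```

With `new Backoff(500, 30_000)`, consecutive all-nodes-down iterations sleep 1000, 2000, 4000, 8000, 16000 ms and then stay at 30000 ms until a node comes back.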

5)  When all nodes are disconnected, do you still want to spin the IO thread?

6)  Suppose a firewall rule says "you can only have 2 concurrent TCP connections from client to brokers", and the client still has a live TCP connection to a node (broker) while new TCP connections to that same node are rejected. Will the node state be marked as Disconnected in initiateConnect? Is this case handled gracefully?

By the way, thank you very much for the quick reply and the new patch.  I appreciate your help.

Thanks,

Bhavesh 


was (Author: bmis13):
Hi  [~ewencp],

I will not have time to validate this patch till next week.  

Here are my comments:

1) You still have not addressed the Producer.close() issue: if the network connection is lost or a similar failure occurs, the IO thread is not killed and close() hangs. In the patch I provided, I added a timeout to the join call and interrupted the IO thread.  I think we need something similar here.
2) Also, can we please add JMX monitoring for the IO thread so we know how fast it is running? It would be great to add this and have the run() method report its duration to a metric.
{code}
            try{
            	ThreadMXBean bean = ManagementFactory.getThreadMXBean( );
            	if(bean.isThreadCpuTimeSupported() && bean.isThreadCpuTimeEnabled()){
            		this.ioTheadCPUTime = metrics.sensor("iothread-cpu");
                    this.ioTheadCPUTime.add("iothread-cpu-ms", "The Rate Of CPU Cycle used by iothead in NANOSECONDS", new Rate(TimeUnit.NANOSECONDS) {
                        public double measure(MetricConfig config, long now) {
                            return (now - metadata.lastUpdate()) / 1000.0;
                        }
                    });	            		
            	}
            }catch(Throwable th){
            	log.warn("Not able to set the CPU time... etc");
            }
{code}

3)  Please check the final timeout value in *pollTimeout*. If it is constantly zero, we need to slow the IO thread down.
4)  A defensive back-off check in the run() method for when the IO thread is aggressive:

5)  When all nodes are disconnected, do you still want to spin the IO Thread ?

6)  Suppose a firewall rule says "you can only have 2 concurrent TCP connections from client to brokers", and the client still has a live TCP connection to a node (broker) while a new TCP connection to that same node is rejected. Will the node state be marked as Disconnected in initiateConnect? Are you handling that gracefully?

By the way, thank you very much for the quick reply and the new patch.  I appreciate your help.

Thanks,

Bhavesh 

> [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1642
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1642
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 0.8.2
>            Reporter: Bhavesh Mistry
>            Assignee: Ewen Cheslack-Postava
>            Priority: Blocker
>             Fix For: 0.8.2
>
>         Attachments: 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, KAFKA-1642_2014-10-23_16:19:41.patch
>
>
> I see my CPU spike to 100% when network connection is lost for while.  It seems network  IO thread are very busy logging following error message.  Is this expected behavior ?
> 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread: 
> java.lang.IllegalStateException: No entry found for node -2
> at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
> at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
> at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
> at org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
> at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
> at java.lang.Thread.run(Thread.java:744)
> Thanks,
> Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)