Posted to jira@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/02 23:04:00 UTC

[jira] [Commented] (KAFKA-6529) Broker leaks memory and file descriptors after sudden client disconnects

    [ https://issues.apache.org/jira/browse/KAFKA-6529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351042#comment-16351042 ] 

ASF GitHub Bot commented on KAFKA-6529:
---------------------------------------

parafiend opened a new pull request #4517: KAFKA-6529: Stop file descriptor leak when client disconnects with staged receives
URL: https://github.com/apache/kafka/pull/4517
 
 
   If an exception is encountered while sending data to a client
   connection, that connection is disconnected. If there are staged
   receives for that connection, the connection is tracked so those
   receives can be processed. However, if the exception was encountered
   while processing a `RequestChannel.Request`, the `KafkaChannel` for
   that connection is muted and won't be processed.
   
   Add the channel to failed sends so the connection is cleaned up on those
   exceptions. This stops the leak of the memory for pending requests
   and the file descriptor of the TCP socket.
   
   Only flag the channel as a failed send when an exception is encountered
   while actually attempting to send something; other socket interactions
   don't count.
   
   Test that a channel is closed when an exception is raised while writing to
   a socket that has been closed by the client. Since sending a response 
   requires acks != 0, allow specifying the required acks for test requests
   in SocketServerTest.scala.
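   The cleanup path the PR describes can be sketched as follows. This is an
   illustrative simulation, not the real Kafka `Selector`/`KafkaChannel`
   code: the class and method names below are invented for the sketch, and
   a `RuntimeException` stands in for the `IOException` raised when the
   peer has closed the socket.

```java
import java.util.*;

// Minimal sketch (NOT the real Kafka classes) of the bookkeeping fix:
// when a send fails, record the connection id in failedSends so that
// poll() closes the channel even though it is muted and still has
// staged receives queued.
public class FailedSendSketch {
    static class Channel {
        final String id;
        boolean muted;
        final Deque<String> stagedReceives = new ArrayDeque<>();
        Channel(String id) { this.id = id; }
    }

    final Map<String, Channel> channels = new HashMap<>();
    final Set<String> failedSends = new HashSet<>();

    void sendResponse(Channel ch) {
        ch.muted = true;               // muted while a response is in flight
        try {
            write(ch);                 // pretend to write to the socket
        } catch (RuntimeException e) { // stand-in for java.io.IOException
            failedSends.add(ch.id);    // the fix: remember the failed send
        }
    }

    void write(Channel ch) {
        // The client has already disconnected, so the write fails.
        throw new RuntimeException("Connection reset by peer");
    }

    void poll() {
        // Without the failedSends bookkeeping, a muted channel with staged
        // receives would be skipped forever, leaking its socket fd and the
        // memory of its staged-receive queue.
        for (String id : failedSends)
            channels.remove(id);       // close: frees fd and staged receives
        failedSends.clear();
    }

    public static void main(String[] args) {
        FailedSendSketch selector = new FailedSendSketch();
        Channel ch = new Channel("conn-1");
        ch.stagedReceives.add("pending produce request");
        selector.channels.put(ch.id, ch);

        selector.sendResponse(ch);     // client already disconnected
        selector.poll();

        System.out.println(selector.channels.isEmpty()); // true: no leak
    }
}
```

   With the failed send recorded, the next `poll()` pass closes the
   connection instead of leaving it muted in limbo.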
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Broker leaks memory and file descriptors after sudden client disconnects
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-6529
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6529
>             Project: Kafka
>          Issue Type: Bug
>          Components: network
>    Affects Versions: 1.0.0, 0.11.0.2
>            Reporter: Graham Campbell
>            Priority: Major
>
> If a producer forcefully disconnects from a broker while it has staged receives, that connection enters a limbo state where it is no longer processed by the SocketServer.Processor, leaking the file descriptor for the socket and the memory used for the staged receive queue for that connection.
> We noticed this during an upgrade from 0.9.0.2 to 0.11.0.2. Immediately after the rolling restart to upgrade, open file descriptors on the brokers started climbing uncontrollably. In a few cases brokers reached our configured max open files limit of 100k and crashed before we rolled back.
> We tracked this down to a buildup of muted connections in the Selector.closingChannels list. If a client disconnects from the broker with multiple pending produce requests, then when the broker attempts to send an ack to the client it receives an IOException because the TCP socket has been closed. This triggers the Selector to close the channel, but because the channel still has pending requests, the Selector adds it to Selector.closingChannels to process those requests. However, because that exception was triggered by trying to send a response, the SocketServer.Processor has marked the channel as muted and will no longer process it at all.
> *Reproduced by:*
> Starting a Kafka broker/cluster
> Client produces several messages and then disconnects abruptly (e.g. _./rdkafka_performance -P -x 100 -b broker:9092 -t test_topic_)
> Broker then leaks file descriptor previously used for TCP socket and memory for unprocessed messages
> *Proposed solution (which we've implemented internally)*
> Whenever an exception is encountered while writing to a socket in Selector.pollSelectionKeys(...), record the failed send by adding the KafkaChannel ID to Selector.failedSends, then re-raise the exception to still trigger the socket disconnection logic. Since every exception raised in this function triggers a disconnect, we also treat any exception while writing to the socket as a failed send.
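The catch-record-rethrow pattern the proposed solution describes can be sketched like this. The names below are illustrative, not Kafka's actual internals; only Selector.failedSends and pollSelectionKeys(...) are taken from the report above.

```java
import java.io.IOException;
import java.util.*;

// Hedged sketch of the proposed pollSelectionKeys(...) change: any
// exception raised while writing is recorded as a failed send before
// being re-raised, so the existing disconnect path still runs AND the
// connection id is queued for full cleanup on the next poll.
public class PollSketch {
    final List<String> failedSends = new ArrayList<>();

    void pollSelectionKey(String connectionId, boolean writable) throws IOException {
        try {
            if (writable)
                attemptWrite(connectionId);
        } catch (IOException e) {
            // Every exception raised here triggers a disconnect, so any
            // write-time failure is treated as a failed send...
            failedSends.add(connectionId);
            throw e; // ...then re-raised to run the normal disconnect logic
        }
    }

    void attemptWrite(String connectionId) throws IOException {
        // The peer closed the TCP socket, so the write fails.
        throw new IOException("Broken pipe");
    }

    public static void main(String[] args) {
        PollSketch p = new PollSketch();
        try {
            p.pollSelectionKey("conn-42", true);
        } catch (IOException expected) {
            // the usual disconnect handling would run here
        }
        System.out.println(p.failedSends); // [conn-42]
    }
}
```

Because the id lands in failedSends even when the channel is muted, the connection is closed outright rather than parked in closingChannels where the muted flag would keep it from ever being drained.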



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)