You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/11/03 01:08:12 UTC

[jira] [Commented] (KAFKA-600) kafka should respond gracefully rather than crash when unable to write due to ENOSPC

    [ https://issues.apache.org/jira/browse/KAFKA-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489859#comment-13489859 ] 

Jay Kreps commented on KAFKA-600:
---------------------------------

This seems sensible but the problem is this. If you get ENOSPC, you likely accepted a partial write, which means the log itself is corrupt. To fix this we need to run recovery on the log (which checks each message in the last segment and truncates off any invalid partitial writes). This happens automatically when the server is restarted. Theoretically this could be done automatically but since the recovery process can be slow (minutes) and we can't accept writes during that time we felt the best course of action is to have the node shoot itself in the head and let other healthy nodes take over.
                
> kafka should respond gracefully rather than crash when unable to write due to ENOSPC
> ------------------------------------------------------------------------------------
>
>                 Key: KAFKA-600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-600
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Tim Stilwell
>
> problem:
> user starts kafka with log.dir value set to a small partition and begins writing data to the mq.  when the disk partition is full, kafka crashes.  given that this product is used for both reading and writing operations, crashing seems rather drastic even if the error message is helpful.   something more robust would be appreciated.  perhaps, logging an error and rejecting additional write requests while accepting additional read requests?  perhaps, sending an email alert to Operations?  at least shutdown gracefully so the user is aware that received messages were saved with a helpful message providing some details of the last message received.  when tens or hundreds of thousands of messages can be processed in a second, it isn't helpful to merely log a timestamp and crash.
> steps to reproduce:
> 1) download and install kafka
> 2) modify server.properties
>     # vi /opt/kafka-0.7.2-incubating-src/config/server.properties
>     set log.dir="/var/log/kafka"
> 3) modify log4j
>     # vi /opt/kafka-0.7.2-incubating-src/config/log4j.properties
>     set fileAppender.File=/var/log/kafka/kafka-request.log
> 4) start kafka service
>     $ sudo bash
>     # ulimit -c unlimited
>     # /opt/kafka-0.7.2-incubating-src/bin/kafka-server-start.sh /opt/kafka-0.7.2-incubating-src/config/server.properties &
> 6) begin writing data to hostname:9092
> 7) review /var/log/kafka-request.log
> results:
> $ grep log.dir /opt/kafka-0.7.2-incubating-src/config/server.properties
> log.dir=/var/log/kafka
> $ df -h /var/log/kafka
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       4.0G  4.0G     0 100% /
> $ tail /var/log/kafka/kafka-request.log
> 17627442 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper  - Begin registering broker topic /brokers/topics/raw/0 with 1 partitions
> 17627444 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper  - End registering broker topic /brokers/topics/raw/0
> 17627445 [ZkClient-EventThread-14-10.0.20.242:2181] INFO  kafka.server.KafkaZooKeeper  - done re-registering broker
> 18337676 [kafka-processor-3] ERROR kafka.network.Processor  - Closing socket for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18391974 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18422004 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18434563 [kafka-processor-6] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18485005 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18497083 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18525720 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18543843 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18563230 [kafka-processor-4] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18575613 [kafka-processor-5] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.138.
> 18677568 [kafka-processor-6] ERROR kafka.network.Processor  - Closing socket for /10.0.20.138 because of error
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:218)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:191)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:359)
>         at kafka.utils.Utils$.read(Utils.scala:538)
>         at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
>         at kafka.network.Processor.read(SocketServer.scala:311)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)
> 18828016 [kafka-processor-7] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.248.
> 18844274 [kafka-processor-0] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.248.
> 18849691 [kafka-processor-1] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.248.
> 18896883 [kafka-processor-2] INFO  kafka.network.Processor  - Closing socket connection to /10.0.20.248.
> 22383195 [kafka-processor-2] FATAL kafka.log.Log  - Halting due to unrecoverable I/O error while handling producer request
> java.io.IOException: No space left on device
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:59)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
>         at sun.nio.ch.IOUtil.write(IOUtil.java:60)
>         at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195)
>         at kafka.message.ByteBufferMessageSet.writeTo(ByteBufferMessageSet.scala:76)
>         at kafka.message.FileMessageSet.append(FileMessageSet.scala:159)
>         at kafka.log.LogSegment.append(Log.scala:105)
>         at kafka.log.Log.liftedTree1$1(Log.scala:246)
>         at kafka.log.Log.append(Log.scala:242)
>         at kafka.server.KafkaRequestHandlers.kafka$server$KafkaRequestHandlers$$handleProducerRequest(KafkaRequestHandlers.scala:69)
>         at kafka.server.KafkaRequestHandlers.handleProducerRequest(KafkaRequestHandlers.scala:53)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.server.KafkaRequestHandlers$$anonfun$handlerFor$1.apply(KafkaRequestHandlers.scala:38)
>         at kafka.network.Processor.handle(SocketServer.scala:296)
>         at kafka.network.Processor.read(SocketServer.scala:319)
>         at kafka.network.Processor.run(SocketServer.scala:214)
>         at java.lang.Thread.run(Thread.java:722)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira