Posted to users@kafka.apache.org by "M. Manna" <ma...@gmail.com> on 2018/05/11 17:24:25 UTC

LogCleaner thread failing after upgrading to 2.11-1.1.0

Hello,

Our cluster brokers are going down one by one during log cleanup. This
started after we completed a full upgrade from 2.10-0.10.2.1.

This is the log we receive:

[2018-05-11 17:12:21,652] WARN [ReplicaFetcher replicaId=1, leaderId=2,
fetcherId=0] Error in response for fetch request (type=FetchRequest,
replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760,
fetchData={__consumer_offsets-28=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-43=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-10=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-25=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-7=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-40=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-37=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-22=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-4=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-34=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-19=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-16=(offset=0, logStartOffset=0,
maxBytes=1048576), __consumer_offsets-49=(offset=0, logStartOffset=0,
maxBytes=1048576), supplier_db-0=(offset=39, logStartOffset=0,
maxBytes=1048576), __consumer_offset
=(offset=0, logStartOffset=0, maxBytes=1048576),
__consumer_offsets-31=(offset=500, logStartOffset=0, maxBytes=1048576),
__consumer_offsets-13=(offset=0, logStartOffset=0, maxBytes=1048576),
__consumer_offsets-46=(offset=0, logStartOffset=0, maxBytes=1048576)},
isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1746351516,
epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to eg2-pp-ifs-245:9092 (id: 2 rack: null)
failed.
        at
org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:91)
        at
kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:220)
        at
kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
        at
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:146)
        at
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
log4j:ERROR Failed to rename [/logs/log-cleaner.log] to
[/logs/log-cleaner.log.2018-05-11-16].
[2018-05-11 17:12:22,323] ERROR Failed to clean up log for
__consumer_offsets-31 in dir C:\kafka1 due to IOException
(kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException:
C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned ->
C:\kafka1\__consumer_offsets-31\00000000000000000000.log.swap: The process
cannot access the file because it is being used by another process.

        at
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
        at
sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
        at java.nio.file.Files.move(Files.java:1395)
        at
org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
        at
org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
        at kafka.log.Log.replaceSegments(Log.scala:1644)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at
kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.FileSystemException:
C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned ->
C:\kafka1\__consumer_offsets-31\00000000000000000000.log.swap: The process
cannot access the file because it is being used by another process.

                at
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
                at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
                at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
                at
sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
                at java.nio.file.Files.move(Files.java:1395)
                at
org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
                ... 12 more
[2018-05-11 17:12:22,339] INFO [ReplicaManager broker=1] Stopping serving
replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
[2018-05-11 17:12:22,370] INFO [ReplicaFetcherManager on broker 1] Removed
fetcher for partitions
__consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_
offsets-40 (kafka.server.ReplicaFetcherManager)
[2018-05-11 17:12:22,370] INFO [ReplicaAlterLogDirsManager on broker 1]
Removed fetcher for partitions
__consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__cons
umer_offsets-40 (kafka.server.ReplicaAlterLogDirsManager)
[2018-05-11 17:12:22,402] INFO [ReplicaManager broker=1] Broker 1 stopped
fetcher for partitions
__consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_o
ffsets-40 and stopped moving logs for partitions  because they are in the
failed log directory C:\kafka1. (kafka.server.ReplicaManager)
[2018-05-11 17:12:22,402] ERROR Failed to clean up log for
__consumer_offsets-31 in dir C:\kafka1 due to IOException
(kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException:
C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned: The
process cannot access the file because it is being used by another process.

        at
sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at
sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at
kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at
kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-11 17:12:22,402] INFO Stopping serving logs in dir C:\kafka1
(kafka.log.LogManager)
[2018-05-11 17:12:22,417] ERROR Shutdown broker because all log dirs in
C:\kafka1 have failed (kafka.log.LogManager)

The only alternative we see is to retain the logs and offsets indefinitely,
but that would fill up our server drives, which is not desirable.

I can see that there was a KIP that enforced atomic moves of log files to
avoid such issues on Windows. Here is the pull request:
https://github.com/apache/kafka/pull/771. It looks like it was an
improvement, but not necessarily a guaranteed fix for Windows.
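For context, the pattern behind the Utils.atomicMoveWithFallback frame in the
trace above is roughly the following. This is my own illustrative sketch (the
class and method names here are hypothetical), not Kafka's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicMoveSketch {

    // Try an atomic rename first; if the filesystem rejects it, fall back
    // to a plain replacing move and carry the first failure as suppressed.
    static void atomicMoveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException outer) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException inner) {
                inner.addSuppressed(outer);
                throw inner;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate the happy path on a temp directory.
        Path dir = Files.createTempDirectory("atomic-move-demo");
        Path src = Files.createFile(dir.resolve("00000000000000000000.log.cleaned"));
        Path dst = dir.resolve("00000000000000000000.log.swap");
        atomicMoveWithFallback(src, dst);
        System.out.println(Files.exists(dst) && !Files.exists(src));
    }
}
```

On Windows, both the atomic move and the fallback fail whenever another
process (or the broker's own memory-mapped handle) keeps the file open, which
would match the suppressed FileSystemException in our trace.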

Could someone please advise what to do here? It's not possible for us to
migrate to Linux at the moment, so any workaround would be highly appreciated.

Regards,