Posted to users@kafka.apache.org by "M. Manna" <ma...@gmail.com> on 2018/05/12 20:42:50 UTC

Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Hello,

We are still stuck on this issue: the 2.11-1.1.0 distro fails to clean up
logs on Windows and brings the entire cluster down, one broker at a time.
Extending the retention hours and sizes doesn't help; it only burdens the
hard drive.
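
For reference, these are the sort of retention knobs we have been raising in
server.properties (the values here are illustrative, not our exact production
numbers):

    # time-based retention
    log.retention.hours=168
    # size-based retention, per partition
    log.retention.bytes=1073741824
    # segment size; smaller segments roll, and hence become deletable, sooner
    log.segment.bytes=268435456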

Here is the log:

> [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1] Rolled new log segment at offset 45 in 105 ms. (kafka.log.Log)
> [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1] Scheduling log segment [baseOffset 0, size 2290] for deletion. (kafka.log.Log)
> [2018-05-12 21:36:57,673] ERROR Error while deleting segments for test-0 in dir C:\kafka1 (kafka.server.LogDirFailureChannel)
> java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
>
>         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
>         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
>         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
>         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
>         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
>         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
>         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
>         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
>         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
>         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>         at kafka.log.Log.deleteSegments(Log.scala:1161)
>         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
>         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
>         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
>         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
>         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>         at scala.collection.immutable.List.foreach(List.scala:392)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
>         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
>         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>         Suppressed: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
>
>                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
>                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
>                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
>                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
>                 at java.nio.file.Files.move(Files.java:1395)
>                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
>                 ... 32 more
> [2018-05-12 21:36:57,689] INFO [ReplicaManager broker=1] Stopping serving replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
> [2018-05-12 21:36:57,689] ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler)
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for test-0 in dir C:\kafka1
> Caused by: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
>
>         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
>         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
>         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
>         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
>         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
>         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
>         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
>         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
>         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
>         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
>         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
>         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
>         at kafka.log.Log.deleteSegments(Log.scala:1161)
>         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
>         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
>         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
>         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
>         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>         at scala.collection.immutable.List.foreach(List.scala:392)
>         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
>         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
>         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>         Suppressed: java.nio.file.FileSystemException: C:\kafka1\test-0\00000000000000000000.log -> C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
>
>                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
>                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
>                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
>                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
>                 at java.nio.file.Files.move(Files.java:1395)
>                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
>                 ... 32 more
> [2018-05-12 21:36:57,689] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions
> __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> (kafka.server.ReplicaFetcherManager)
> [2018-05-12 21:36:57,689] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions
> __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> (kafka.server.ReplicaAlterLogDirsManager)
> [2018-05-12 21:36:57,751] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions
> __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> and stopped moving logs for partitions  because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
> [2018-05-12 21:36:57,751] INFO Stopping serving logs in dir C:\kafka1 (kafka.log.LogManager)
> [2018-05-12 21:36:57,767] ERROR Shutdown broker because all log dirs in C:\kafka1 have failed (kafka.log.LogManager)
>

Could someone please share some ideas on how to rectify this on Windows? If
Windows will never be supported, could we get some official communication to
that effect?

Regards,

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by "M. Manna" <ma...@gmail.com>.
Hey Ted,

I spent some time looking at the code and found the following (it seems to be
expected behaviour, but please correct me if not):

LogManager sets up several scheduled jobs (e.g. log cleaning, flushing, offset
checkpointing) during startup, some of which keep the log segment files open.

However, when cleanupLogs() is called and eventually hits
changeFileSuffixes(), there is no guarantee that the file is actually
closed. This can happen at startup as well as at run time.
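
As a minimal reproduction of what I believe is going on (my own standalone
test, not Kafka code), simply holding a file open through a RandomAccessFile
channel, which as far as I can tell is how FileRecords holds segment files, is
enough to make the rename fail on my Windows machine, while the same program
runs fine on Linux:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class RenameWhileOpen {
        public static void main(String[] args) throws IOException {
            Path log = Files.createTempFile("00000000000000000000", ".log");
            // Hold the file open the way a RandomAccessFile channel does; on
            // Windows this handle is opened without FILE_SHARE_DELETE.
            try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "rw")) {
                // On my Windows machine this throws java.nio.file.FileSystemException
                // ("...being used by another process"); on Linux it succeeds.
                Files.move(log, log.resolveSibling(log.getFileName() + ".deleted"),
                        StandardCopyOption.ATOMIC_MOVE);
            }
        }
    }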

Attached are the logs captured after I added some stack-trace capture code to
LogSegment (when open() is called). I apologise for fiddling with the release
code, but it was necessary.
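
For reference, the capture code was along these lines (reconstructed from
memory, so the names are approximate rather than the exact patch):

    import java.util.concurrent.ConcurrentHashMap;

    // Record a stack trace whenever a segment file is opened, so that when a
    // rename later fails we can print who opened the file and from where.
    final class OpenTracker {
        private static final ConcurrentHashMap<String, Exception> OPENERS =
                new ConcurrentHashMap<>();

        static void recordOpen(String segmentFile) {
            OPENERS.put(segmentFile,
                    new Exception("opened by " + Thread.currentThread().getName()));
        }

        static void dumpOpener(String segmentFile) {
            Exception opener = OPENERS.get(segmentFile);
            if (opener != null) opener.printStackTrace();
        }
    }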

After having a look at the logs, if I understand correctly, the following
should happen on both Windows and Linux (a rough sketch follows after the
list):

1) If a file is being moved/renamed, close it safely first. Possibly using
CoreUtils#swallow()?
2) Do the same for the memory maps, using CoreUtils#swallow() and a null
check?
3) Then perform the atomic move or replace operation.
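
To make that concrete, here is the rough shape of what I mean (a hypothetical
helper, not Kafka's actual API; the unmap step is the awkward part, since
before Java 9 there is no supported way to release a MappedByteBuffer
eagerly):

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    final class SafeRename {
        // 1) close the channel safely, swallowing errors the way
        //    CoreUtils#swallow does; 2) release any index mappings (not shown;
        //    needs a sun.misc.Cleaner workaround pre-Java 9); 3) atomic move.
        static void closeThenMove(FileChannel channel, Path source, Path target)
                throws IOException {
            try {
                if (channel != null && channel.isOpen())
                    channel.close();
            } catch (IOException swallowed) {
                // log and continue; the move below is what we must attempt
            }
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        }
    }

The ordering matters: nothing may touch the channel between the close and the
move, which is exactly the race I suspect we have today.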


Please let me know what you think.

Regards,

On 14 May 2018 at 18:18, Ted Yu <yu...@gmail.com> wrote:

> I found the source code for the Windows-related JDK:
>
> https://github.com/frohoff/jdk8u-jdk/blob/master/src/windows/classes/sun/nio/fs/WindowsFileSystemProvider.java
>
> It has been about 11 years since I last worked on Windows code :-)
>
> I think the only way to gain more information is to modify the Kafka source
> code.
> In Utils#atomicMoveWithFallback :
>
>             } catch (IOException inner) {
>                 inner.addSuppressed(outer);
>                 throw inner;
>             }
> We throw the exception out. Maybe prior to that, we can add some code to
> figure out which other thread is holding onto the file.
>
> FYI
>
> On Mon, May 14, 2018 at 9:23 AM, M. Manna <ma...@gmail.com> wrote:
>
> > Hey Ted,
> >
> > Thanks again for reaching out. Yes, the screengrab seems to have got lost.
> > Once the cluster was down because of the file-handling issues, I did a full start, and here is the log:
> >
> > [2018-05-14 16:35:28,580] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
> > [2018-05-14 16:35:28,814] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> >         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >         at java.nio.file.Files.move(Files.java:1395)
> >         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
> >         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
> >         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
> >         at kafka.log.Log.replaceSegments(Log.scala:1644)
> >         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
> >         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> >         Suppressed: java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
> >
> >                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> >                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >                 at java.nio.file.Files.move(Files.java:1395)
> >                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
> >                 ... 12 more
> > [2018-05-14 16:35:28,939] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
> > [2018-05-14 16:35:28,986] ERROR Error while accepting connection (kafka.network.Acceptor)
> > java.lang.ArithmeticException: / by zero
> >         at kafka.network.Acceptor.run(SocketServer.scala:354)
> >         at java.lang.Thread.run(Thread.java:748)
> > [2018-05-14 16:35:29,002] ERROR Error while accepting connection (kafka.network.Acceptor)
> > java.lang.ArithmeticException: / by zero
> >         at kafka.network.Acceptor.run(SocketServer.scala:354)
> >         at java.lang.Thread.run(Thread.java:748)
> > [2018-05-14 16:35:29,017] ERROR Error while accepting connection (kafka.network.Acceptor)
> > java.lang.ArithmeticException: / by zero
> >         at kafka.network.Acceptor.run(SocketServer.scala:354)
> >         at java.lang.Thread.run(Thread.java:748)
> > [2018-05-14 16:35:28,986] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
> >         at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
> >         at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
> >         at java.nio.file.Files.deleteIfExists(Files.java:1165)
> >         at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
> >         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
> >         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> > [2018-05-14 16:35:29,064] INFO [SocketServer brokerId=1] Started 1 acceptor threads (kafka.network.SocketServer)
> > [2018-05-14 16:35:29,017] ERROR Error while accepting connection (kafka.network.Acceptor)
> > java.lang.ArithmeticException: / by zero
> >         at kafka.network.Acceptor.run(SocketServer.scala:354)
> >         at java.lang.Thread.run(Thread.java:748)
> > [2018-05-14 16:35:29,189] INFO [ExpirationReaper-1-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> > [2018-05-14 16:35:29,205] INFO [ExpirationReaper-1-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> > [2018-05-14 16:35:29,220] INFO [ExpirationReaper-1-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> > [2018-05-14 16:35:29,205] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
> >         at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
> >         at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
> >         at java.nio.file.Files.deleteIfExists(Files.java:1165)
> >         at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
> >         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
> >         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> > [2018-05-14 16:35:29,220] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
> > [2018-05-14 16:35:29,330] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.zk.KafkaZkClient)
> > [2018-05-14 16:35:29,330] INFO [ReplicaManager broker=1] Stopping serving replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
> > [2018-05-14 16:35:29,345] INFO Result of znode creation at /brokers/ids/1 is: OK (kafka.zk.KafkaZkClient)
> > [2018-05-14 16:35:29,345] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
> >         at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
> >         at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
> >         at java.nio.file.Files.deleteIfExists(Files.java:1165)
> >         at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
> >         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
> >         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> > [2018-05-14 16:35:29,345] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions  (kafka.server.ReplicaFetcherManager)
> > [2018-05-14 16:35:29,361] INFO Registered broker 1 at path /brokers/ids/1 with addresses: ArrayBuffer(EndPoint(localhost,9092,ListenerName(PLAINTEXT),PLAINTEXT)) (kafka.zk.KafkaZkClient)
> > [2018-05-14 16:35:29,392] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions  (kafka.server.ReplicaAlterLogDirsManager)
> > [2018-05-14 16:35:29,424] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions  and stopped moving logs for partitions  because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
> > [2018-05-14 16:35:29,424] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
> >         at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
> >         at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
> >         at java.nio.file.Files.deleteIfExists(Files.java:1165)
> >         at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
> >         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
> >         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
> >         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
> >         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
> >         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
> >         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> > [2018-05-14 16:35:29,439] INFO Stopping serving logs in dir C:\kafka1 (kafka.log.LogManager)
> > [2018-05-14 16:35:29,517] INFO [ExpirationReaper-1-topic]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> > [2018-05-14 16:35:29,517] ERROR Shutdown broker because all log dirs in C:\kafka1 have failed (kafka.log.LogManager)
> >
> > Attached is the full log from the kafka-server-start.bat run. Meanwhile, I am experimenting with whether closing the log and txnIndex (if not already closed) and unmapping the other segments before renaming helps. The problem is that if .append() is clashing, this approach might not help (potentially a ClosedChannelException if the fileChannel is closed twice).
> > Once again, thanks a lot for helping out on this.
> >
> > Regards,
> >
> >
> > On 14 May 2018 at 16:55, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Looks like you were trying to show some screen which didn't go thru.
> >>
> >> Examining LogSegment.scala, I wonder if the time index was being appended to (which caused the deletion to fail):
> >>
> >>       // append an entry to the index (if needed)
> >>       if(bytesSinceLastIndexEntry > indexIntervalBytes) {
> >>         offsetIndex.append(largestOffset, physicalPosition)
> >>         timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)
> >>
> >> Can you capture the stack trace and pastebin it?
> >>
> >> If LogSegment shows up in the stack trace, we may have some idea of the root cause.
> >>
> >> Thanks
> >>
> >> On Mon, May 14, 2018 at 8:31 AM, M. Manna <ma...@gmail.com> wrote:
> >>
> >> > I am having some difficulties debugging the cluster. In the Kafka CWIKI I can see a setup guide for Eclipse, but nothing as such for debugger setup. The issue is that every time the log cleaner thread runs, it causes a FATAL shutdown.
> >> > I tried to close all producers and consumers, but it still locks the log and transaction index segments. Forcefully unmapping the offset and time offset files works normally, but the log and transaction index files don't work properly.
> >> >
> >> > Could anyone please advise how to set up Kafka debugging in Eclipse? I have added the sdb jar noted on the scala-debugger.org site, but I couldn't connect to port 5005 at all.
> >> >
> >> > On 13 May 2018 at 10:00, M. Manna <ma...@gmail.com> wrote:
> >> >
> >> >> Hi Ted,
> >> >>
> >> >> I highly appreciate the response over the weekend, and thanks for pointing out the JIRAs.
> >> >>
> >> >> I don't believe the processes are responsible, but rather individual threads which are still holding the log/index files via IO streams. I am trying to walk a single-node setup through the debugger to find out which thread is locking the file. Apologies, but it's a huge application, so it might take me some time to get around it :)
> >> >>
> >> >> Please do update if you find something new.
> >> >>
> >> >> Regards,
> >> >>
> >> >> On 12 May 2018 at 22:15, Ted Yu <yu...@gmail.com> wrote:
> >> >>
> >> >>> There are two outstanding issues, KAFKA-6059 and KAFKA-6200, which bear some resemblance.
> >> >>>
> >> >>> Can you try to find out how the other process uses the file being deleted?
> >> >>>
> >> >>> https://superuser.com/questions/117902/find-out-which-process-is-locking-a-file-or-folder-in-windows
> >> >>> https://www.computerhope.com/issues/ch000714.htm
> >> >>>
> >> >>> Cheers
> >> >>>

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by Ted Yu <yu...@gmail.com>.
I found the source code for Windows related jdk:

https://github.com/frohoff/jdk8u-jdk/blob/master/src/windows/classes/sun/nio/fs/WindowsFileSystemProvider.java

It has been about 11 years since I last worked on Windows code :-)

I think the only way to gain more information is to modify Kafka source
code.
In Utils#atomicMoveWithFallback :

            } catch (IOException inner) {
                inner.addSuppressed(outer);
                throw inner;
            }
We throw the exception out. Maybe prior to that, we can add some code to
figure who else (thread) is holding onto the file.

FYI

On Mon, May 14, 2018 at 9:23 AM, M. Manna <ma...@gmail.com> wrote:

> Hey Ted,
>
> Thanks again for reaching out. Yes the screengrab seems to have got lost.
> Once the cluster is down based because of file handling related issues, I
> did a full start, and here is the log:
>
>
> [2018-05-14 16:35:28,580] INFO Starting log flusher with a default period
> of 9223372036854775807 ms. (kafka.log.LogManager)
> [2018-05-14 16:35:28,814] ERROR Failed to clean up log for
> __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.
> LogDirFailureChannel)
> java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-
> 21\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-
> 21\00000000000000000000.log.swap: The process cannot access the file
> because it is being used by another process.
>
>         at sun.nio.fs.WindowsException.translateToIOException(
> WindowsException.java:86)
>         at sun.nio.fs.WindowsException.rethrowAsIOException(
> WindowsException.java:97)
>         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
>         at sun.nio.fs.WindowsFileSystemProvider.move(
> WindowsFileSystemProvider.java:287)
>         at java.nio.file.Files.move(Files.java:1395)
>         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(
> Utils.java:697)
>         at org.apache.kafka.common.record.FileRecords.renameTo(
> FileRecords.java:212)
>         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
>         at kafka.log.Log.replaceSegments(Log.scala:1644)
>         at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
>         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.
> scala:462)
>         at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.
> scala:461)
>         at scala.collection.immutable.List.foreach(List.scala:392)
>         at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
>         at kafka.log.Cleaner.clean(LogCleaner.scala:438)
>         at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(
> LogCleaner.scala:305)
>         at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>         Suppressed: java.nio.file.FileSystemException:
> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned ->
> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.swap: The
> process cannot access the file because it is being used by another process.
>
>                 at sun.nio.fs.WindowsException.translateToIOException(
> WindowsException.java:86)
>                 at sun.nio.fs.WindowsException.rethrowAsIOException(
> WindowsException.java:97)
>                 at sun.nio.fs.WindowsFileCopy.
> move(WindowsFileCopy.java:301)
>                 at sun.nio.fs.WindowsFileSystemProvider.move(
> WindowsFileSystemProvider.java:287)
>                 at java.nio.file.Files.move(Files.java:1395)
>                 at org.apache.kafka.common.utils.
> Utils.atomicMoveWithFallback(Utils.java:694)
>                 ... 12 more

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by "M. Manna" <ma...@gmail.com>.
Hey Ted,

Thanks again for reaching out. Yes, the screengrab seems to have got lost.
Once the cluster went down because of the file-handling issues, I did a
full restart, and here is the log:


[2018-05-14 16:35:28,580] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2018-05-14 16:35:28,814] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.

        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
        at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
        at java.nio.file.Files.move(Files.java:1395)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
        at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
        at kafka.log.Log.replaceSegments(Log.scala:1644)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-21\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.

                at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
                at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
                at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
                at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
                at java.nio.file.Files.move(Files.java:1395)
                at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
                ... 12 more
[2018-05-14 16:35:28,939] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
[2018-05-14 16:35:28,986] ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero
        at kafka.network.Acceptor.run(SocketServer.scala:354)
        at java.lang.Thread.run(Thread.java:748)
[2018-05-14 16:35:29,002] ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero
        at kafka.network.Acceptor.run(SocketServer.scala:354)
        at java.lang.Thread.run(Thread.java:748)
[2018-05-14 16:35:29,017] ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero
        at kafka.network.Acceptor.run(SocketServer.scala:354)
        at java.lang.Thread.run(Thread.java:748)
[2018-05-14 16:35:28,986] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.

        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-14 16:35:29,064] INFO [SocketServer brokerId=1] Started 1 acceptor threads (kafka.network.SocketServer)
[2018-05-14 16:35:29,017] ERROR Error while accepting connection (kafka.network.Acceptor)
java.lang.ArithmeticException: / by zero
        at kafka.network.Acceptor.run(SocketServer.scala:354)
        at java.lang.Thread.run(Thread.java:748)
[2018-05-14 16:35:29,189] INFO [ExpirationReaper-1-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2018-05-14 16:35:29,205] INFO [ExpirationReaper-1-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2018-05-14 16:35:29,220] INFO [ExpirationReaper-1-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2018-05-14 16:35:29,205] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.

        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-14 16:35:29,220] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2018-05-14 16:35:29,330] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.zk.KafkaZkClient)
[2018-05-14 16:35:29,330] INFO [ReplicaManager broker=1] Stopping serving replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
[2018-05-14 16:35:29,345] INFO Result of znode creation at /brokers/ids/1 is: OK (kafka.zk.KafkaZkClient)
[2018-05-14 16:35:29,345] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.

        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-14 16:35:29,345] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions  (kafka.server.ReplicaFetcherManager)
[2018-05-14 16:35:29,361] INFO Registered broker 1 at path /brokers/ids/1 with addresses: ArrayBuffer(EndPoint(localhost,9092,ListenerName(PLAINTEXT),PLAINTEXT)) (kafka.zk.KafkaZkClient)
[2018-05-14 16:35:29,392] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions  (kafka.server.ReplicaAlterLogDirsManager)
[2018-05-14 16:35:29,424] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions  and stopped moving logs for partitions  because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
[2018-05-14 16:35:29,424] ERROR Failed to clean up log for __consumer_offsets-21 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-21\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.

        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-14 16:35:29,439] INFO Stopping serving logs in dir C:\kafka1 (kafka.log.LogManager)
[2018-05-14 16:35:29,517] INFO [ExpirationReaper-1-topic]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2018-05-14 16:35:29,517] ERROR Shutdown broker because all log dirs in C:\kafka1 have failed (kafka.log.LogManager)
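
Every one of those traces bottoms out in Utils.atomicMoveWithFallback or
Files.deleteIfExists, which fits the usual Windows story: a file that still
has a live memory mapping (the way Kafka maps its index files) cannot be
renamed or deleted, because the mapping keeps the file pinned at the OS
level. As a sanity check, here is a minimal standalone repro of just that
behaviour; plain Java 8 with a hypothetical scratch path, nothing
Kafka-specific:

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapRenameRepro {
    public static void main(String[] args) throws Exception {
        Path log = Paths.get("segment.log");          // hypothetical scratch file
        Path renamed = Paths.get("segment.log.deleted");
        Files.write(log, new byte[4096]);

        MappedByteBuffer index;
        try (FileChannel ch = FileChannel.open(log, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            index = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        } // the channel is closed here, but the mapping still pins the file

        index.put(0, (byte) 1); // keep the mapping strongly reachable

        // On Windows this throws java.nio.file.FileSystemException: "The process
        // cannot access the file because it is being used by another process."
        // On Linux the same move succeeds, which would explain why the brokers
        // only fall over on Windows.
        Files.move(log, renamed);
    }
}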

Attached is the full log from the kafka-server-start.bat run. Meanwhile, I am
experimenting with closing the log and txnIndex (if not closed already) and
unmapping the other segments before the rename, roughly as sketched below.
The problem is that if an .append() is racing with the cleanup, this approach
might not help (potentially a ClosedChannelException if the fileChannel is
closed twice).
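
For reference, by "forcefully unmapping" I mean the usual Java 8 reflection
hack, reproduced here from memory, so please treat it as a sketch rather
than broker code (sun.misc.Cleaner is gone in Java 9+, which exposes
Unsafe.invokeCleaner instead):

import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;

public final class UnmapHack {
    // Force-release the file handle held by a MappedByteBuffer so that Windows
    // will allow a subsequent rename or delete. After this call any access to
    // the buffer can crash the JVM, so the caller must guarantee nothing
    // touches it again; that ordering is exactly the .append() clash above.
    static void unmap(MappedByteBuffer buffer) throws Exception {
        Method cleanerMethod = buffer.getClass().getMethod("cleaner");
        cleanerMethod.setAccessible(true);
        Object cleaner = cleanerMethod.invoke(buffer); // sun.misc.Cleaner on Java 8
        cleaner.getClass().getMethod("clean").invoke(cleaner);
    }
}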
Once again, thanks a lot for helping out on this.

Regards,


On 14 May 2018 at 16:55, Ted Yu <yu...@gmail.com> wrote:

> Looks like you were trying to show a screen capture, which didn't go through.
>
> Examining LogSegment.scala, I wonder if the time index was being appended
> to (which caused the deletion to fail):
>
>       // append an entry to the index (if needed)
>       if(bytesSinceLastIndexEntry > indexIntervalBytes) {
>         offsetIndex.append(largestOffset, physicalPosition)
>         timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)
>
> Can you capture the stack trace and pastebin it?
>
> If LogSegment shows up in the stack trace, we may have some idea for the
> root cause.
>
> Thanks
>
> On Mon, May 14, 2018 at 8:31 AM, M. Manna <ma...@gmail.com> wrote:
>
> > I am having some difficulties debugging the cluster. In CWIKI for Kafka,
> > I can see a setup guide for Eclipse, but nothing as such for debugger
> > setup. The issue is that every time the log cleaner thread runs, it hits
> > a FATAL shutdown.
> > I tried to close all producers and consumers, but it still locks the log
> > and transaction index segments. Forcefully unmapping the offset and time
> > index files works normally, but the log and transaction index files don't
> > work properly.
> >
> > Could anyone please advise how to set up Kafka debugging in Eclipse? I
> > have added the sdb jar noted on the scala-debugger.org site, but I couldn't
> > connect to port 5005 at all.
> >
> >
> >
> >
> > On 13 May 2018 at 10:00, M. Manna <ma...@gmail.com> wrote:
> >
> >> Hi Ted,
> >>
> >> I highly appreciate the response over the weekend, and thanks for
> >> pointing out the JIRAs.
> >>
> >> I don't believe the processes are responsible, but individual threads
> >> which are still holding the log/index files via IO streams. I am trying
> >> to walk a single-node setup through the debugger to find out which thread
> >> is locking the file. Apologies, but it's a huge application, so it might
> >> take me some time to get around it :)
> >>
> >> Please do update if you find something new.
> >>
> >> Regards,
> >>
> >> On 12 May 2018 at 22:15, Ted Yu <yu...@gmail.com> wrote:
> >>
> >>> There are two outstanding issues, KAFKA-6059 and KAFKA-6200, which bear
> >>> some resemblance.
> >>>
> >>> Can you try to find out how the other process uses the file being deleted?
> >>>
> >>> https://superuser.com/questions/117902/find-out-which-process-is-locking-a-file-or-folder-in-windows
> >>> https://www.computerhope.com/issues/ch000714.htm
> >>>
> >>> Cheers
> >>>
> >>> On Sat, May 12, 2018 at 1:42 PM, M. Manna <ma...@gmail.com> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > We are still stuck with this issue where 2.11-1.1.0 distro is failing
> >>> to
> >>> > cleanup logs on Windows and brings the entire cluster down one by
> one.
> >>> > Extending the retention hours and sizes don't help because they
> burden
> >>> the
> >>> > hard drive.
> >>> >
> >>> > Here is the log
> >>> >
> >>> > [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1]
> >>> Rolled
> >>> > > new log segment at offset 45 in 105 ms. (kafka.log.Log)
> >>> > > [2018-05-12 21:36:57,673] INFO [Log partition=test-0,
> dir=C:\kafka1]
> >>> > > Scheduling log segment [baseOffset 0, size 2290] for deletion.
> >>> > > (kafka.log.Log)
> >>> > > [2018-05-12 21:36:57,673] ERROR Error while deleting segments for
> >>> test-0
> >>> > > in dir C:\kafka1 (kafka.server.LogDirFailureChannel)
> >>> > > java.nio.file.FileSystemException:
> >>> > > C:\kafka1\test-0\00000000000000000000.log ->
> >>> > > C:\kafka1\test-0\00000000000000000000.log.deleted: The process
> >>> cannot
> >>> > > access the file because it is being used by another process.
> >>> > >
> >>> > >         at
> >>> > > sun.nio.fs.WindowsException.translateToIOException(
> >>> > WindowsException.java:86)
> >>> > >         at
> >>> > > sun.nio.fs.WindowsException.rethrowAsIOException(
> >>> > WindowsException.java:97)
> >>> > >         at sun.nio.fs.WindowsFileCopy.
> move(WindowsFileCopy.java:387)
> >>> > >         at
> >>> > > sun.nio.fs.WindowsFileSystemProvider.move(
> WindowsFileSystemProvider.
> >>> > java:287)
> >>> > >         at java.nio.file.Files.move(Files.java:1395)
> >>> > >         at
> >>> > > org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(
> >>> > Utils.java:697)
> >>> > >         at
> >>> > > org.apache.kafka.common.record.FileRecords.renameTo(
> >>> > FileRecords.java:212)
> >>> > >         at kafka.log.LogSegment.changeFil
> >>> eSuffixes(LogSegment.scala:415)
> >>> > >         at kafka.log.Log.kafka$log$Log$$a
> >>> syncDeleteSegment(Log.scala:
> >>> > 1601)
> >>> > >         at kafka.log.Log.kafka$log$Log$$d
> >>> eleteSegment(Log.scala:1588)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$
> >>> > apply$mcI$sp$1.apply(Log.scala:1170)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$
> >>> > apply$mcI$sp$1.apply(Log.scala:1170)
> >>> > >         at
> >>> > > scala.collection.mutable.ResizableArray$class.foreach(
> >>> > ResizableArray.scala:59)
> >>> > >         at
> >>> > > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(
> Log.scala:1170)
> >>> > >         at kafka.log.Log$$anonfun$deleteS
> >>> egments$1.apply(Log.scala:1161)
> >>> > >         at kafka.log.Log$$anonfun$deleteS
> >>> egments$1.apply(Log.scala:1161)
> >>> > >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> >>> > >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> >>> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> >>> > >         at kafka.log.Log.deleteRetentionMsBreachedSegme
> >>> > nts(Log.scala:1228)
> >>> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager
> >>> .scala:854)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager
> >>> .scala:852)
> >>> > >         at
> >>> > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach
> >>> $1.apply(
> >>> > TraversableLike.scala:733)
> >>> > >         at scala.collection.immutable.List.foreach(List.scala:392)
> >>> > >         at
> >>> > > scala.collection.TraversableLike$WithFilter.
> >>> > foreach(TraversableLike.scala:732)
> >>> > >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(
> >>> > LogManager.scala:385)
> >>> > >         at
> >>> > > kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(
> >>> > KafkaScheduler.scala:110)
> >>> > >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> >>> > >         at
> >>> > > java.util.concurrent.Executors$RunnableAdapter.call(Executor
> >>> s.java:511)
> >>>
> >>> > >         at java.util.concurrent.FutureTask.runAndReset(
> >>> > FutureTask.java:308)
> >>> > >         at
> >>> > > java.util.concurrent.ScheduledThreadPoolExecutor$
> >>> > ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> >>> > >         at
> >>> > > java.util.concurrent.ScheduledThreadPoolExecutor$
> >>> > ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> >>> > >         at
> >>> > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> >>> > ThreadPoolExecutor.java:1149)
> >>> > >         at
> >>> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >>> > ThreadPoolExecutor.java:624)
> >>> > >         at java.lang.Thread.run(Thread.java:748)
> >>> > >         Suppressed: java.nio.file.FileSystemException:
> >>> > > C:\kafka1\test-0\00000000000000000000.log ->
> >>> > > C:\kafka1\test-0\00000000000000000000.log.deleted: The process
> >>> cannot
> >>> > > access the file because it is being used by another process.
> >>> > >
> >>> > >                 at
> >>> > > sun.nio.fs.WindowsException.translateToIOException(
> >>> > WindowsException.java:86)
> >>> > >                 at
> >>> > > sun.nio.fs.WindowsException.rethrowAsIOException(
> >>> > WindowsException.java:97)
> >>> > >                 at
> >>> > > sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> >>> > >                 at
> >>> > > sun.nio.fs.WindowsFileSystemProvider.move(
> WindowsFileSystemProvider.
> >>> > java:287)
> >>> > >                 at java.nio.file.Files.move(Files.java:1395)
> >>> > >                 at
> >>> > > org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(
> >>> > Utils.java:694)
> >>> > >                 ... 32 more
> >>> > > [2018-05-12 21:36:57,689] INFO [ReplicaManager broker=1] Stopping
> >>> serving
> >>> > > replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
> >>> > > [2018-05-12 21:36:57,689] ERROR Uncaught exception in scheduled
> task
> >>> > > 'kafka-log-retention' (kafka.utils.KafkaScheduler)
> >>> > > org.apache.kafka.common.errors.KafkaStorageException: Error while
> >>> > deleting
> >>> > > segments for test-0 in dir C:\kafka1
> >>> > > Caused by: java.nio.file.FileSystemException:
> >>> > > C:\kafka1\test-0\00000000000000000000.log ->
> >>> > > C:\kafka1\test-0\00000000000000000000.log.deleted: The process
> >>> cannot
> >>> > > access the file because it is being used by another process.
> >>> > >
> >>> > >         at
> >>> > > sun.nio.fs.WindowsException.translateToIOException(
> >>> > WindowsException.java:86)
> >>> > >         at
> >>> > > sun.nio.fs.WindowsException.rethrowAsIOException(
> >>> > WindowsException.java:97)
> >>> > >         at sun.nio.fs.WindowsFileCopy.
> move(WindowsFileCopy.java:387)
> >>> > >         at
> >>> > > sun.nio.fs.WindowsFileSystemProvider.move(
> WindowsFileSystemProvider.
> >>> > java:287)
> >>> > >         at java.nio.file.Files.move(Files.java:1395)
> >>> > >         at
> >>> > > org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(
> >>> > Utils.java:697)
> >>> > >         at
> >>> > > org.apache.kafka.common.record.FileRecords.renameTo(
> >>> > FileRecords.java:212)
> >>> > >         at kafka.log.LogSegment.changeFil
> >>> eSuffixes(LogSegment.scala:415)
> >>> > >         at kafka.log.Log.kafka$log$Log$$a
> >>> syncDeleteSegment(Log.scala:
> >>> > 1601)
> >>> > >         at kafka.log.Log.kafka$log$Log$$d
> >>> eleteSegment(Log.scala:1588)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$
> >>> > apply$mcI$sp$1.apply(Log.scala:1170)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$
> >>> > apply$mcI$sp$1.apply(Log.scala:1170)
> >>> > >         at
> >>> > > scala.collection.mutable.ResizableArray$class.foreach(
> >>> > ResizableArray.scala:59)
> >>> > >         at
> >>> > > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >>> > >         at
> >>> > > kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(
> Log.scala:1170)
> >>> > >         at kafka.log.Log$$anonfun$deleteS
> >>> egments$1.apply(Log.scala:1161)
> >>> > >         at kafka.log.Log$$anonfun$deleteS
> >>> egments$1.apply(Log.scala:1161)
> >>> > >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> >>> > >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> >>> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> >>> > >         at kafka.log.Log.deleteRetentionMsBreachedSegme
> >>> > nts(Log.scala:1228)
> >>> > >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager
> >>> .scala:854)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager
> >>> .scala:852)
> >>> > >         at
> >>> > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach
> >>> $1.apply(
> >>> > TraversableLike.scala:733)
> >>> > >         at scala.collection.immutable.List.foreach(List.scala:392)
> >>> > >         at
> >>> > > scala.collection.TraversableLike$WithFilter.
> >>> > foreach(TraversableLike.scala:732)
> >>> > >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> >>> > >         at
> >>> > > kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(
> >>> > LogManager.scala:385)
> >>> > >         at
> >>> > > kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(
> >>> > KafkaScheduler.scala:110)
> >>> > >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> >>> > >         at
> >>> > > java.util.concurrent.Executors$RunnableAdapter.call(Executor
> >>> s.java:511)
> >>>
> >>> > >         at java.util.concurrent.FutureTask.runAndReset(
> >>> > FutureTask.java:308)
> >>> > >         at
> >>> > > java.util.concurrent.ScheduledThreadPoolExecutor$
> >>> > ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> >>> > >         at
> >>> > > java.util.concurrent.ScheduledThreadPoolExecutor$
> >>> > ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> >>> > >         at
> >>> > > java.util.concurrent.ThreadPoolExecutor.runWorker(
> >>> > ThreadPoolExecutor.java:1149)
> >>> > >         at
> >>> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >>> > ThreadPoolExecutor.java:624)
> >>> > >         at java.lang.Thread.run(Thread.java:748)
> >>> > >         Suppressed: java.nio.file.FileSystemException:
> >>> > > C:\kafka1\test-0\00000000000000000000.log ->
> >>> > > C:\kafka1\test-0\00000000000000000000.log.deleted: The process
> >>> cannot
> >>> > > access the file because it is being used by another process.
> >>> > >
> >>> > >                 at
> >>> > > sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >>> > > [remainder of the quoted broker log trimmed; it is identical to the
> >>> > > log in the original post]
> >>> >
> >>> > Could someone please share some ideas how to rectify this on
> >>> > Windows? If this will never be supported on Windows, could we get
> >>> > some official communication perhaps?
> >>> >
> >>> > Regards,
> >>> >
> >>>
> >>
> >>
> >
>

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by Ted Yu <yu...@gmail.com>.
Looks like you were trying to attach a screenshot, which didn't go through.

Examining LogSegment.scala, I wonder if the time index was being appended
to (which caused the deletion to fail):

      // append an entry to the index (if needed)
      if(bytesSinceLastIndexEntry > indexIntervalBytes) {
        offsetIndex.append(largestOffset, physicalPosition)
        timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestamp)

Can you capture the stack trace and pastebin it?

If LogSegment shows up in the stack trace, we may have some idea for the
root cause.
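
For example, a full thread dump can be captured while the broker still
holds the lock (assuming a standard JDK with the diagnostic tools on the
PATH; <broker-pid> is a placeholder):

    jps -l                            # the broker shows up as kafka.Kafka
    jstack -l <broker-pid> > broker-threads.txt

Any thread sitting in kafka.log.LogSegment, or in the log retention
scheduler, should be visible in that dump.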

Thanks

On Mon, May 14, 2018 at 8:31 AM, M. Manna <ma...@gmail.com> wrote:

> [quoted thread trimmed; each quoted message appears in full as its own
> post in this archive]

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by "M. Manna" <ma...@gmail.com>.
I am having some difficulties debugging the cluster. In the Kafka CWIKI I
can see a setup guide for Eclipse, but nothing as such for debugger setup.
The issue is that every time the log cleaner thread runs, it causes a FATAL
shutdown.
I tried to close all producers and consumers, but the broker still locks the
log and transaction index segments. Forcefully unmapping the offset and time
index files works normally, but the log and transaction index files don't
release properly.
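
For what it's worth, the index files are memory-mapped, and on Windows a
mapped file cannot be renamed or deleted while the mapping is still alive,
which is one plausible mechanism for this failure. A minimal sketch of the
effect (plain JDK only, against a hypothetical C:\tmp scratch file; this
is not Kafka code):

    import java.nio.channels.FileChannel
    import java.nio.file.{Files, Paths}
    import java.nio.file.StandardOpenOption._

    val path = Paths.get("C:\\tmp\\segment.log")     // hypothetical file
    val ch   = FileChannel.open(path, CREATE, READ, WRITE)
    val buf  = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1024) // map it
    ch.close()                     // channel closed, mapping still alive
    // On Windows the move below fails with the same "being used by
    // another process" FileSystemException until buf is unmapped or GCed:
    Files.move(path, Paths.get("C:\\tmp\\segment.log.deleted"))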

Could anyone please advise how to set up Kafka debugging in Eclipse? I have
added the sdb jar noted on the scala-debugger.org site, but I couldn't
connect to port 5005 at all.
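
One setup that may work (a sketch, assuming the stock start scripts, which
pick up KAFKA_OPTS): start the broker with the standard JDWP agent enabled,
e.g. on Windows:

    set KAFKA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
    bin\windows\kafka-server-start.bat config\server.properties

Then attach an ordinary "Remote Java Application" debug configuration in
Eclipse to localhost:5005; the sdb jar shouldn't be needed just to attach,
only for Scala-aware expression evaluation.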

On 13 May 2018 at 10:00, M. Manna <ma...@gmail.com> wrote:

> [quoted thread trimmed; each quoted message appears in full as its own
> post in this archive]

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by "M. Manna" <ma...@gmail.com>.
Hi Ted,

I highly appreciate the response over the weekend, and thanks for pointing
out the JIRAs.

I don't believe separate processes are responsible, but rather individual
threads within the broker which are still holding the log/index files via IO
streams. I am trying to walk a single-node setup through the debugger to
find out which thread is locking the file. Apologies, but it's a huge
application, so it might take me some time to get around it :)

Please do update if you find something new.

Regards,

On 12 May 2018 at 22:15, Ted Yu <yu...@gmail.com> wrote:

> [quoted thread trimmed; each quoted message appears in full as its own
> post in this archive]

Re: Kafka 2.11-1.1.0 crashes brokers and brings cluster down on Windows

Posted by Ted Yu <yu...@gmail.com>.
There are two outstanding issues, KAFKA-6059 and KAFKA-6200, which bear some
resemblance.

Can you try to find out how the other process is using the file being deleted?

https://superuser.com/questions/117902/find-out-which-process-is-locking-a-file-or-folder-in-windows
https://www.computerhope.com/issues/ch000714.htm
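
For example, the Sysinternals Handle utility (assuming downloading it is
an option; it is not bundled with Windows or Kafka) lists every process
holding an open handle under the log directory when run from an elevated
prompt:

    handle.exe C:\kafka1\test-0

The output names the owning process for each open file, which should show
whether it is the broker's own JVM that still holds the segment.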

Cheers

On Sat, May 12, 2018 at 1:42 PM, M. Manna <ma...@gmail.com> wrote:

> Hello,
>
> We are still stuck with this issue where 2.11-1.1.0 distro is failing to
> cleanup logs on Windows and brings the entire cluster down one by one.
> Extending the retention hours and sizes don't help because they burden the
> hard drive.
>
> Here is the log
>
> [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1] Rolled
> > new log segment at offset 45 in 105 ms. (kafka.log.Log)
> > [2018-05-12 21:36:57,673] INFO [Log partition=test-0, dir=C:\kafka1]
> > Scheduling log segment [baseOffset 0, size 2290] for deletion.
> > (kafka.log.Log)
> > [2018-05-12 21:36:57,673] ERROR Error while deleting segments for test-0
> > in dir C:\kafka1 (kafka.server.LogDirFailureChannel)
> > java.nio.file.FileSystemException:
> > C:\kafka1\test-0\00000000000000000000.log ->
> > C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot
> > access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> >         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >         at java.nio.file.Files.move(Files.java:1395)
> >         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
> >         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
> >         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
> >         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
> >         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
> >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> >         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
> >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
> >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
> >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> >         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
> >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> >         Suppressed: java.nio.file.FileSystemException:
> > C:\kafka1\test-0\00000000000000000000.log ->
> > C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot
> > access the file because it is being used by another process.
> >
> >                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> >                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >                 at java.nio.file.Files.move(Files.java:1395)
> >                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
> >                 ... 32 more
> > [2018-05-12 21:36:57,689] INFO [ReplicaManager broker=1] Stopping serving
> > replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
> > [2018-05-12 21:36:57,689] ERROR Uncaught exception in scheduled task
> > 'kafka-log-retention' (kafka.utils.KafkaScheduler)
> > org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for test-0 in dir C:\kafka1
> > Caused by: java.nio.file.FileSystemException:
> > C:\kafka1\test-0\00000000000000000000.log ->
> > C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot
> > access the file because it is being used by another process.
> >
> >         at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >         at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >         at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> >         at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >         at java.nio.file.Files.move(Files.java:1395)
> >         at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
> >         at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
> >         at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
> >         at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1601)
> >         at kafka.log.Log.kafka$log$Log$$deleteSegment(Log.scala:1588)
> >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> >         at kafka.log.Log$$anonfun$deleteSegments$1$$anonfun$apply$mcI$sp$1.apply(Log.scala:1170)
> >         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply$mcI$sp(Log.scala:1170)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> >         at kafka.log.Log$$anonfun$deleteSegments$1.apply(Log.scala:1161)
> >         at kafka.log.Log.maybeHandleIOException(Log.scala:1678)
> >         at kafka.log.Log.deleteSegments(Log.scala:1161)
> >         at kafka.log.Log.deleteOldSegments(Log.scala:1156)
> >         at kafka.log.Log.deleteRetentionMsBreachedSegments(Log.scala:1228)
> >         at kafka.log.Log.deleteOldSegments(Log.scala:1222)
> >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:854)
> >         at kafka.log.LogManager$$anonfun$cleanupLogs$3.apply(LogManager.scala:852)
> >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> >         at scala.collection.immutable.List.foreach(List.scala:392)
> >         at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> >         at kafka.log.LogManager.cleanupLogs(LogManager.scala:852)
> >         at kafka.log.LogManager$$anonfun$startup$1.apply$mcV$sp(LogManager.scala:385)
> >         at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> >         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >         at java.lang.Thread.run(Thread.java:748)
> >         Suppressed: java.nio.file.FileSystemException:
> > C:\kafka1\test-0\00000000000000000000.log ->
> > C:\kafka1\test-0\00000000000000000000.log.deleted: The process cannot
> > access the file because it is being used by another process.
> >
> >                 at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
> >                 at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
> >                 at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
> >                 at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
> >                 at java.nio.file.Files.move(Files.java:1395)
> >                 at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
> >                 ... 32 more
> > [2018-05-12 21:36:57,689] INFO [ReplicaFetcherManager on broker 1] Removed
> > fetcher for partitions
> > __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> > (kafka.server.ReplicaFetcherManager)
> > [2018-05-12 21:36:57,689] INFO [ReplicaAlterLogDirsManager on broker 1]
> > Removed fetcher for partitions
> > __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> > (kafka.server.ReplicaAlterLogDirsManager)
> > [2018-05-12 21:36:57,751] INFO [ReplicaManager broker=1] Broker 1 stopped
> > fetcher for partitions
> > __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,test-0,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,test1-0,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40
> > and stopped moving logs for partitions  because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
> > [2018-05-12 21:36:57,751] INFO Stopping serving logs in dir C:\kafka1
> > (kafka.log.LogManager)
> > [2018-05-12 21:36:57,767] ERROR Shutdown broker because all log dirs in
> > C:\kafka1 have failed (kafka.log.LogManager)
> >
>
> Could someone please share some ideas how to rectify this on Windows? If
> this will never be supported on Windows, could we get some official
> communication perhaps?
>
> Regards,
>