You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Ignite TC Bot (Jira)" <ji...@apache.org> on 2019/09/05 12:45:00 UTC

[jira] [Commented] (IGNITE-12127) WAL writer may close file IO with unflushed changes when MMAP is disabled

    [ https://issues.apache.org/jira/browse/IGNITE-12127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923392#comment-16923392 ] 

Ignite TC Bot commented on IGNITE-12127:
----------------------------------------

{panel:title=Branch: [pull/6840/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
[TeamCity *--&gt; Run :: All* Results|https://ci.ignite.apache.org/viewLog.html?buildId=4567088&amp;buildTypeId=IgniteTests24Java8_RunAll]

> WAL writer may close file IO with unflushed changes when MMAP is disabled
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-12127
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12127
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Dmitriy Govorukhin
>            Assignee: Dmitriy Govorukhin
>            Priority: Critical
>             Fix For: 2.7.6
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Most likely the issue manifests itself as the following critical error:
> {code}
> 2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT : Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Failed to write buffer.]]
> org.apache.ignite.internal.processors.cache.persistence.StorageException: Failed to write buffer.
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444) [ignite-core-2.5.7.jar!/:2.5.7]
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249) [ignite-core-2.5.7.jar!/:2.5.7]
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110) [ignite-core-2.5.7.jar!/:2.5.7]
>     at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
> Caused by: java.nio.channels.ClosedChannelException: null
>     at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110) ~[na:1.8.0_201]
>     at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253) ~[na:1.8.0_201]
>     at org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48) ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41) ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111) ~[ignite-core-2.5.7.jar!/:2.5.7]
>     at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437) [ignite-core-2.5.7.jar!/:2.5.7]
>     ... 3 common frames omitted
> {code}
> It appears that there following sequence is possible:
>  * Thread A attempts to log a large record which does not fit segment, {{addRecord}} fails and the thread A starts segment rollover. I successfully runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment record
>  * Thread B attempts to log another record, which fits exactly till the end of the current segment. The record is added to the buffer
>  * Thread A resumes and fails to add the switch segment record. No flush is performed and the thread immediately proceeds for wal-writer close
>  * WAL writer thread wakes up, sees that there is a CLOSE request, closes the file IO and immediately proceeds to write unflushed changes causing the exception.
> Unconditional flush after switch segment record write should fix the issue.
> Besides the bug itself, I suggest the following changes to the {{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
>  * There is an {{fsync(filePtr)}} call inside {{close()}}; however, {{fsync()}} checks the {{stop}} flag (which is set inside {{close}}) and returns immediately after {{flushOrWait()}} if the flag is set - this is very confusing. After all, the {{close()}} itself explicitly calls {{force}} after flush
>  * There is an ignored IO exception in mmap mode - this should be propagated to the failure handler
>  * In WAL writer, we check for file CLOSE and then attemp to write to (possibly) the same write handle - write should be always before close
>  * In WAL writer, there are racy reads of current handle - it would be better if we read the current handle once and then operate on it during the whole loop iteration



--
This message was sent by Atlassian Jira
(v8.3.2#803003)