You are viewing a plain text version of this content. The canonical link for it is here.
Posted to oak-issues@jackrabbit.apache.org by "Francesco Mari (JIRA)" <ji...@apache.org> on 2017/12/01 14:11:00 UTC

[jira] [Comment Edited] (OAK-6884) TarMK disk space check is not synchronized with FileStore opened state

    [ https://issues.apache.org/jira/browse/OAK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274443#comment-16274443 ] 

Francesco Mari edited comment on OAK-6884 at 12/1/17 2:10 PM:
--------------------------------------------------------------

[~mduerig], I don't think that's sufficient. {{ShutDown#shutDown}} might be called from the main thread after the check on {{ShutDown#shutDownRequested}} in the disk space check thread, but before the call to {{FileStore#size}}. 

To implement a safer solution it's necessary to create another synchronization primitive in {{ShutDown}} that allows us to do the following in the disk space thread.

{noformat}
try (ShutDownCloser ignored = shutDown.tryKeepAlive()) {
    if (shutDown.shutDownRequested()) {
        return;
    }
    checkDiskSpace(gcOptions);
}
{noformat}

The implementation of {{tryKeepAlive}} (naming could be improved) is equivalent to {{keepAlive}}, except for the lack of the check on {{shutdownRequested}} and the {{IllegalStateException}}. {{tryKeepAlive}} is necessary to properly synchronize with {{ShutDown#shutDown}}. Callers of {{tryKeepAlive}} must always check {{ShutDown#shutDownRequested}} as soon as they enter in the try-with-resource block, and exit if the check is positive. There are only two possible reasons why the check could be positive. First, a user asked the {{FileStore}} to shut down, but nothing has been closed yet. Second, the {{FileStore}} has already been shut down. Whatever the reason, it is necessary to bail out either for responsiveness or for correctness.

By the way, while looking at the code I figured out that the flush thread suffers from the same race condition.


was (Author: frm):
[~mduerig], I don't think that's sufficient. {{ShutDown#shutDown}} might be called from the main thread after the check on {{ShutDown#shutDownRequested}} in the disk space check thread, but before the call to {{FileStore#size}}. 

To implement a safer solution it's necessary to create another synchronization primitive in {{ShutDown}} that allows us to do the following in the disk space thread.

{noformat}
try (ShutDownCloser ignored = shutDown.tryKeepAlive()) {
    if (shutDown.shutDownRequested()) {
        return;
    }
    checkDiskSpace(gcOptions);
}
(noformat}

The implementation of {{tryKeepAlive}} (naming could be improved) is equivalent to {{keepAlive}}, except for the lack of the check on {{shutdownRequested}} and the {{IllegalStateException}}. {{tryKeepAlive}} is necessary to properly synchronize with {{ShutDown#shutDown}}. Callers of {{tryKeepAlive}} must always check {{ShutDown#shutDownRequested}} as soon as they enter in the try-with-resource block, and exit if the check is positive. There are only two possible reasons why the check could be positive. First, a user asked the {{FileStore}} to shut down, but nothing has been closed yet. Second, the {{FileStore}} has already been shut down. Whatever the reason, it is necessary to bail out either for responsiveness or for correctness.

By the way, while looking at the code I figured out that the flush thread suffers from the same race condition.

> TarMK disk space check is not synchronized with FileStore opened state
> ----------------------------------------------------------------------
>
>                 Key: OAK-6884
>                 URL: https://issues.apache.org/jira/browse/OAK-6884
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar, upgrade
>    Affects Versions: 1.7.9
>            Reporter: Arek Kita
>            Assignee: Michael Dürig
>            Priority: Minor
>              Labels: concurrency, production
>             Fix For: 1.8, 1.7.14
>
>
> It seems that the disk space check is not properly synchronized with {{FileStore}} as I revealed a race condition while using oak-upgrade during migration to {{segment-tar}}.
> The {{FileStore}} instance is closed while TarMK disk check tries to execute and it seems it is dependent on the state of segment ({{org.apache.jackrabbit.oak.segment.file.FileStore.checkDiskSpace(FileStore.java:541)}} that needs to be opened. 
> {noformat}
> 30.10.2017 11:26:05.834 WARN   o.a.j.o.s.f.Scheduler: The scheduler FileStore background tasks takes too long to shut down
> 30.10.2017 11:26:11.674 INFO   o.a.j.o.s.f.FileStore: TarMK closed: /data/cq/crx-quickstart/repository-segment-tar-20171030-112401/segmentstore
> 30.10.2017 11:26:11.676 ERROR  o.a.j.o.s.f.SafeRunnable: Uncaught exception in TarMK disk space check [/data/cq/crx-quickstart/repository-segment-tar-20171030-112401/segmentstore]
> java.lang.IllegalStateException: already shut down
>     at org.apache.jackrabbit.oak.segment.file.ShutDown.keepAlive(ShutDown.java:42)
>     at org.apache.jackrabbit.oak.segment.file.FileStore.size(FileStore.java:302)
>     at org.apache.jackrabbit.oak.segment.file.FileStore.checkDiskSpace(FileStore.java:541)
>     at org.apache.jackrabbit.oak.segment.file.FileStore.access$300(FileStore.java:102)
>     at org.apache.jackrabbit.oak.segment.file.FileStore$3.run(FileStore.java:237)
>     at org.apache.jackrabbit.oak.segment.file.SafeRunnable.run(SafeRunnable.java:67)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)