You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Ilya Kazakov (Jira)" <ji...@apache.org> on 2021/08/11 07:46:00 UTC
[jira] [Updated] (IGNITE-15192) Fix race in Checkpointer listeners invocation and illegal Checkpointer-heartbeat update from different threads

     [ https://issues.apache.org/jira/browse/IGNITE-15192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Kazakov updated IGNITE-15192:
----------------------------------
    Description: 
It is about race which was detected in https://issues.apache.org/jira/browse/IGNITE-15099.


The fix from the ticket above fixed the wrong heartbeat, but do not fix a race, which allows checkpointer thread go ahead and do not await on ctx0.awaitPendingTasksFinished() (in ChecpointWorkflow.markCheckpointBegin ), which leads to:
 - checkpointer thread enter in blocking section
 - and after this a checkpointer thread hertbeat could be updated by prallel thread.

 
{code:java}
// CheckpointContextImpl#executor

@Override public Executor executor() {
        return asyncRunner == null ? null : cmd -> {
            try {
                GridFutureAdapter<?> res = new GridFutureAdapter<>();

                res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // Listener is invoked concurrently with pending future finish

                asyncRunner.execute(U.wrapIgniteFuture(cmd, res));

                pendingTaskFuture.add(res);
            }
            catch (RejectedExecutionException e) {
                assert false : "A task should never be rejected by async runner";
            }
        };
    }


{code}
 
{code:java}
// Checkpointer#waitCheckpointEvent

try {
    synchronized (this) {
        long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());

        while (remaining > 0 && !isCancelled()) {
            blockingSectionBegin();

            try {
                wait(remaining); 
                // At this point and till blockingSectionEnd call heartbeat should be equal to Long.MAX_VALUE

                remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
            }
            finally {
                blockingSectionEnd();
            }
        }
    }
}
{code}
Discussion is here: [https://lists.apache.org/thread.html/r789abd9005d70a8fa1de29d3af394069e859ca6e1eea8bfd3e3e0494%40%3Cdev.ignite.apache.org%3E]

  was:
It is about race which was detected in https://issues.apache.org/jira/browse/IGNITE-15099.

 
{code:java}
// CheckpointContextImpl#executor

@Override public Executor executor() {
        return asyncRunner == null ? null : cmd -> {
            try {
                GridFutureAdapter<?> res = new GridFutureAdapter<>();

                res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // Listener is invoked concurrently with pending future finish

                asyncRunner.execute(U.wrapIgniteFuture(cmd, res));

                pendingTaskFuture.add(res);
            }
            catch (RejectedExecutionException e) {
                assert false : "A task should never be rejected by async runner";
            }
        };
    }


{code}
 
{code:java}
// Checkpointer#waitCheckpointEvent

try {
    synchronized (this) {
        long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());

        while (remaining > 0 && !isCancelled()) {
            blockingSectionBegin();

            try {
                wait(remaining); 
                // At this point and till blockingSectionEnd call heartbeat should be equal to Long.MAX_VALUE

                remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
            }
            finally {
                blockingSectionEnd();
            }
        }
    }
}
{code}



Discussion is here: https://lists.apache.org/thread.html/r789abd9005d70a8fa1de29d3af394069e859ca6e1eea8bfd3e3e0494%40%3Cdev.ignite.apache.org%3E


> Fix race in Checkpointer listeners invocation and illegal Checkpointer-heartbeat update from different threads
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-15192
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15192
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.10
>            Reporter: Ilya Kazakov
>            Assignee: Ilya Kazakov
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It is about race which was detected in https://issues.apache.org/jira/browse/IGNITE-15099.
> The fix from the ticket above fixed the wrong heartbeat, but do not fix a race, which allows checkpointer thread go ahead and do not await on ctx0.awaitPendingTasksFinished() (in ChecpointWorkflow.markCheckpointBegin ), which leads to:
>  - checkpointer thread enter in blocking section
>  - and after this a checkpointer thread hertbeat could be updated by prallel thread.
>  
> {code:java}
> // CheckpointContextImpl#executor
> @Override public Executor executor() {
>         return asyncRunner == null ? null : cmd -> {
>             try {
>                 GridFutureAdapter<?> res = new GridFutureAdapter<>();
>                 res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // Listener is invoked concurrently with pending future finish
>                 asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
>                 pendingTaskFuture.add(res);
>             }
>             catch (RejectedExecutionException e) {
>                 assert false : "A task should never be rejected by async runner";
>             }
>         };
>     }
> {code}
>  
> {code:java}
> // Checkpointer#waitCheckpointEvent
> try {
>     synchronized (this) {
>         long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
>         while (remaining > 0 && !isCancelled()) {
>             blockingSectionBegin();
>             try {
>                 wait(remaining); 
>                 // At this point and till blockingSectionEnd call heartbeat should be equal to Long.MAX_VALUE
>                 remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
>             }
>             finally {
>                 blockingSectionEnd();
>             }
>         }
>     }
> }
> {code}
> Discussion is here: [https://lists.apache.org/thread.html/r789abd9005d70a8fa1de29d3af394069e859ca6e1eea8bfd3e3e0494%40%3Cdev.ignite.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)