Posted to issues@ignite.apache.org by "Ilya Shishkov (Jira)" <ji...@apache.org> on 2021/07/09 15:23:00 UTC

[jira] [Updated] (IGNITE-15099) Wrong heartbeat update while waiting for a checkpoint by timeout

     [ https://issues.apache.org/jira/browse/IGNITE-15099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Shishkov updated IGNITE-15099:
-----------------------------------
    Labels: ise  (was: )

> Wrong heartbeat update while waiting for a checkpoint by timeout
> ----------------------------------------------------------------
>
>                 Key: IGNITE-15099
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15099
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.11, 2.12
>            Reporter: Ilya Shishkov
>            Priority: Minor
>              Labels: ise
>
> This problem occurs under these conditions:
>  * native persistence is turned on
>  * failureDetectionTimeout < checkpointFrequency
>  * checkpoints are sometimes skipped by timeout (the more often this happens, the more likely the problem is to occur)
> There is a race condition between the execution of a listener and the completion of a pending future (see the body of CheckpointContextImpl#executor [1]). In some cases the future can complete before the listener closure runs, so the heartbeat update in the listener can occur after the call to _blockingSectionBegin_ in Checkpointer#waitCheckpointEvent, i.e. after the Checkpointer has started waiting for the next checkpoint (see [2]).
> {code:java|title=CheckpointContextImpl#executor}
>     @Override public Executor executor() {
>         return asyncRunner == null ? null : cmd -> {
>             try {
>                 GridFutureAdapter<?> res = new GridFutureAdapter<>();
>                 res.listen(fut -> heartbeatUpdater.updateHeartbeat()); // The listener is invoked concurrently with completion of the pending future
>                 asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
>                 pendingTaskFuture.add(res);
>             }
>             catch (RejectedExecutionException e) {
>                 assert false : "A task should never be rejected by async runner";
>             }
>         };
>     }
> {code}
> {code:java|title=Checkpointer#waitCheckpointEvent}
> try {
>     synchronized (this) {
>         long remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
>         while (remaining > 0 && !isCancelled()) {
>             blockingSectionBegin();
>             try {
>                 wait(remaining);
>                 // From this point until blockingSectionEnd is called, the heartbeat should be equal to Long.MAX_VALUE
>                 remaining = U.nanosToMillis(scheduledCp.nextCpNanos - System.nanoTime());
>             }
>             finally {
>                 blockingSectionEnd();
>             }
>         }
>     }
> }
> {code}
>  
> If the interval between checkpoints (_checkpointFrequency_) is greater than the _failureDetectionTimeout_, then the heartbeat update in _blockingSectionEnd_ may cause an error message in the log, because the checkpoint thread is treated as blocked (although in fact it was not).
>  
> Links:
>  # [https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointContextImpl.java#L104]
>  # [https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/Checkpointer.java#L816]
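
The interleaving described above can be replayed deterministically in a small standalone sketch. This is not Ignite code: the class HeartbeatRaceSketch, the simplified updateHeartbeat and blockingSectionBegin stand-ins, the looksBlocked check and all timing values are assumptions made for illustration only. It shows how a listener that runs after blockingSectionBegin overwrites the Long.MAX_VALUE marker, so that a liveness check made during a long wait might see a stale heartbeat.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Deterministic replay of the interleaving described in the issue (illustrative, not Ignite code). */
public class HeartbeatRaceSketch {
    /** Stand-in for the worker heartbeat timestamp; Long.MAX_VALUE marks a blocking section. */
    static final AtomicLong heartbeatTs = new AtomicLong();

    /** Simplified stand-in: marks the worker as intentionally waiting. */
    static void blockingSectionBegin() {
        heartbeatTs.set(Long.MAX_VALUE);
    }

    /** Simplified stand-in: unconditionally stores the given timestamp. */
    static void updateHeartbeat(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Hypothetical liveness check: the worker looks blocked if its heartbeat is older than the timeout. */
    static boolean looksBlocked(long nowMs, long failureDetectionTimeoutMs) {
        return nowMs - heartbeatTs.get() > failureDetectionTimeoutMs;
    }

    /**
     * Replays one ordering of the two racing steps and reports whether the
     * worker looks blocked after waiting for waitedMs milliseconds.
     */
    static boolean replay(boolean listenerRunsLate, long failureDetectionTimeoutMs, long waitedMs) {
        long t0 = 1_000; // arbitrary start time in milliseconds

        if (!listenerRunsLate)
            updateHeartbeat(t0);  // expected order: listener runs before the wait starts

        blockingSectionBegin();   // the checkpointer starts waiting for the next checkpoint

        if (listenerRunsLate)
            updateHeartbeat(t0);  // racy order: listener overwrites Long.MAX_VALUE

        return looksBlocked(t0 + waitedMs, failureDetectionTimeoutMs);
    }

    public static void main(String[] args) {
        // failureDetectionTimeout = 10 s, wait until the next checkpoint = 60 s (assumed values)
        System.out.println(replay(false, 10_000, 60_000)); // prints false
        System.out.println(replay(true, 10_000, 60_000));  // prints true
    }
}
```

With the assumed values, the wait (60 s) exceeds the timeout (10 s), which corresponds to the failureDetectionTimeout < checkpointFrequency condition listed in the description; only the late-listener ordering makes the waiting thread look blocked.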


