Posted to issues@ignite.apache.org by "Ilya Kasnacheev (JIRA)" <ji...@apache.org> on 2018/01/25 16:15:00 UTC

[jira] [Commented] (IGNITE-7540) Sequential checkpoints cause overwrite of already cleaned & freed offheap page

    [ https://issues.apache.org/jira/browse/IGNITE-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339438#comment-16339438 ] 

Ilya Kasnacheev commented on IGNITE-7540:
-----------------------------------------

The proposed fix is to mark caches for destruction in GridCacheProcessor.onExchangeDone() before waiting for the checkpoint to finish, so that the next checkpoint does not touch these caches.
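
To make the ordering concrete, here is a rough toy model of the race and of the proposed ordering. It is not Ignite code: ToyStore, checkpointBegin() and metaPageAfterStop() are hypothetical names invented for this sketch, the value written to the page is arbitrary, and the interleaving is replayed on a single thread rather than with a real checkpointer thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these types are Ignite classes. */
class CheckpointOrderSketch {
    /** Hypothetical stand-in for a cache store that owns one metadata page. */
    static final class ToyStore {
        final long[] metaPage = new long[1]; // 0 models a zeroed (freed) page
        boolean markedForDestruction;
    }

    /** Hypothetical stand-in for the "checkpoint begin" step of the checkpointer. */
    static void checkpointBegin(List<ToyStore> stores) {
        for (ToyStore s : stores) {
            if (s.markedForDestruction)
                continue; // proposed behaviour: skip caches that are being stopped

            s.metaPage[0] = 0xCAFEL; // models the metadata page being saved (overwritten)
        }
    }

    /**
     * Replays the problematic interleaving on one thread.
     *
     * @param markBeforeWait true to mark the cache before waiting for the checkpoint.
     * @return Content of the metadata page after the next checkpoint has begun.
     */
    static long metaPageAfterStop(boolean markBeforeWait) {
        ToyStore store = new ToyStore();
        List<ToyStore> stores = new ArrayList<>();
        stores.add(store);

        checkpointBegin(stores); // a regular checkpoint while the cache is alive

        if (markBeforeWait)
            store.markedForDestruction = true;

        // waitForCheckpoint("caches stop") would return here; nothing to model.

        store.metaPage[0] = 0; // cache destroyed, page cleared and freed

        checkpointBegin(stores); // the next checkpoint starts immediately

        return store.metaPage[0];
    }

    public static void main(String[] args) {
        System.out.println("old order leaves stale metadata: " + (metaPageAfterStop(false) != 0));
        System.out.println("proposed order leaves stale metadata: " + (metaPageAfterStop(true) != 0));
    }
}
```

In this model the old order leaves a non-zero page behind after it was cleared, while marking the cache first keeps the page zeroed. How the real checkpointer would consult such a mark is left open here.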

> Sequential checkpoints cause overwrite of already cleaned & freed offheap page
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-7540
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7540
>             Project: Ignite
>          Issue Type: Bug
>          Components: persistence
>    Affects Versions: 2.4
>            Reporter: Ilya Kasnacheev
>            Assignee: Alexey Goncharuk
>            Priority: Major
>         Attachments: IgnitePdsDestroyCacheTest.java
>
>
> The sequence of events is as follows:
> In GridCacheProcessor.onExchangeDone(), sharedCtx.database().waitForCheckpoint("caches stop") is performed, and then the cache is destroyed and all its pages are freed and cleared asynchronously.
> However, it is entirely possible that the next checkpoint starts immediately after waitForCheckpoint() returns. This is typical when a lot of data is being loaded into Ignite, which rapidly depletes the checkpoint buffer, as well as with an artificially increased checkpoint frequency, as used in the reproducer.
> The checkpointer will then save (overwrite) the metadata page:
> {code:java}
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1330)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:428)
>     at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:422)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:375)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.onCheckpointBegin(GridCacheOffheapManager.java:163)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.markCheckpointBegin(GridCacheDatabaseSharedManager.java:2309)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2088)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2013)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:748){code}
> This happens after the cache is already destroyed, and even after the page has been zeroed by PageMemoryImpl$ClearSegmentRunnable.run().
> Then a new cache is created, and in GridCacheOffheapManager$GridCacheDataStore.getOrAllocatePartitionMetas(), pageMem.acquirePage() returns this page, which is expected to be zeroed but actually contains metadata for the old cache's partition. The type == PageIO.T_PART_META check then returns true and the following exception is thrown, leading to cache state inconsistency and data loss:
> {code:java}
> Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
>     at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.init(PagesList.java:175)
>     at org.apache.ignite.internal.processors.cache.persistence.freelist.FreeListImpl.<init>(FreeListImpl.java:370)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore$1.<init>(GridCacheOffheapManager.java:932)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.init0(GridCacheOffheapManager.java:929)
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1295)
>     at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:344)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:3191)
>     at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:2571)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2096)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:140)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:397)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:302)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:59)
>     at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:89)
>     ... 6 more{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)