Posted to user@ignite.apache.org by Emmanuel Marchand <em...@exensa.com> on 2018/06/05 09:05:21 UTC

Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Hi,

I'm testing v2.5 vs v2.4 with a persisted data region and *walMode = NONE*.
While performance seems better, I failed to restart the cluster after
what I think was a proper shutdown (using top -deactivate, then kill -k
from visor).
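
For reference, here is a minimal Java sketch of the kind of setup I mean.
It is not my actual configuration file: the class name, region name and
sizes below are only placeholders.

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.WALMode;

public class WalNoneConfigSketch {
    public static void main(String[] args) {
        // Persisted data region; name and size are placeholders.
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("persisted")
            .setPersistenceEnabled(true)
            .setMaxSize(8L * 1024 * 1024 * 1024);

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setWalMode(WALMode.NONE)   // the setting under test
            .setDefaultDataRegionConfiguration(region);

        Ignition.start(new IgniteConfiguration()
            .setDataStorageConfiguration(storage));
    }
}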

When I try to reactivate the cluster (using top -activate from visor) I
get the following exception on each node:

[09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi] 
Accepted incoming communication connection [locAddr=/192.168.1.1:47100, 
rmtAddr=/192.168.1.102:44646]
[09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending 
activate request with BaselineTopology null
[09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor] 
Received activate request with BaselineTopology: null
[09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor] 
Started state transition: true
[09:21:37,687][INFO][exchange-worker-#52][time] Started exchange init 
[topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1], crd=true, 
evt=DISCOVERY_CUSTOM_EVT, evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60, 
customEvt=ChangeGlobalStateMessage 
[id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd, 
reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c, 
initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, activate=true, 
baselineTopology=null, forceChangeBaselineTopology=false, 
timestamp=1528183297656], allowMerge=false]
[09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture] 
Start activation process [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, 
client=false, topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
[09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager] Resolved 
page store work directory: 
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager] 
Resolved write ahead log work directory: 
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager] 
Resolved write ahead log archive directory: 
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager] 
Started write-ahead log manager in NONE mode, persisted data may be lost 
in a case of unexpected node failure. Make sure to deactivate the 
cluster before shutdown.
[09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started page 
memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB, 
checkpointBuffer=100.0 MiB]
[09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started page 
memory [memoryAllocated=8.0 GiB, pages=2032836, tableSize=158.1 MiB, 
checkpointBuffer=2.0 GiB]
[09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started page 
memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB, 
checkpointBuffer=100.0 MiB]
[09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager] 
Read checkpoint status 
[startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin, 
endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
[09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager] 
Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0, 
len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0], 
lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
[09:21:38,228][SEVERE][exchange-worker-#52][] Critical system error 
detected. Will be handled accordingly to configured handler [hnd=class 
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext 
[type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: 
Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0, 
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive 
is missing.]]
class org.apache.ignite.internal.pagemem.wal.StorageException: Restore 
wal pointer = null, while status.endPtr = FileWALPointer [idx=0, 
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive 
is missing.
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
[09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be halted 
immediately due to the failure: [failureCtx=FailureContext 
[type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: 
Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0, 
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive 
is missing.]]

This operation completed successfully with version 2.4.

The exception was introduced by IGNITE-8393
<https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>,
but the problem may lie deeper, e.g. in the checkpoint process, which
stores a status.endPtr = FileWALPointer even when walMode = NONE.
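
To make that suspicion concrete, the check implied by the message above
would be something like the following. This is a purely hypothetical
sketch reconstructed from the log, not the actual code in
GridCacheDatabaseSharedManager; all names here are simplified stand-ins:

/** Hypothetical sketch of the restore-time check implied by the error above.
 *  NOT the actual Ignite code; types and names are simplified stand-ins. */
final class RestoreCheckSketch {
    /** Simplified stand-in for FileWALPointer. */
    static final class WalPtr {
        final long idx; final int fileOff; final int len;

        WalPtr(long idx, int fileOff, int len) {
            this.idx = idx; this.fileOff = fileOff; this.len = len;
        }

        @Override public String toString() {
            return "FileWALPointer [idx=" + idx + ", fileOff=" + fileOff + ", len=" + len + "]";
        }
    }

    /**
     * With walMode=NONE nothing is ever written to the WAL, so the pointer
     * restored by replaying the WAL is null, while the checkpoint END marker
     * still carries a (zero) end pointer - and the two disagree.
     */
    static void checkRestored(WalPtr restoredFromWal, WalPtr statusEndPtr) {
        if (restoredFromWal == null && statusEndPtr != null)
            throw new IllegalStateException("Restore wal pointer = null, while status.endPtr = "
                + statusEndPtr + ". Can't restore memory - critical part of WAL archive is missing.");
    }
}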

Any idea?


Regards,
---
Emmanuel.

Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Emmanuel Marchand <em...@exensa.com>.
Hi,

Here is some more information:

  * the data regions are as in the provided configuration file; I guess
    there is only one.
  * I have 3 caches and I update all 3, using one data streamer with a
    custom receiver for each (a rough sketch follows this list).
  * none of the caches uses backups, and the atomicity mode is set to ATOMIC
  * the caches are indeed partitioned; keys use affinityKeyMapped and
    implement Comparable<K> and Externalizable
  * values use indexing
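
For context, the keys and the streaming code look roughly like this. It is
a simplified sketch only: the class, cache and field names are placeholders,
and the real receivers do more work per cache.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.stream.StreamReceiver;

/** Simplified key: affinity-mapped, Comparable and Externalizable. */
class MyKey implements Comparable<MyKey>, Externalizable {
    @AffinityKeyMapped
    private long groupId;   // placeholder affinity field
    private long id;

    public MyKey() {}       // no-arg constructor required by Externalizable

    public MyKey(long groupId, long id) { this.groupId = groupId; this.id = id; }

    @Override public int compareTo(MyKey o) { return Long.compare(id, o.id); }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(groupId);
        out.writeLong(id);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        groupId = in.readLong();
        id = in.readLong();
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof MyKey)) return false;
        MyKey k = (MyKey)o;
        return k.groupId == groupId && k.id == id;
    }

    @Override public int hashCode() { return Long.hashCode(31 * groupId + id); }
}

/** One data streamer per cache, each with its own custom receiver. */
class StreamingSketch {
    static void load(Ignite ignite) {
        try (IgniteDataStreamer<MyKey, Double> streamer = ignite.dataStreamer("cache1")) {
            streamer.receiver((StreamReceiver<MyKey, Double>)(cache, entries) -> {
                // Custom merge logic stands in for the real per-cache receivers.
                for (Map.Entry<MyKey, Double> e : entries) {
                    Double old = cache.get(e.getKey());
                    cache.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
                }
            });

            for (long i = 0; i < 1_000_000; i++)
                streamer.addData(new MyKey(i % 1024, i), (double)i);
        }
    }
}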

About your test:

  * do you add enough data to trigger a checkpoint? Maybe try to reduce
    setMaxSize?
  * I don't set checkpointPageBufferSize
  * are you sure the checkpoint folder is not empty before restart?

Thanks for your help.

Regards,
---
Emmanuel.

On 05/06/18 17:26, Andrey Mashenkov wrote:
> Hi,
>
> I've made a simple test and still can't reproduce this.
>
> Would you please take a look in case I missed something?
> Is it possible you have more than one region and/or cache?


Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

I've made a simple test and still can't reproduce this.

Would you please take a look in case I missed something?
Is it possible you have more than one region and/or cache?



-- 
Best regards,
Andrey V. Mashenkov

Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Emmanuel Marchand <em...@exensa.com>.
Hi,

I'm not sure I can provide a reproducer, but here is some information:

  * configuration file attached
  * 2 server nodes, 1 client (+ visor)
  * I'm streaming enough data to trigger a checkpoint with reason 'too
    many dirty pages'
      o [INFO][db-checkpoint-thread-#69][GridCacheDatabaseSharedManager]
        Checkpoint started
        [checkpointId=225ef67d-2850-499f-860d-f7868f1f73ec,
        startPtr=FileWALPointer [idx=0, fileOff=0, len=0],
        checkpointLockWait=151ms, checkpointLockHoldTime=17ms,
        walCpRecordFsyncDuration=0ms, pages=1508362, reason='too many
        dirty pages']
  * no error occurs
  * deactivate cluster then stop nodes using visor
  * restart nodes
  * activate cluster using visor -> crash

Workaround: if I delete (or rename) the checkpoint folder (which is
*not* empty), the activation completes successfully.
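
For what it's worth, the scenario can also be expressed programmatically
instead of through visor. This is only a rough sketch (class name is a
placeholder; discovery and storage settings are omitted), not the actual
code I ran:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ReproScenarioSketch {
    public static void main(String[] args) {
        // Client node; the real discovery and storage settings are assumed here.
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);

        try (Ignite client = Ignition.start(cfg)) {
            client.cluster().active(true);   // equivalent of 'top -activate' in visor

            // ... stream enough data to trigger a 'too many dirty pages' checkpoint ...

            client.cluster().active(false);  // equivalent of 'top -deactivate' in visor
        }

        // Then stop the server nodes, restart them and activate again:
        // on that second activation the nodes crash with the StorageException above.
    }
}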

Regards,
---
Emmanuel.


Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,

I can't reproduce the issue.
Is it possible that the grid configuration was changed between runs?
Is it possible to share a reproducer?


-- 
Best regards,
Andrey V. Mashenkov

Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Emmanuel Marchand <em...@exensa.com>.
Hi Andrey,

Thanks for your time.

Regards,
---
Emmanuel.

On 08/06/18 16:42, Andrey Mashenkov wrote:
> Hi Emmanuel,
>
> Sorry for the late answer.
> I found that I had been checking your test against the master branch, where
> it passes; it fails on the 2.5 branch.
> It looks like it has already been fixed.
>
> On Wed, Jun 6, 2018 at 3:47 PM, Emmanuel Marchand 
> <emmanuel.marchand@exensa.com <ma...@exensa.com>> 
> wrote:
>
>     I was wrong on the introduction of the exception. I guess it was
>     added by a fix about IGNITE-8066
>     <https://github.com/apache/ignite/commit/55c50a1c4309246e9cd254ed1f25b41b89f9a959#diff-cacbb66ea1331a1869d9da2bcd039ef1>.
>
>     Regards,
>     ---
>     Emmanuel.
>
>
>     On 05/06/18 11:05, Emmanuel Marchand wrote:
>>
>>     Hi,
>>
>>     I'm testing v2.5 vs v2.4 for persisted dataregion with *walModel
>>     = NONE* and while performance seems better I failed to restart
>>     the cluster after what I think is a proper shutdown (using top
>>     -deactivate then kill -k from visor).
>>
>>     When I try to reactivate the cluster (using top -activate from
>>     visor) I get the following exception on each nodes :
>>
>>     [09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
>>     Accepted incoming communication connection
>>     [locAddr=/192.168.1.1:47100 <http://192.168.1.1:47100>,
>>     rmtAddr=/192.168.1.102:44646 <http://192.168.1.102:44646>]
>>     [09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending
>>     activate request with BaselineTopology null
>>     [09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
>>     Received activate request with BaselineTopology: null
>>     [09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
>>     Started state transition: true
>>     [09:21:37,687][INFO][exchange-worker-#52][time] Started exchange
>>     init [topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1],
>>     crd=true, evt=DISCOVERY_CUSTOM_EVT,
>>     evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
>>     customEvt=ChangeGlobalStateMessage
>>     [id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
>>     reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c,
>>     initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
>>     activate=true, baselineTopology=null,
>>     forceChangeBaselineTopology=false, timestamp=1528183297656],
>>     allowMerge=false]
>>     [09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
>>     Start activation process
>>     [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, client=false,
>>     topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
>>     [09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager]
>>     Resolved page store work directory:
>>     /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
>>     [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
>>     Resolved write ahead log work directory:
>>     /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
>>     [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
>>     Resolved write ahead log archive directory:
>>     /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
>>     [09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
>>     Started write-ahead log manager in NONE mode, persisted data may
>>     be lost in a case of unexpected node failure. Make sure to
>>     deactivate the cluster before shutdown.
>>     [09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started
>>     page memory [memoryAllocated=100.0 MiB, pages=24804,
>>     tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
>>     [09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started
>>     page memory [memoryAllocated=8.0 GiB, pages=2032836,
>>     tableSize=158.1 MiB, checkpointBuffer=2.0 GiB]
>>     [09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started
>>     page memory [memoryAllocated=100.0 MiB, pages=24804,
>>     tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
>>     [09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
>>     Read checkpoint status
>>     [startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin,
>>     endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
>>     [09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
>>     Checking memory state [lastValidPos=FileWALPointer [idx=0,
>>     fileOff=0, len=0], lastMarked=FileWALPointer [idx=0, fileOff=0,
>>     len=0], lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
>>     *[09:21:38,228][SEVERE][exchange-worker-#52][] Critical system
>>     error detected. Will be handled accordingly to configured handler
>>     [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
>>     failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
>>     o.a.i.i.pagemem.wal.StorageException: Restore wal pointer = null,
>>     while status.endPtr = FileWALPointer [idx=0, fileOff=0, len=0].
>>     Can't restore memory - critical part of WAL archive is missing.]]**
>>     **class org.apache.ignite.internal.pagemem.wal.StorageException:
>>     Restore wal pointer = null, while status.endPtr = FileWALPointer
>>     [idx=0, fileOff=0, len=0]. Can't restore memory - critical part
>>     of WAL archive is missing.**
>>     **    at
>>     org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)**
>>     **    at
>>     org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)**
>>     **    at
>>     org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)**
>>     **    at
>>     org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)**
>>     **    at
>>     org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)**
>>     **    at
>>     org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)**
>>     **    at java.lang.Thread.run(Thread.java:748)**
>>     **[09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be
>>     halted immediately due to the failure: [failureCtx=FailureContext
>>     [type=CRITICAL_ERROR, err=class
>>     o.a.i.i.pagemem.wal.StorageException: Restore wal pointer = null,
>>     while status.endPtr = FileWALPointer [idx=0, fileOff=0, len=0].
>>     Can't restore memory - critical part of WAL archive is missing.]]*
>>
>>     This operation ended successfully with version 2.4.
>>
>>     The exception was introduced by IGNITE-8393
>>     <https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>
>>     but the problem may lie deeper, like in the checkpoint process
>>     which store a status.endPtr = FileWALPointer even when walMode =
>>     NONE.
>>
>>     Any idea ?
>>
>>
>>     Regards,
>>     ---
>>     Emmanuel.
>
>
>
>
> -- 
> Best regards,
> Andrey V. Mashenkov


Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Andrey Mashenkov <an...@gmail.com>.
Hi Emmanuel,

Sorry for the late answer.
I found that I had been checking your test against the master branch, where
it passes; it fails on the 2.5 branch.
It looks like it has already been fixed.


-- 
Best regards,
Andrey V. Mashenkov

Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive is missing with walMode=NONE

Posted by Emmanuel Marchand <em...@exensa.com>.
I was wrong about where the exception was introduced. I guess it was added
by a fix for IGNITE-8066
<https://github.com/apache/ignite/commit/55c50a1c4309246e9cd254ed1f25b41b89f9a959#diff-cacbb66ea1331a1869d9da2bcd039ef1>.

Regards,
---
Emmanuel.
