Posted to user@ignite.apache.org by Emmanuel Marchand <em...@exensa.com> on 2018/06/05 09:05:21 UTC
Ignite 2.5 | Can't restore memory - critical part of WAL archive is
missing with walMode=NONE
Hi,
I'm testing v2.5 vs v2.4 for a persisted data region with *walMode = NONE*,
and while performance seems better, I failed to restart the cluster after
what I think was a proper shutdown (using top -deactivate then kill -k
from visor).
When I try to reactivate the cluster (using top -activate from visor) I
get the following exception on each node:
[09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.1:47100,
rmtAddr=/192.168.1.102:44646]
[09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending
activate request with BaselineTopology null
[09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
Received activate request with BaselineTopology: null
[09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
Started state transition: true
[09:21:37,687][INFO][exchange-worker-#52][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1], crd=true,
evt=DISCOVERY_CUSTOM_EVT, evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
customEvt=ChangeGlobalStateMessage
[id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c,
initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60, activate=true,
baselineTopology=null, forceChangeBaselineTopology=false,
timestamp=1528183297656], allowMerge=false]
[09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
Start activation process [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
client=false, topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
[09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager] Resolved
page store work directory:
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
Resolved write ahead log work directory:
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
Resolved write ahead log archive directory:
/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
[09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
Started write-ahead log manager in NONE mode, persisted data may be lost
in a case of unexpected node failure. Make sure to deactivate the
cluster before shutdown.
[09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started page
memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
checkpointBuffer=100.0 MiB]
[09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started page
memory [memoryAllocated=8.0 GiB, pages=2032836, tableSize=158.1 MiB,
checkpointBuffer=2.0 GiB]
[09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started page
memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
checkpointBuffer=100.0 MiB]
[09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
Read checkpoint status
[startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin,
endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
[09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0,
len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0],
lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
[09:21:38,228][SEVERE][exchange-worker-#52][] Critical system error
detected. Will be handled accordingly to configured handler [hnd=class
o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
[type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive
is missing.]]
class org.apache.ignite.internal.pagemem.wal.StorageException: Restore
wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive
is missing.
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)
[09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be halted
immediately due to the failure: [failureCtx=FailureContext
[type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
fileOff=0, len=0]. Can't restore memory - critical part of WAL archive
is missing.]]
This operation completed successfully with version 2.4.
The exception was introduced by IGNITE-8393
<https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>,
but the problem may lie deeper, e.g. in the checkpoint process, which
stores a status.endPtr = FileWALPointer even when walMode = NONE.
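For context, the attached configuration file is not reproduced in this
archive. A minimal Spring XML sketch of the kind of setup being described
(persistence enabled, WAL fully disabled) might look like the following;
the 8 GiB size is taken from the page-memory log lines above, everything
else is an assumption, not the poster's actual config:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- WAL disabled entirely: fastest, but no crash recovery. -->
            <property name="walMode" value="NONE"/>
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                    <!-- 8 GiB region, matching the log above. -->
                    <property name="maxSize" value="#{8L * 1024 * 1024 * 1024}"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>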
Any idea?
Regards,
---
Emmanuel.
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Emmanuel Marchand <em...@exensa.com>.
Hi,
Here is some more information:
  * the regions are as in the provided configuration file; I guess
    there is only one.
  * I have 3 caches, and I update all 3 using one data streamer with a
    custom receiver for each.
  * none of the caches use backups, and atomicity mode is set to ATOMIC.
  * caches are indeed partitioned; keys use affinityKeyMapped and
    implement Comparable<K> and Externalizable.
  * values use indexing.
About your test:
  * do you add enough data to trigger a checkpoint? Maybe try reducing
    setMaxSize?
  * I don't set checkpointPageBufferSize.
  * are you sure the checkpoint folder is not empty before restart?
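(On shrinking setMaxSize: a small region reaches the 'too many dirty
pages' threshold with far less data, so a reproducer checkpoints sooner.
A hypothetical fragment, with an illustrative 256 MiB size:

```xml
<property name="defaultDataRegionConfiguration">
    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="persistenceEnabled" value="true"/>
        <!-- Small region: the dirty-page checkpoint trigger fires quickly. -->
        <property name="maxSize" value="#{256L * 1024 * 1024}"/>
    </bean>
</property>
```
)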
Thanks for your help.
Regards,
---
Emmanuel.
On 05/06/18 17:26, Andrey Mashenkov wrote:
> Hi,
>
> I've made a simple test and still can't reproduce this.
>
> Would you please take a look in case I missed something?
> Is it possible you have more than one region and/or cache?
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,
I've made a simple test and still can't reproduce this.
Would you please take a look in case I missed something?
Is it possible you have more than one region and/or cache?
--
Best regards,
Andrey V. Mashenkov
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Emmanuel Marchand <em...@exensa.com>.
Hi,
I'm not sure I can provide a reproducer, but here is some information:
  * configuration file attached
  * 2 server nodes, 1 client (+ visor)
  * I'm streaming enough data to trigger a checkpoint with reason 'too
    many dirty pages':
      [INFO][db-checkpoint-thread-#69][GridCacheDatabaseSharedManager]
      Checkpoint started
      [checkpointId=225ef67d-2850-499f-860d-f7868f1f73ec,
      startPtr=FileWALPointer [idx=0, fileOff=0, len=0],
      checkpointLockWait=151ms, checkpointLockHoldTime=17ms,
      walCpRecordFsyncDuration=0ms, pages=1508362, reason='too many
      dirty pages']
  * no error occurs
  * deactivate the cluster, then stop the nodes using visor
  * restart the nodes
  * activate the cluster using visor -> crash
Workaround: if I delete (or rename) the checkpoint folder (which is
*not* empty), the activation completes successfully.
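A minimal sketch of that workaround, assuming the default work-directory
layout from the logs above (IGNITE_WORK and NODE_DIR are placeholders to
adapt to your own node folder):

```shell
# Rename (rather than delete) the checkpoint marker folder, so it can be
# restored later. Defaults match the paths in the logs; override via env.
IGNITE_WORK="${IGNITE_WORK:-/usr/share/apache-ignite-fabric-2.5.0-bin/work}"
NODE_DIR="${NODE_DIR:-node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19}"
CP_DIR="$IGNITE_WORK/db/$NODE_DIR/cp"
if [ -d "$CP_DIR" ]; then
    mv "$CP_DIR" "$CP_DIR.bak"
    echo "moved $CP_DIR to $CP_DIR.bak"
fi
```

Run this on every node while the cluster is stopped, then restart and
activate as usual.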
Regards,
---
Emmanuel.
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Andrey Mashenkov <an...@gmail.com>.
Hi,
I can't reproduce the issue.
Is it possible the grid configuration was changed between runs?
Is it possible to share a reproducer?
On Tue, Jun 5, 2018 at 12:05 PM, Emmanuel Marchand <
emmanuel.marchand@exensa.com> wrote:
> Hi,
>
> I'm testing v2.5 vs v2.4 for persisted dataregion with *walModel = NONE*
> and while performance seems better I failed to restart the cluster after
> what I think is a proper shutdown (using top -deactivate then kill -k from
> visor).
>
> When I try to reactivate the cluster (using top -activate from visor) I
> get the following exception on each nodes :
> [09:21:37,592][INFO][grid-nio-worker-tcp-comm-0-#33][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/192.168.1.1:47100,
> rmtAddr=/192.168.1.102:44646]
> [09:21:37,656][INFO][pub-#92][GridClusterStateProcessor] Sending activate
> request with BaselineTopology null
> [09:21:37,659][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
> Received activate request with BaselineTopology: null
> [09:21:37,661][INFO][tcp-disco-msg-worker-#3][GridClusterStateProcessor]
> Started state transition: true
> [09:21:37,687][INFO][exchange-worker-#52][time] Started exchange init
> [topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1], crd=true,
> evt=DISCOVERY_CUSTOM_EVT, evtNode=0f5d38b7-b748-4861-91ef-204ed9343e60,
> customEvt=ChangeGlobalStateMessage [id=c0eeccec361-85ace6cb-d27e-4a0e-9106-ca39e6fcbfdd,
> reqId=5a1cf16e-f610-4b4b-b1eb-76078be38d6c, initiatingNodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
> activate=true, baselineTopology=null, forceChangeBaselineTopology=false,
> timestamp=1528183297656], allowMerge=false]
> [09:21:37,688][INFO][exchange-worker-#52][GridDhtPartitionsExchangeFuture]
> Start activation process [nodeId=0f5d38b7-b748-4861-91ef-204ed9343e60,
> client=false, topVer=AffinityTopologyVersion [topVer=69, minorTopVer=1]]
> [09:21:37,688][INFO][exchange-worker-#52][FilePageStoreManager] Resolved
> page store work directory:
> /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
> [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
> Resolved write ahead log work directory:
> /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
> [09:21:37,689][INFO][exchange-worker-#52][FileWriteAheadLogManager]
> Resolved write ahead log archive directory:
> /usr/share/apache-ignite-fabric-2.5.0-bin/work/db/wal/archive/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19
> [09:21:37,690][WARNING][exchange-worker-#52][FileWriteAheadLogManager]
> Started write-ahead log manager in NONE mode, persisted data may be lost in
> a case of unexpected node failure. Make sure to deactivate the cluster
> before shutdown.
> [09:21:37,701][INFO][exchange-worker-#52][PageMemoryImpl] Started page
> memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
> checkpointBuffer=100.0 MiB]
> [09:21:37,798][INFO][exchange-worker-#52][PageMemoryImpl] Started page
> memory [memoryAllocated=8.0 GiB, pages=2032836, tableSize=158.1 MiB,
> checkpointBuffer=2.0 GiB]
> [09:21:37,800][INFO][exchange-worker-#52][PageMemoryImpl] Started page
> memory [memoryAllocated=100.0 MiB, pages=24804, tableSize=1.9 MiB,
> checkpointBuffer=100.0 MiB]
> [09:21:38,168][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
> Read checkpoint status
> [startMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-START.bin,
> endMarker=/usr/share/apache-ignite-fabric-2.5.0-bin/work/db/node00-bcfb4de5-5fc6-41e9-9ebd-90b873711c19/cp/1528182048551-ea54267c-22c4-4b64-b328-87cc09d3d460-END.bin]
> [09:21:38,169][INFO][exchange-worker-#52][GridCacheDatabaseSharedManager]
> Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0,
> len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0],
> lastCheckpointId=ea54267c-22c4-4b64-b328-87cc09d3d460]
> [09:21:38,228][SEVERE][exchange-worker-#52][] Critical system error
> detected. Will be handled accordingly to configured handler [hnd=class
> o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
> Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
> fileOff=0, len=0]. Can't restore memory - critical part of WAL archive is
> missing.]]
> class org.apache.ignite.internal.pagemem.wal.StorageException: Restore
> wal pointer = null, while status.endPtr = FileWALPointer [idx=0, fileOff=0,
> len=0]. Can't restore memory - critical part of WAL archive is missing.
>     at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:759)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onClusterStateChangeRequest(GridDhtPartitionsExchangeFuture.java:894)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:641)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>     at java.lang.Thread.run(Thread.java:748)
> [09:21:38,229][SEVERE][exchange-worker-#52][] JVM will be halted
> immediately due to the failure: [failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException:
> Restore wal pointer = null, while status.endPtr = FileWALPointer [idx=0,
> fileOff=0, len=0]. Can't restore memory - critical part of WAL archive is
> missing.]]
>
> This operation ended successfully with version 2.4.
>
> The exception was introduced by IGNITE-8393
> <https://github.com/apache/ignite/commit/5ec1796408d45296cbb26bb93353e23e18b039c8>,
> but the problem may lie deeper, e.g. in the checkpoint process, which stores
> a status.endPtr = FileWALPointer even when walMode = NONE.
>
> Any ideas?
>
>
> Regards,
> ---
> Emmanuel.
>
--
Best regards,
Andrey V. Mashenkov
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Emmanuel Marchand <em...@exensa.com>.
Hi Andrey,
Thanks for your time.
Regards,
---
Emmanuel.
On 08/06/18 16:42, Andrey Mashenkov wrote:
> Hi Emmanuel,
>
> Sorry for the late answer.
> I ran your test against the master branch and it passes,
> but it fails on the 2.5 branch.
> It looks like this has already been fixed in master.
>
> On Wed, Jun 6, 2018 at 3:47 PM, Emmanuel Marchand
> <emmanuel.marchand@exensa.com <ma...@exensa.com>>
> wrote:
>
> I was wrong about where the exception was introduced. I guess it was
> added by the fix for IGNITE-8066
> <https://github.com/apache/ignite/commit/55c50a1c4309246e9cd254ed1f25b41b89f9a959#diff-cacbb66ea1331a1869d9da2bcd039ef1>.
>
> Regards,
> ---
> Emmanuel.
>
>
> --
> Best regards,
> Andrey V. Mashenkov
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Andrey Mashenkov <an...@gmail.com>.
Hi Emmanuel,
Sorry for the late answer.
I ran your test against the master branch and it passes, but it
fails on the 2.5 branch.
It looks like this has already been fixed in master.
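For context, the guard that halts the node lives in checkpoint recovery: it compares the WAL pointer reached by replay against the end pointer recorded in the checkpoint status. The sketch below models that check from the stack trace and error message in this thread; it is an illustration, not Ignite's actual readCheckpointAndRestoreMemory source.

```java
public class WalRestoreCheck {

    /** Minimal stand-in for Ignite's FileWALPointer; fields mirror the log output. */
    static final class FileWALPointer {
        final long idx;
        final int fileOff;
        final int len;

        FileWALPointer(long idx, int fileOff, int len) {
            this.idx = idx;
            this.fileOff = fileOff;
            this.len = len;
        }

        @Override public String toString() {
            return "FileWALPointer [idx=" + idx + ", fileOff=" + fileOff + ", len=" + len + "]";
        }
    }

    /**
     * Recovery replays the WAL and compares the pointer it reached against the
     * pointer recorded in the checkpoint END marker. If the marker carries a
     * pointer but replay produced none, the node is considered unrecoverable.
     */
    static void checkRestored(FileWALPointer restored, FileWALPointer statusEndPtr) {
        if (restored == null && statusEndPtr != null)
            throw new IllegalStateException("Restore wal pointer = null, while status.endPtr = "
                + statusEndPtr + ". Can't restore memory - critical part of WAL archive is missing.");
    }

    public static void main(String[] args) {
        // With walMode=NONE nothing is ever written to the WAL, so replay yields
        // null while the checkpoint status still carries a zero end pointer.
        try {
            checkRestored(null, new FileWALPointer(0, 0, 0));
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

As the logs above show, the checkpoint END marker exists even in NONE mode, which is why this check fires on every restart of such a cluster.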
On Wed, Jun 6, 2018 at 3:47 PM, Emmanuel Marchand <
emmanuel.marchand@exensa.com> wrote:
> I was wrong about where the exception was introduced. I guess it was added by
> the fix for IGNITE-8066
> <https://github.com/apache/ignite/commit/55c50a1c4309246e9cd254ed1f25b41b89f9a959#diff-cacbb66ea1331a1869d9da2bcd039ef1>.
> Regards,
> ---
> Emmanuel.
>
>
--
Best regards,
Andrey V. Mashenkov
Re: Ignite 2.5 | Can't restore memory - critical part of WAL archive
is missing with walMode=NONE
Posted by Emmanuel Marchand <em...@exensa.com>.
I was wrong about where the exception was introduced. I guess it was added
by the fix for IGNITE-8066
<https://github.com/apache/ignite/commit/55c50a1c4309246e9cd254ed1f25b41b89f9a959#diff-cacbb66ea1331a1869d9da2bcd039ef1>.
Regards,
---
Emmanuel.