Posted to user@ignite.apache.org by Raymond Wilson <ra...@trimble.com> on 2020/05/12 09:23:17 UTC

Re: Out of memory error in data region with persistence enabled

Well, it appears I was wrong. It reappeared. :(

I thought I had sent a reply to this thread but cannot find it, so I am
resending it now.

Attached is a C# reproducer that throws Ignite out of memory errors in the
situation I outlined above, where cache operations run against a small
cache with persistence enabled.

Let me know if you're able to reproduce it on your local systems.

Thanks,
Raymond.


On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <ra...@trimble.com>
wrote:

> It's possible this is user (me) error.
>
> I discovered I had set the cache size to be 64Mb in the server, but 65Mb
> (typo!) in the client. Making these two values consistent appeared to
> prevent the error.
>
> Raymond.
>
>
> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <ra...@trimble.com>
> wrote:
>
>> I'm using Ignite v2.7.5 with C# client.
>>
>> I have an error where Ignite throws an out of memory exception, like this:
>>
>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM will be
>> halted immediately due to the failure: [failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>   ^-- Increase maximum off-heap memory size
>> (DataRegionConfiguration.maxSize)
>>   ^-- Enable Ignite persistence
>> (DataRegionConfiguration.persistenceEnabled)
>>   ^-- Enable eviction or expiration policies]]
>>
>> I don't have an eviction policy set (is this even a valid recommendation
>> when using persistence?)
>>
>> Increasing the off heap memory size for the data region does prevent this
>> error, but I want to minimise the in-memory size for this buffer as it is
>> essentially just a queue.
>>
>> The suggestion of enabling data persistence is strange, as this data
>> region already has persistence enabled.
>>
>> My assumption is that Ignite manages the memory in this cache by saving
>> and loading values as required.
>>
>> The test workflow in this failure is one where ~14,500 objects totalling
>> ~440 MB in size (average object size ~30 KB) are added to the cache, and
>> are then drained by a processor using a continuous query. Elements are
>> removed from the cache as the processor completes them.
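
A minimal Ignite.NET sketch of the drain pattern described here -- a
continuous query whose listener processes each new entry and then removes
it from the cache. The key/value types and wiring are illustrative
assumptions, not taken from the actual code:

    using System;
    using System.Collections.Generic;
    using Apache.Ignite.Core.Cache;
    using Apache.Ignite.Core.Cache.Event;

    // Processes each queued item, then removes it, emptying the queue cache.
    class DrainListener : ICacheEntryEventListener<Guid, byte[]>
    {
        private readonly ICache<Guid, byte[]> _cache;

        public DrainListener(ICache<Guid, byte[]> cache) { _cache = cache; }

        public void OnEvent(IEnumerable<ICacheEntryEvent<Guid, byte[]>> events)
        {
            foreach (var e in events)
            {
                if (e.EventType != CacheEntryEventType.Created) continue;
                Process(e.Value);      // application-specific work
                _cache.Remove(e.Key);  // drop the entry once it is processed
            }
        }

        private static void Process(byte[] payload) { /* ... */ }
    }

    // Wiring (new entries arrive through the listener):
    // var cache = ignite.GetOrCreateCache<Guid, byte[]>("TAGFileBufferQueue");
    // using (cache.QueryContinuous(new ContinuousQuery<Guid, byte[]>(new DrainListener(cache))))
    // { /* keep the handle alive while draining */ }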
>>
>> Is this kind of out of memory error supposed to be possible when using
>> persistent data regions?
>>
>> Thanks,
>> Raymond.
>>
>>
>>

-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Out of memory error in data region with persistence enabled

Posted by Alex Plehanov <pl...@gmail.com>.
> Would you recommend defining only a single persistent data region in the
general case?
> Would you recommend not using larger page sizes to permit more headroom
for checkpointing?

I'm not sure about this; it should be tested. In your case I think it's
better to just add memory to the second data region: 64 MB seems very low.

> It seems adding more memory will help as you suggest. However, the
relation of 256*caches*partitions*CPUs means that adding more caches to
support more functionality, or scaling the infrastructure, risks crossing a
boundary where checkpointing is no longer 'safe' given the memory supplied
to the data region under the existing workloads.

Sorry, I was wrong about "256*caches*partitions*CPUs" metadata pages. You
can have that many free-list tail pages in the worst case, but the metadata
holds only references to these pages, and there will be one or more
metadata pages per cache per partition.
Also, caches can be grouped into "cache groups", in which case they can
share some data structures (including free-lists).
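
A minimal Ignite.NET sketch of the grouping Alex mentions, via
CacheConfiguration.GroupName (the cache and group names are illustrative):

    using Apache.Ignite.Core.Cache.Configuration;

    // Caches in the same group share partition and free-list structures,
    // which reduces the per-cache metadata overhead described above.
    var queueCfg = new CacheConfiguration("IngestQueue") { GroupName = "ingest" };
    var indexCfg = new CacheConfiguration("IngestIndex") { GroupName = "ingest" };
    // ignite.GetOrCreateCache<Guid, byte[]>(queueCfg);
    // ignite.GetOrCreateCache<Guid, byte[]>(indexCfg);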

> If a full checkpoint cannot be completed due to free space restrictions,
can a series of partial checkpoints be executed?

Currently, a checkpoint guarantees that all pages from memory are stored to
disk in a consistent state (a snapshot of memory at some point in time), so
a checkpoint can't be split.
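
Although a single checkpoint can't be split, how often checkpoints run and
how much memory they can use is tunable. A hedged Ignite.NET sketch; the
values are illustrative only:

    using System;
    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Configuration;

    var cfg = new IgniteConfiguration
    {
        DataStorageConfiguration = new DataStorageConfiguration
        {
            // Checkpoint more often, so fewer dirty pages accumulate per cycle.
            CheckpointFrequency = TimeSpan.FromSeconds(60),
            DefaultDataRegionConfiguration = new DataRegionConfiguration
            {
                Name = "Default",
                PersistenceEnabled = true,
                // Extra buffer available while checkpoint pages are written out.
                CheckpointPageBufferSize = 128L * 1024 * 1024,
            },
        },
    };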

On Tue, Jun 16, 2020 at 17:03, Raymond Wilson <ra...@trimble.com> wrote:

> Hi Alex,
>
> Thanks for providing the additional detail on the checkpointing memory
> requirements.
>
> As you say, this is difficult to calculate statically, but it seems to
> indicate that small data regions are bad (i.e., susceptible to this issue),
> so should be avoided.
>
> Our current system really only has two data regions, one that supports
> general long term data storage and access, and another one that supports
> ingest. The reason for a data region supporting ingest is to act as a safe
> buffer for inbound information until ingest processors process it, without
> impacting mainline operations against the other data region. It sounds like
> having multiple data regions may be an inefficient use of memory due to the
> need to have sufficient free space to support checkpointing operations.
> Would you recommend defining only a single persistent data region in the
> general case?
>
> It is interesting that the number of pages defined by page size has such a
> large effect on how well checkpointing works in stressful workloads. Would
> you recommend not using larger page sizes to permit more headroom for
> checkpointing?
>
> It seems adding more memory will help as you suggest. However, the relation
> of 256*caches*partitions*CPUs means that adding more caches to support more
> functionality, or scaling the infrastructure, risks crossing a boundary
> where checkpointing is no longer 'safe' given the memory supplied to the
> data region under the existing workloads.
>
> Finally, the aspect that did surprise me is how final this failure mode is
> in that the JVM logs the issue and then quits, which would give the support
> team nightmares! Is it possible to have a more graceful degradation of this
> functionality; i.e., if a full checkpoint cannot be completed due to free
> space restrictions, can a series of partial checkpoints be executed?
>
> I look forward to your suggestions.
>
> Thanks,
> Raymond.
>
>
>
> On Tue, Jun 16, 2020 at 10:22 PM Alex Plehanov <pl...@gmail.com>
> wrote:
>
>> Raymond,
>>
>> When a checkpoint is triggered you need to have some amount of free page
>> slots in offheap to save metadata (for example free-lists metadata,
>> partition counter gaps, etc). The number of required pages depends on the
>> count of caches, count of partitions, workload, and count of CPUs. In the
>> worst case, you will need up to 256*caches*partitions*CPUs pages just to
>> store free-list bucket metadata. This number of pages can't be calculated
>> statically, so the exact amount can't be reserved in advance. Currently,
>> 1/4 of offheap memory is reserved for this purpose (a checkpoint is
>> triggered when the amount of dirty pages reaches 3/4 of the total number
>> of pages), but sometimes it's not enough.
>>
>> In your case, a 64 MB data region is allocated. The page size is 16 KB, so
>> you have about 4000 pages in total (the real page size in offheap is a
>> little bigger than the configured page size). The checkpoint is triggered
>> by the "too many dirty pages" event, so 3/4 of the pages are already dirty
>> and only about 1000 pages are left to store metadata, which is too few. If
>> the page size is 4 KB, about 4000 clean pages remain, so your reproducer
>> can pass in some circumstances.
>>
>> Increase data region size to solve the problem.
>>
>>
>> On Tue, Jun 16, 2020 at 05:39, Raymond Wilson <ra...@trimble.com> wrote:
>>
>>> I have spent some more time on the reproducer. It is now very simple and
>>> reliably reproduces the issue with a plain loop adding slowly growing
>>> entries into a cache, with no continuous queries or filters. I have
>>> attached the source files and the log I obtain when running it.
>>>
>>> Running from a clean slate (no existing persistent data) this reproducer
>>> exhibits the out of memory error when adding an element 4150 bytes in size.
>>>
>>> I did find this SO article (
>>> https://stackoverflow.com/questions/55937768/ignite-report-igniteoutofmemoryexception-out-of-memory-in-data-region)
>>> that describes the same problem. The solution offered was to increase the
>>> empty page pool size so that it is larger than the biggest element being
>>> added. In the reproducer the empty page pool should already be bigger than
>>> the largest element added up to the point of failure, where 4150 bytes is
>>> the largest size being added. I tried increasing it to 200; it made no
>>> difference.
>>>
>>> The reproducer is using a page size of 16384 bytes.
>>>
>>> If I set the page size to the default 4096 bytes, this reproducer does
>>> not show the error up to the size limit of 19999 bytes that the reproducer
>>> tests. If I set the page size to 8192 bytes, the reproducer reliably fails
>>> with the error at the item of 6941 bytes.
>>>
>>> This feels like a bug in handling non-default page sizes. Would you
>>> recommend switching from 16384 bytes to 4096 for our page size? The reason
>>> I opted for the larger size is that we may have elements ranging in size
>>> from hundreds of bytes to 100 KB, and sometimes larger.
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>> On Thu, Jun 11, 2020 at 4:25 PM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> Just a correction to the context of the data region running out of
>>>> memory: this one does not have a queue of items or a continuous query
>>>> operating on a cache within it.
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>> On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <
>>>> raymond_wilson@trimble.com> wrote:
>>>>
>>>>> Pavel,
>>>>>
>>>>> I have run into a different instance of an out of memory error in a
>>>>> data region, in a different context from the one I wrote the reproducer
>>>>> for. In this case, there is an activity which queues items for
>>>>> processing at a point in the future and which does use a continuous
>>>>> query; however, there is also significant vanilla put/get activity
>>>>> against a range of other caches.
>>>>>
>>>>> This data region was permitted to grow to 1 GB and has persistence
>>>>> enabled. We are now using Ignite 2.8.
>>>>>
>>>>> I would like to understand if this is a possible failure mode given
>>>>> that the data region has persistence enabled. The underlying cause
>>>>> appears to be 'Failed to find a page for eviction'. Should this be
>>>>> expected on data regions with persistence?
>>>>>
>>>>> I have included the error below.
>>>>>
>>>>> This is the initial error reported by Ignite:
>>>>>
>>>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM
>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>>>> failedToPrepare=5417]
>>>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>>>   ^-- Increase maximum off-heap memory size
>>>>> (DataRegionConfiguration.maxSize)
>>>>>   ^-- Enable Ignite persistence
>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>
>>>>> Following this error is a lock dump, where this is the only thread
>>>>> with a lock: (I am assuming the structureId member with the value
>>>>> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
>>>>> lock against an item in the local node.)
>>>>>
>>>>> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
>>>>> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
>>>>> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
>>>>> time=(1591836815071, 2020-06-11 12:53:35.071)
>>>>> L=1 -> Write lock pageId=284060547022916,
>>>>> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
>>>>> partId=602, pageIdx=68, flags=00000001]
>>>>>
>>>>> Following the lock dump is this final error before the Ignite node
>>>>> stops:
>>>>>
>>>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM
>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>>>> failedToPrepare=5417]
>>>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>>>   ^-- Increase maximum off-heap memory size
>>>>> (DataRegionConfiguration.maxSize)
>>>>>   ^-- Enable Ignite persistence
>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <
>>>>> raymond_wilson@trimble.com> wrote:
>>>>>
>>>>>> Hi Pavel,
>>>>>>
>>>>>> The reproducer is not the actual use case, which is too big to use;
>>>>>> it's a small example using the same mechanisms. I have not used a data
>>>>>> streamer before, I'll read up on it.
>>>>>>
>>>>>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for
>>>>>> the reproducer).
>>>>>>
>>>>>> Thanks,
>>>>>> Raymond.
>>>>>>
>>>>>>
>>>>>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Raymond,
>>>>>>>
>>>>>>> First, I could not reproduce the issue. The attached program runs to
>>>>>>> completion on my machine.
>>>>>>>
>>>>>>> Second, I see a few issues with the attached code:
>>>>>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>>>>>> - ICacheEntryEventFilter is used to remove cache entries, and is
>>>>>>> called twice - on add and on remove
>>>>>>>
>>>>>>> My recommendation is to use a "classic" combination of Data
>>>>>>> Streamer, Continuous Query, and Expiry Policy.
>>>>>>> Set expiry policy to a few seconds, and you won't keep much data in
>>>>>>> memory. Ignite will handle the removal for you.
>>>>>>> Let me know if I should prepare an example.
>>>>>>>
>>>>>>> Also, it is not clear why persistence is needed for such a "buffer"
>>>>>>> cache - items are removed almost immediately, so it would be much more
>>>>>>> efficient to disable persistence.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Pavel
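
A minimal Ignite.NET sketch of the combination Pavel describes above --
data streamer for ingest, a cache-level expiry policy, and a continuous
query for processing. The cache name, key/value types, and TTL are
illustrative assumptions, not taken from the thread:

    using System;
    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache.Configuration;
    using Apache.Ignite.Core.Cache.Expiry;
    using Apache.Ignite.Core.Cache.Query.Continuous;
    using Apache.Ignite.Core.Common;

    IIgnite ignite = Ignition.Start();

    // Entries expire a few seconds after creation, so the buffer stays small.
    var cache = ignite.GetOrCreateCache<Guid, byte[]>(
        new CacheConfiguration("Buffer") { ExpiryPolicyFactory = new ShortTtl() });

    // Continuous query: process each new entry as it arrives (DrainListener
    // as sketched earlier in the thread).
    using (cache.QueryContinuous(new ContinuousQuery<Guid, byte[]>(new DrainListener(cache))))
    // Data streamer: batched, high-throughput ingest instead of PutIfAbsent.
    using (var streamer = ignite.GetDataStreamer<Guid, byte[]>(cache.Name))
    {
        for (var i = 0; i < 14500; i++)
            streamer.AddData(Guid.NewGuid(), new byte[30 * 1024]);
    }

    class ShortTtl : IFactory<IExpiryPolicy>
    {
        // Create-TTL of 5 seconds; no update/access TTLs.
        public IExpiryPolicy CreateInstance() =>
            new ExpiryPolicy(TimeSpan.FromSeconds(5), null, null);
    }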
>>>>>>>
>>>>>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>
>>>>>>>> Well, it appears I was wrong. It reappeared. :(
>>>>>>>>
>>>>>>>> I thought I had sent a reply to this thread but cannot find it, so
>>>>>>>> I am resending it now.
>>>>>>>>
>>>>>>>> Attached is a C# reproducer that throws Ignite out of memory errors
>>>>>>>> in the situation I outlined above, where cache operations run against
>>>>>>>> a small cache with persistence enabled.
>>>>>>>>
>>>>>>>> Let me know if you're able to reproduce it on your local systems.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Raymond.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>>
>>>>>>>>> It's possible this is user (me) error.
>>>>>>>>>
>>>>>>>>> I discovered I had set the cache size to be 64Mb in the server,
>>>>>>>>> but 65Mb (typo!) in the client. Making these two values consistent appeared
>>>>>>>>> to prevent the error.
>>>>>>>>>
>>>>>>>>> Raymond.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>>>>>
>>>>>>>>>> I have an error where Ignite throws an out of memory exception,
>>>>>>>>>> like this:
>>>>>>>>>>
>>>>>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>>>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>>>>>   ^-- Enable Ignite persistence
>>>>>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>>>>>
>>>>>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>>>>>> recommendation when using persistence?)
>>>>>>>>>>
>>>>>>>>>> Increasing the off heap memory size for the data region does
>>>>>>>>>> prevent this error, but I want to minimise the in-memory size for this
>>>>>>>>>> buffer as it is essentially just a queue.
>>>>>>>>>>
>>>>>>>>>> The suggestion of enabling data persistence is strange, as this
>>>>>>>>>> data region already has persistence enabled.
>>>>>>>>>>
>>>>>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>>>>>> saving and loading values as required.
>>>>>>>>>>
>>>>>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>>>>>> totalling ~440 MB in size (average object size ~30 KB) are added to
>>>>>>>>>> the cache, and are then drained by a processor using a continuous
>>>>>>>>>> query. Elements are removed from the cache as the processor
>>>>>>>>>> completes them.
>>>>>>>>>>
>>>>>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>>>>>> using persistent data regions?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Raymond.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>

Re: Out of memory error in data region with persistence enabled

Posted by Raymond Wilson <ra...@trimble.com>.
Hi Alex,

Thanks for providing the additional detail on the checkpointing memory
requirements.

As you say, this is difficult to calculate statically, but it seems to
indicate that small data regions are bad (i.e., susceptible to this issue),
so should be avoided.

Our current system really only has two data regions, one that supports
general long term data storage and access, and another one that supports
ingest. The reason for a data region supporting ingest is to act as a safe
buffer for inbound information until ingest processors process it, without
impacting mainline operations against the other data region. It sounds like
having multiple data regions may be an inefficient use of memory due to the
need to have sufficient free space to support checkpointing operations.
Would you recommend defining only a single persistent data region in the
general case?

It is interesting that the number of pages defined by page size has such a
large effect on how well checkpointing works in stressful workloads. Would
you recommend not using larger page sizes to permit more headroom for
checkpointing?

It seems adding more memory will help as you suggest. However, the relation
of 256*caches*partitions*CPUs means that adding more caches to support more
functionality, or scaling the infrastructure, risks crossing a boundary
where checkpointing is no longer 'safe' given the memory supplied to the
data region under the existing workloads.

Finally, the aspect that did surprise me is how final this failure mode is
in that the JVM logs the issue and then quits, which would give the support
team nightmares! Is it possible to have a more graceful degradation of this
functionality; i.e., if a full checkpoint cannot be completed due to free
space restrictions, can a series of partial checkpoints be executed?
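
One mitigation, assuming your Ignite.NET version exposes it (the failure
handler became configurable around 2.8): replace the default halt behaviour
so the hosting process can log, alert, and restart the node. A hedged
sketch, not a claim that the node remains usable after such a failure:

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Failure;

    var cfg = new IgniteConfiguration
    {
        // Default is StopNodeOrHaltFailureHandler, which may halt the JVM.
        // StopNodeFailureHandler stops the Ignite node but keeps the
        // hosting process alive.
        FailureHandler = new StopNodeFailureHandler(),
    };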

I look forward to your suggestions.

Thanks,
Raymond.



On Tue, Jun 16, 2020 at 10:22 PM Alex Plehanov <pl...@gmail.com>
wrote:

> Raymond,
>
> When a checkpoint is triggered you need to have some amount of free page
> slots in offheap to save metadata (for example free-lists metadata,
> partition counter gaps, etc). The number of required pages depends on the
> count of caches, count of partitions, workload, and count of CPUs. In the
> worst case, you will need up to 256*caches*partitions*CPUs pages just to
> store free-list bucket metadata. This number of pages can't be calculated
> statically, so the exact amount can't be reserved in advance. Currently,
> 1/4 of offheap memory is reserved for this purpose (a checkpoint is
> triggered when the amount of dirty pages reaches 3/4 of the total number
> of pages), but sometimes it's not enough.
>
> In your case, a 64 MB data region is allocated. The page size is 16 KB, so
> you have about 4000 pages in total (the real page size in offheap is a
> little bigger than the configured page size). The checkpoint is triggered
> by the "too many dirty pages" event, so 3/4 of the pages are already dirty
> and only about 1000 pages are left to store metadata, which is too few. If
> the page size is 4 KB, about 4000 clean pages remain, so your reproducer
> can pass in some circumstances.
>
> Increase data region size to solve the problem.
>
>
> On Tue, Jun 16, 2020 at 05:39, Raymond Wilson <ra...@trimble.com> wrote:
>
>> I have spent some more time on the reproducer. It is now very simple and
>> reliably reproduces the issue with a plain loop adding slowly growing
>> entries into a cache, with no continuous queries or filters. I have
>> attached the source files and the log I obtain when running it.
>>
>> Running from a clean slate (no existing persistent data) this reproducer
>> exhibits the out of memory error when adding an element 4150 bytes in size.
>>
>> I did find this SO article (
>> https://stackoverflow.com/questions/55937768/ignite-report-igniteoutofmemoryexception-out-of-memory-in-data-region)
>> that describes the same problem. The solution offered was to increase the
>> empty page pool size so that it is larger than the biggest element being
>> added. In the reproducer the empty page pool should already be bigger than
>> the largest element added up to the point of failure, where 4150 bytes is
>> the largest size being added. I tried increasing it to 200; it made no
>> difference.
>>
>> The reproducer is using a page size of 16384 bytes.
>>
>> If I set the page size to the default 4096 bytes, this reproducer does
>> not show the error up to the size limit of 19999 bytes that the reproducer
>> tests. If I set the page size to 8192 bytes, the reproducer reliably fails
>> with the error at the item of 6941 bytes.
>>
>> This feels like a bug in handling non-default page sizes. Would you
>> recommend switching from 16384 bytes to 4096 for our page size? The reason
>> I opted for the larger size is that we may have elements ranging in size
>> from hundreds of bytes to 100 KB, and sometimes larger.
>>
>> Thanks,
>> Raymond.
>>
>>
>> On Thu, Jun 11, 2020 at 4:25 PM Raymond Wilson <
>> raymond_wilson@trimble.com> wrote:
>>
>>> Just a correction to the context of the data region running out of
>>> memory: this one does not have a queue of items or a continuous query
>>> operating on a cache within it.
>>>
>>> Thanks,
>>> Raymond.
>>>
>>> On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> Pavel,
>>>>
>>>> I have run into a different instance of an out of memory error in a
>>>> data region, in a different context from the one I wrote the reproducer
>>>> for. In this case, there is an activity which queues items for processing
>>>> at a point in the future and which does use a continuous query; however,
>>>> there is also significant vanilla put/get activity against a range of
>>>> other caches.
>>>>
>>>> This data region was permitted to grow to 1 GB and has persistence
>>>> enabled. We are now using Ignite 2.8.
>>>>
>>>> I would like to understand if this is a possible failure mode given
>>>> that the data region has persistence enabled. The underlying cause
>>>> appears to be 'Failed to find a page for eviction'. Should this be
>>>> expected on data regions with persistence?
>>>>
>>>> I have included the error below.
>>>>
>>>> This is the initial error reported by Ignite:
>>>>
>>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>>> failedToPrepare=5417]
>>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>>   ^-- Increase maximum off-heap memory size
>>>> (DataRegionConfiguration.maxSize)
>>>>   ^-- Enable Ignite persistence
>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>   ^-- Enable eviction or expiration policies]]
>>>>
>>>> Following this error is a lock dump, where this is the only thread
>>>> with a lock: (I am assuming the structureId member with the value
>>>> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
>>>> lock against an item in the local node.)
>>>>
>>>> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
>>>> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
>>>> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
>>>> time=(1591836815071, 2020-06-11 12:53:35.071)
>>>> L=1 -> Write lock pageId=284060547022916,
>>>> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
>>>> partId=602, pageIdx=68, flags=00000001]
>>>>
>>>> Following the lock dump is this final error before the Ignite node
>>>> stops:
>>>>
>>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>>> failedToPrepare=5417]
>>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>>   ^-- Increase maximum off-heap memory size
>>>> (DataRegionConfiguration.maxSize)
>>>>   ^-- Enable Ignite persistence
>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>   ^-- Enable eviction or expiration policies]]
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <
>>>> raymond_wilson@trimble.com> wrote:
>>>>
>>>>> Hi Pavel,
>>>>>
>>>>> The reproducer is not the actual use case, which is too big to use;
>>>>> it's a small example using the same mechanisms. I have not used a data
>>>>> streamer before, I'll read up on it.
>>>>>
>>>>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for
>>>>> the reproducer).
>>>>>
>>>>> Thanks,
>>>>> Raymond.
>>>>>
>>>>>
>>>>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Raymond,
>>>>>>
>>>>>> First, I could not reproduce the issue. The attached program runs to
>>>>>> completion on my machine.
>>>>>>
>>>>>> Second, I see a few issues with the attached code:
>>>>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>>>>> - ICacheEntryEventFilter is used to remove cache entries, and is
>>>>>> called twice - on add and on remove
>>>>>>
>>>>>> My recommendation is to use a "classic" combination of Data Streamer,
>>>>>> Continuous Query, and Expiry Policy.
>>>>>> Set expiry policy to a few seconds, and you won't keep much data in
>>>>>> memory. Ignite will handle the removal for you.
>>>>>> Let me know if I should prepare an example.
>>>>>>
>>>>>> Also, it is not clear why persistence is needed for such a "buffer"
>>>>>> cache - items are removed almost immediately, so it would be much more
>>>>>> efficient to disable persistence.
>>>>>>
>>>>>> Thanks,
>>>>>> Pavel
>>>>>>
>>>>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>
>>>>>>> Well, it appears I was wrong. It reappeared. :(
>>>>>>>
>>>>>>> I thought I had sent a reply to this thread but cannot find it, so I
>>>>>>> am resending it now.
>>>>>>>
>>>>>>> Attached is a C# reproducer that throws Ignite out of memory errors
>>>>>>> in the situation I outlined above, where cache operations run against
>>>>>>> a small cache with persistence enabled.
>>>>>>>
>>>>>>> Let me know if you're able to reproduce it on your local systems.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Raymond.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>
>>>>>>>> It's possible this is user (me) error.
>>>>>>>>
>>>>>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>>>>>> 65Mb (typo!) in the client. Making these two values consistent appeared to
>>>>>>>> prevent the error.
>>>>>>>>
>>>>>>>> Raymond.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>>
>>>>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>>>>
>>>>>>>>> I have an error where Ignite throws an out of memory exception,
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>>>>   ^-- Enable Ignite persistence
>>>>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>>>>
>>>>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>>>>> recommendation when using persistence?)
>>>>>>>>>
>>>>>>>>> Increasing the off heap memory size for the data region does
>>>>>>>>> prevent this error, but I want to minimise the in-memory size for this
>>>>>>>>> buffer as it is essentially just a queue.
>>>>>>>>>
>>>>>>>>> The suggestion of enabling data persistence is strange, as this
>>>>>>>>> data region already has persistence enabled.
>>>>>>>>>
>>>>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>>>>> saving and loading values as required.
>>>>>>>>>
>>>>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>>>>> totalling ~440 MB in size (average object size ~30 KB) are added to
>>>>>>>>> the cache, and are then drained by a processor using a continuous
>>>>>>>>> query. Elements are removed from the cache as the processor
>>>>>>>>> completes them.
>>>>>>>>>
>>>>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>>>>> using persistent data regions?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Raymond.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>


Re: Out of memory error in data region with persistence enabled

Posted by Alex Plehanov <pl...@gmail.com>.
Raymond,

When a checkpoint is triggered you need to have some amount of free page
slots in offheap to save metadata (for example free-lists metadata,
partition counter gaps, etc). The number of required pages depends on the
count of caches, count of partitions, workload, and count of CPUs. In the
worst case, you will need up to 256*caches*partitions*CPUs pages just to
store free-list bucket metadata. This number of pages can't be calculated
statically, so the exact amount can't be reserved in advance. Currently,
1/4 of offheap memory is reserved for this purpose (a checkpoint is
triggered when the amount of dirty pages reaches 3/4 of the total number of
pages), but sometimes it's not enough.

In your case, a 64 MB data region is allocated. The page size is 16 KB, so
you have about 4000 pages in total (the real page size in offheap is a
little bigger than the configured page size). The checkpoint is triggered
by the "too many dirty pages" event, so 3/4 of the pages are already dirty
and only about 1000 pages are left to store metadata, which is too few. If
the page size is 4 KB, about 4000 clean pages remain, so your reproducer
can pass in some circumstances.
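
(As a worked example of that arithmetic: 64 MiB / 16 KiB = 4096 pages in
total; the checkpoint fires once 3/4 of them, about 3072, are dirty,
leaving only about 1024 page slots for checkpoint metadata. With 4 KiB
pages the same region holds 16384 pages, so the clean quarter is about
4096 pages.)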

Increase data region size to solve the problem.
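
A hedged Ignite.NET sketch of that fix, using the DataRegionConfiguration
properties named in the error message (the sizes here are illustrative):

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Configuration;

    var cfg = new IgniteConfiguration
    {
        DataStorageConfiguration = new DataStorageConfiguration
        {
            PageSize = 4096,  // smaller pages also mean many more page slots
            DataRegionConfigurations = new[]
            {
                new DataRegionConfiguration
                {
                    Name = "TAGFileBufferQueue",
                    PersistenceEnabled = true,
                    InitialSize = 128L * 1024 * 1024,
                    MaxSize = 256L * 1024 * 1024,  // was 64 MB in the failing setup
                },
            },
        },
    };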


On Tue, Jun 16, 2020 at 05:39, Raymond Wilson <ra...@trimble.com> wrote:

> I have spent some more time on the reproducer. It is now very simple and
> reliably reproduces the issue with a plain loop adding slowly growing
> entries into a cache, with no continuous queries or filters. I have
> attached the source files and the log I obtain when running it.
>
> Running from a clean slate (no existing persistent data) this reproducer
> exhibits the out of memory error when adding an element 4150 bytes in size.
>
> I did find this SO article (
> https://stackoverflow.com/questions/55937768/ignite-report-igniteoutofmemoryexception-out-of-memory-in-data-region)
> that describes the same problem. The solution offered was to increase the
> empty page pool size so that it is larger than the biggest element being
> added. In the reproducer the empty page pool should already be bigger than
> the largest element added up to the point of failure, where 4150 bytes is
> the largest size being added. I tried increasing it to 200; it made no
> difference.
>
> The reproducer is using a page size of 16384 bytes.
>
> If I set the page size to the default 4096 bytes, this reproducer does
> not show the error up to the size limit of 19999 bytes that the reproducer
> tests. If I set the page size to 8192 bytes, the reproducer reliably fails
> with the error at the item of 6941 bytes.
>
> This feels like a bug in handling non-default page sizes. Would you
> recommend switching from 16384 bytes to 4096 for our page size? The reason
> I opted for the larger size is that we may have elements ranging in size
> from hundreds of bytes to 100 KB, and sometimes larger.
>
> Thanks,
> Raymond.
>
>
> On Thu, Jun 11, 2020 at 4:25 PM Raymond Wilson <ra...@trimble.com>
> wrote:
>
>> Just a correction to the context of the data region running out of
>> memory: this one does not have a queue of items or a continuous query
>> operating on a cache within it.
>>
>> Thanks,
>> Raymond.
>>
>> On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <
>> raymond_wilson@trimble.com> wrote:
>>
>>> Pavel,
>>>
>>> I have run into a different instance of an out of memory error in a
>>> data region, in a different context from the one I wrote the reproducer
>>> for. In this case, there is an activity which queues items for processing
>>> at a point in the future and which does use a continuous query; however,
>>> there is also significant vanilla put/get activity against a range of
>>> other caches.
>>>
>>> This data region was permitted to grow to 1 GB and has persistence
>>> enabled. We are now using Ignite 2.8.
>>>
>>> I would like to understand if this is a possible failure mode given
>>> that the data region has persistence enabled. The underlying cause
>>> appears to be 'Failed to find a page for eviction'. Should this be
>>> expected on data regions with persistence?
>>>
>>> I have included the error below.
>>>
>>> This is the initial error reported by Ignite:
>>>
>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>> failedToPrepare=5417]
>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>   ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>>   ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>>   ^-- Enable eviction or expiration policies]]
>>>
>>> Following this error is a lock dump, where this is the only thread
>>> with a lock: (I am assuming the structureId member with the value
>>> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
>>> lock against an item in the local node.)
>>>
>>> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
>>> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
>>> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
>>> time=(1591836815071, 2020-06-11 12:53:35.071)
>>> L=1 -> Write lock pageId=284060547022916,
>>> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
>>> partId=602, pageIdx=68, flags=00000001]
>>>
>>> Following the lock dump is this final error before the Ignite node stops:
>>>
>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>> failedToPrepare=5417]
>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>   ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>>   ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>>   ^-- Enable eviction or expiration policies]]
>>>
>>>
>>>
>>>
>>> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> Hi Pavel,
>>>>
>>>> The reproducer is not the actual use case, which is too big to use;
>>>> it's a small example using the same mechanisms. I have not used a data
>>>> streamer before, I'll read up on it.
>>>>
>>>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
>>>> reproducer).
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Raymond,
>>>>>
>>>>> First, I could not reproduce the issue. The attached program runs to
>>>>> completion on my machine.
>>>>>
>>>>> Second, I see a few issues with the attached code:
>>>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>>>> - ICacheEntryEventFilter is used to remove cache entries, and is
>>>>> called twice - on add and on remove
>>>>>
>>>>> My recommendation is to use a "classic" combination of Data Streamer,
>>>>> Continuous Query, and Expiry Policy.
>>>>> Set expiry policy to a few seconds, and you won't keep much data in
>>>>> memory. Ignite will handle the removal for you.
>>>>> Let me know if I should prepare an example.
>>>>>
>>>>> Also, it is not clear why persistence is needed for such a "buffer"
>>>>> cache - items are removed almost immediately, so it would be much more
>>>>> efficient to disable persistence.
>>>>>
>>>>> Thanks,
>>>>> Pavel
>>>>>
>>>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>>>> raymond_wilson@trimble.com> wrote:
>>>>>
>>>>>> Well, it appears I was wrong. It reappeared. :(
>>>>>>
>>>>>> I thought I had sent a reply to this thread but cannot find it, so I
>>>>>> am resending it now.
>>>>>>
>>>>>> Attached is a C# reproducer that throws Ignite out of memory errors
>>>>>> in the situation I outlined above, where cache operations run against
>>>>>> a small cache with persistence enabled.
>>>>>>
>>>>>> Let me know if you're able to reproduce it on your local systems.
>>>>>>
>>>>>> Thanks,
>>>>>> Raymond.
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>
>>>>>>> It's possible this is user (me) error.
>>>>>>>
>>>>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>>>>> 65Mb (typo!) in the client. Making these two values consistent appeared to
>>>>>>> prevent the error.
>>>>>>>
>>>>>>> Raymond.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>>
>>>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>>>
>>>>>>>> I have an error where Ignite throws an out of memory exception,
>>>>>>>> like this:
>>>>>>>>
>>>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>>>   ^-- Enable Ignite persistence
>>>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>>>
>>>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>>>> recommendation when using persistence?)
>>>>>>>>
>>>>>>>> Increasing the off heap memory size for the data region does
>>>>>>>> prevent this error, but I want to minimise the in-memory size for this
>>>>>>>> buffer as it is essentially just a queue.
>>>>>>>>
>>>>>>>> The suggestion of enabling data persistence is strange, as this
>>>>>>>> data region already has persistence enabled.
>>>>>>>>
>>>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>>>> saving and loading values as required.
>>>>>>>>
>>>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>>>> totalling ~440 MB in size (average object size ~30 KB) are added to
>>>>>>>> the cache, and are then drained by a processor using a continuous
>>>>>>>> query. Elements are removed from the cache as the processor
>>>>>>>> completes them.
>>>>>>>>
>>>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>>>> using persistent data regions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Raymond.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>

Re: Out of memory error in data region with persistence enabled

Posted by Raymond Wilson <ra...@trimble.com>.
I have spent some more time on the reproducer. It is now very simple and
reliably reproduces the issue with a plain loop adding slowly growing
entries into a cache, with no continuous queries or filters. I have
attached the source files and the log I obtain when running it.

Running from a clean slate (no existing persistent data) this reproducer
exhibits the out of memory error when adding an element 4150 bytes in size.

I did find this SO article (
https://stackoverflow.com/questions/55937768/ignite-report-igniteoutofmemoryexception-out-of-memory-in-data-region)
that describes the same problem. The solution offered was to increase the
empty page pool size so that it is larger than the biggest element being
added. In the reproducer the empty page pool should already be bigger than
the largest element added up to the point of failure, where 4150 bytes is
the largest size being added. I tried increasing it to 200; it made no
difference.

The reproducer is using a page size of 16384 bytes.
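
Both knobs discussed here are exposed on the storage configuration in
Ignite.NET; a hedged sketch with illustrative values:

    using Apache.Ignite.Core.Configuration;

    var storage = new DataStorageConfiguration
    {
        PageSize = 16384,  // the non-default page size used by the reproducer
        DataRegionConfigurations = new[]
        {
            new DataRegionConfiguration
            {
                Name = "Default",
                PersistenceEnabled = true,
                MaxSize = 64L * 1024 * 1024,
                // The SO answer's suggestion: reserve enough empty pages to
                // fit the largest entry (the default pool size is 100 pages).
                EmptyPagesPoolSize = 200,
            },
        },
    };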

If I set the page size to the default 4096 bytes, this reproducer does not
show the error up to the size limit of 19999 bytes that the reproducer
tests. If I set the page size to 8192 bytes, the reproducer reliably fails
with the error at the item of 6941 bytes.

This feels like a bug in handling non-default page sizes. Would you
recommend switching from 16384 bytes to 4096 for our page size? The reason
I opted for the larger size is that we may have elements ranging in size
from hundreds of bytes to 100 KB, and sometimes larger.

Thanks,
Raymond.


On Thu, Jun 11, 2020 at 4:25 PM Raymond Wilson <ra...@trimble.com>
wrote:

> Just a correction to the context of the data region running out of
> memory: this one does not have a queue of items or a continuous query
> operating on a cache within it.
>
> Thanks,
> Raymond.
>
> On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <ra...@trimble.com>
> wrote:
>
>> Pavel,
>>
>> I have run into a different instance of an out of memory error in a
>> data region, in a different context from the one I wrote the reproducer
>> for. In this case, there is an activity which queues items for processing
>> at a point in the future and which does use a continuous query; however,
>> there is also significant vanilla put/get activity against a range of
>> other caches.
>>
>> This data region was permitted to grow to 1 GB and has persistence
>> enabled. We are now using Ignite 2.8.
>>
>> I would like to understand if this is a possible failure mode given
>> that the data region has persistence enabled. The underlying cause
>> appears to be 'Failed to find a page for eviction'. Should this be
>> expected on data regions with persistence?
>>
>> I have included the error below.
>>
>> This is the initial error reported by Ignite:
>>
>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>> be halted immediately due to the failure: [failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>> failedToPrepare=5417]
>> Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
>> maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>   ^-- Increase maximum off-heap memory size
>> (DataRegionConfiguration.maxSize)
>>   ^-- Enable Ignite persistence
>> (DataRegionConfiguration.persistenceEnabled)
>>   ^-- Enable eviction or expiration policies]]
>>
>> Following this error is a lock dump, where this is the only thread
>> with a lock: (I am assuming the structureId member with the value
>> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
>> lock against an item in the local node.)
>>
>> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
>> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
>> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
>> time=(1591836815071, 2020-06-11 12:53:35.071)
>> L=1 -> Write lock pageId=284060547022916,
>> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
>> partId=602, pageIdx=68, flags=00000001]
>>
>> Following the lock dump is this final error before the Ignite node stops:
>>
>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>> be halted immediately due to the failure: [failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>> failedToPrepare=5417]
>> Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
>> maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>   ^-- Increase maximum off-heap memory size
>> (DataRegionConfiguration.maxSize)
>>   ^-- Enable Ignite persistence
>> (DataRegionConfiguration.persistenceEnabled)
>>   ^-- Enable eviction or expiration policies]]
>>
>>
>>
>>
>> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <
>> raymond_wilson@trimble.com> wrote:
>>
>>> Hi Pavel,
>>>
>>> The reproducer is not the actual use case, which is too big to use;
>>> it's a small example using the same mechanisms. I have not used a data
>>> streamer before, I'll read up on it.
>>>
>>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
>>> reproducer).
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
>>> wrote:
>>>
>>>> Hi Raymond,
>>>>
>>>> First, I could not reproduce the issue. The attached program runs to
>>>> completion on my machine.
>>>>
>>>> Second, I see a few issues with the attached code:
>>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>>> - ICacheEntryEventFilter is used to remove cache entries, and is called
>>>> twice - on add and on remove
>>>>
>>>> My recommendation is to use a "classic" combination of Data Streamer,
>>>> Continuous Query, and Expiry Policy.
>>>> Set expiry policy to a few seconds, and you won't keep much data in
>>>> memory. Ignite will handle the removal for you.
>>>> Let me know if I should prepare an example.
>>>>
>>>> Also, it is not clear why persistence is needed for such a "buffer"
>>>> cache - items are removed almost immediately, so it would be much more
>>>> efficient to disable persistence.
>>>>
>>>> Thanks,
>>>> Pavel
>>>>
>>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>>> raymond_wilson@trimble.com> wrote:
>>>>
>>>>> Well, it appears I was wrong. It reappeared. :(
>>>>>
>>>>> I thought I had sent a reply to this thread but cannot find it, so I
>>>>> am resending it now.
>>>>>
>>>>> Attached is a C# reproducer that throws Ignite out of memory errors
>>>>> in the situation I outlined above, where cache operations run against
>>>>> a small cache with persistence enabled.
>>>>>
>>>>> Let me know if you're able to reproduce it on your local systems.
>>>>>
>>>>> Thanks,
>>>>> Raymond.
>>>>>
>>>>>
>>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>>> raymond_wilson@trimble.com> wrote:
>>>>>
>>>>>> It's possible this is user (me) error.
>>>>>>
>>>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>>>> 65Mb (typo!) in the client. Making these two values consistent appeared to
>>>>>> prevent the error.
>>>>>>
>>>>>> Raymond.
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>>> raymond_wilson@trimble.com> wrote:
>>>>>>
>>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>>
>>>>>>> I have an error where Ignite throws an out of memory exception, like
>>>>>>> this:
>>>>>>>
>>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>>   ^-- Enable Ignite persistence
>>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>>
>>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>>> recommendation when using persistence?)
>>>>>>>
>>>>>>> Increasing the off heap memory size for the data region does prevent
>>>>>>> this error, but I want to minimise the in-memory size for this buffer as it
>>>>>>> is essentially just a queue.
>>>>>>>
>>>>>>> The suggestion of enabling data persistence is strange as this data
>>>>>>> region already has persistence enabled.
>>>>>>>
>>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>>> saving and loading values as required.
>>>>>>>
>>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>>> totalling ~440 Mb in size (average object size = ~30Kb) are added to the
>>>>>>> cache, and are then drained by a processor using a continuous query.
>>>>>>> Elements are removed from the cache as the processor completes them.
>>>>>>>
>>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>>> using persistent data regions?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Raymond.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> <http://www.trimble.com/>
>>>>> Raymond Wilson
>>>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>>>> 11 Birmingham Drive | Christchurch, New Zealand
>>>>> +64-21-2013317 Mobile
>>>>> raymond_wilson@trimble.com
>>>>>
>>>>>
>>>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>>>
>>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> +64-21-2013317 Mobile
>>> raymond_wilson@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Solution Architect, Civil Construction Software Systems (CCSS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> +64-21-2013317 Mobile
>> raymond_wilson@trimble.com
>>
>>
>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> raymond_wilson@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Out of memory error in data region with persistence enabled

Posted by Raymond Wilson <ra...@trimble.com>.
Just a correction to the context of the data region running out of memory:
this one does not have a queue of items or a continuous query operating on a
cache within it.

Thanks,
Raymond.

On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <ra...@trimble.com>
wrote:

> Pavel,
>
> I have run into a different instance of an out of memory error in a data
> region, in a different context from the one I wrote the reproducer for. In
> this case, there is an activity which queues items for processing at a
> point in the future and which does use a continuous query; however, there is
> also significant vanilla put/get activity against a range of other caches.
>
> This data region was permitted to grow to 1Gb and has persistence enabled.
> We are now using Ignite 2.8.
>
> I would like to understand if this is a possible failure mode given that
> the data region has persistence enabled. The underlying cause appears to be
> 'Failed to find a page for eviction'. Should this be expected on data
> regions with persistence?
>
> I have included the error below.
>
> This is the initial error reported by Ignite:
>
> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will be
> halted immediately due to the failure: [failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
> failedToPrepare=5417]
> Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
> maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>   ^-- Increase maximum off-heap memory size
> (DataRegionConfiguration.maxSize)
>   ^-- Enable Ignite persistence
> (DataRegionConfiguration.persistenceEnabled)
>   ^-- Enable eviction or expiration policies]]
>
> Following this error is a lock dump, where this is the only thread with a
> lock: (I am assuming the structureId member with the value
> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
> lock against an item in the local node.)
>
> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
> time=(1591836815071, 2020-06-11 12:53:35.071)
> L=1 -> Write lock pageId=284060547022916,
> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
> partId=602, pageIdx=68, flags=00000001]
>
> Following the lock dump is this final error before the Ignite node stops:
>
> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will be
> halted immediately due to the failure: [failureCtx=FailureContext
> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
> failedToPrepare=5417]
> Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
> maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>   ^-- Increase maximum off-heap memory size
> (DataRegionConfiguration.maxSize)
>   ^-- Enable Ignite persistence
> (DataRegionConfiguration.persistenceEnabled)
>   ^-- Enable eviction or expiration policies]]
>
>
>
>
> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <ra...@trimble.com>
> wrote:
>
>> Hi Pavel,
>>
>> The reproducer is not the actual use case, which is too big to use - it's
>> a small example using the same mechanisms. I have not used a data streamer
>> before, so I'll read up on it.
>>
>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
>> reproducer).
>>
>> Thanks,
>> Raymond.
>>
>>
>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
>> wrote:
>>
>>> Hi Raymond,
>>>
>>> First, I could not reproduce the issue. The attached program runs to
>>> completion on my machine.
>>>
>>> Second, I see a few issues with the attached code:
>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>> - ICacheEntryEventFilter is used to remove cache entries, and is called
>>> twice - on add and on remove
>>>
>>> My recommendation is to use a "classic" combination of Data Streamer,
>>> Continuous Query, and Expiry Policy.
>>> Set expiry policy to a few seconds, and you won't keep much data in
>>> memory. Ignite will handle the removal for you.
>>> Let me know if I should prepare an example.
>>>
>>> Also it is not clear why persistence is needed for such a "buffer" cache
>>> - items are removed almost immediately, so
>>> it would be much more efficient to disable persistence.
>>>
>>> Thanks,
>>> Pavel
>>>
>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> Well, it appears I was wrong. It reappeared. :(
>>>>
>>>> I thought I had sent a reply to this thread but cannot find it, so I am
>>>> resending it now.
>>>>
>>>> Attached is a C# reproducer that throws Ignite out of memory errors in
>>>> the situation I outlined above, where cache operations are performed
>>>> against a small cache with persistence enabled.
>>>>
>>>> Let me know if you're able to reproduce it on your local systems.
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>> raymond_wilson@trimble.com> wrote:
>>>>
>>>>> It's possible this is user (me) error.
>>>>>
>>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>>> 65Mb (typo!) in the client. Making these two values consistent appeared to
>>>>> prevent the error.
>>>>>
>>>>> Raymond.
>>>>>
>>>>>
>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>> raymond_wilson@trimble.com> wrote:
>>>>>
>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>
>>>>>> I have an error where Ignite throws an out of memory exception, like
>>>>>> this:
>>>>>>
>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>> will be halted immediately due to the failure: [failureCtx=FailureContext
>>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>   ^-- Enable Ignite persistence
>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>
>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>> recommendation when using persistence?)
>>>>>>
>>>>>> Increasing the off heap memory size for the data region does prevent
>>>>>> this error, but I want to minimise the in-memory size for this buffer as it
>>>>>> is essentially just a queue.
>>>>>>
>>>>>> The suggestion of enabling data persistence is strange as this data
>>>>>> region already has persistence enabled.
>>>>>>
>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>> saving and loading values as required.
>>>>>>
>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>> totalling ~440 Mb in size (average object size = ~30Kb) are added to the
>>>>>> cache, and are then drained by a processor using a continuous query.
>>>>>> Elements are removed from the cache as the processor completes them.
>>>>>>
>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>> using persistent data regions?
>>>>>>
>>>>>> Thanks,
>>>>>> Raymond.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> <http://www.trimble.com/>
>>>> Raymond Wilson
>>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>>> 11 Birmingham Drive | Christchurch, New Zealand
>>>> +64-21-2013317 Mobile
>>>> raymond_wilson@trimble.com
>>>>
>>>>
>>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>>
>>>
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Solution Architect, Civil Construction Software Systems (CCSS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> +64-21-2013317 Mobile
>> raymond_wilson@trimble.com
>>
>>
>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> raymond_wilson@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Out of memory error in data region with persistence enabled

Posted by Raymond Wilson <ra...@trimble.com>.
Pavel,

I have run into a different instance of an out of memory error in a data
region, in a different context from the one I wrote the reproducer for. In
this case, there is an activity which queues items for processing at a
point in the future and which does use a continuous query; however, there is
also significant vanilla put/get activity against a range of other caches.

This data region was permitted to grow to 1Gb and has persistence enabled.
We are now using Ignite 2.8.

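For reference, the region is declared along these lines in our Ignite.NET
configuration (a simplified sketch: the region name and sizes match the
error below, but the surrounding code is illustrative rather than our
actual production setup):

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Configuration;

    var cfg = new IgniteConfiguration
    {
        DataStorageConfiguration = new DataStorageConfiguration
        {
            DataRegionConfigurations = new[]
            {
                new DataRegionConfiguration
                {
                    Name = "Default-Immutable",
                    InitialSize = 128L * 1024 * 1024,  // 128 MiB
                    MaxSize = 1024L * 1024 * 1024,     // 1 GiB
                    PersistenceEnabled = true
                }
            }
        }
    };

    // Caches are placed in the region via CacheConfiguration.DataRegionName.
    var ignite = Ignition.Start(cfg);
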
I would like to understand if this is a possible failure mode given that
the data region has persistence enabled. The underlying cause appears to be
'Failed to find a page for eviction'. Should this be expected on data
regions with persistence?

I have included the error below.

This is the initial error reported by Ignite:

2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will be
halted immediately due to the failure: [failureCtx=FailureContext
[type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
failedToPrepare=5417]
Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
  ^-- Increase maximum off-heap memory size
(DataRegionConfiguration.maxSize)
  ^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
  ^-- Enable eviction or expiration policies]]

Following this error is a lock dump, where this is the only thread with a
lock: (I am assuming the structureId member with the value
'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
lock against an item in the local node.)

Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
time=(1591836815071, 2020-06-11 12:53:35.071)
L=1 -> Write lock pageId=284060547022916,
structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
partId=602, pageIdx=68, flags=00000001]

Following the lock dump is this final error before the Ignite node stops:

2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will be
halted immediately due to the failure: [failureCtx=FailureContext
[type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
failedToPrepare=5417]
Out of memory in data region [name=Default-Immutable, initSize=128.0 MiB,
maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
  ^-- Increase maximum off-heap memory size
(DataRegionConfiguration.maxSize)
  ^-- Enable Ignite persistence (DataRegionConfiguration.persistenceEnabled)
  ^-- Enable eviction or expiration policies]]




On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <ra...@trimble.com>
wrote:

> Hi Pavel,
>
> The reproducer is not the actual use case, which is too big to use - it's a
> small example using the same mechanisms. I have not used a data streamer
> before, so I'll read up on it.
>
> I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
> reproducer).
>
> Thanks,
> Raymond.
>
>
> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
> wrote:
>
>> Hi Raymond,
>>
>> First, I could not reproduce the issue. The attached program runs to
>> completion on my machine.
>>
>> Second, I see a few issues with the attached code:
>> - Cache.PutIfAbsent is used instead of DataStreamer
>> - ICacheEntryEventFilter is used to remove cache entries, and is called
>> twice - on add and on remove
>>
>> My recommendation is to use a "classic" combination of Data Streamer,
>> Continuous Query, and Expiry Policy.
>> Set expiry policy to a few seconds, and you won't keep much data in
>> memory. Ignite will handle the removal for you.
>> Let me know if I should prepare an example.
>>
>> Also it is not clear why persistence is needed for such a "buffer" cache
>> - items are removed almost immediately, so
>> it would be much more efficient to disable persistence.
>>
>> Thanks,
>> Pavel
>>
>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>> raymond_wilson@trimble.com> wrote:
>>
>>> Well, it appears I was wrong. It reappeared. :(
>>>
>>> I thought I had sent a reply to this thread but cannot find it, so I am
>>> resending it now.
>>>
>>> Attached is a C# reproducer that throws Ignite out of memory errors in
>>> the situation I outlined above, where cache operations are performed
>>> against a small cache with persistence enabled.
>>>
>>> Let me know if you're able to reproduce it on your local systems.
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> It's possible this is user (me) error.
>>>>
>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>> 65Mb (typo!) in the client. Making these two values consistent appeared to
>>>> prevent the error.
>>>>
>>>> Raymond.
>>>>
>>>>
>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>> raymond_wilson@trimble.com> wrote:
>>>>
>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>
>>>>> I have an error where Ignite throws an out of memory exception, like
>>>>> this:
>>>>>
>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM will
>>>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>   ^-- Increase maximum off-heap memory size
>>>>> (DataRegionConfiguration.maxSize)
>>>>>   ^-- Enable Ignite persistence
>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>
>>>>> I don't have an eviction policy set (is this even a valid
>>>>> recommendation when using persistence?)
>>>>>
>>>>> Increasing the off heap memory size for the data region does prevent
>>>>> this error, but I want to minimise the in-memory size for this buffer as it
>>>>> is essentially just a queue.
>>>>>
>>>>> The suggestion of enabling data persistence is strange as this data
>>>>> region already has persistence enabled.
>>>>>
>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>> saving and loading values as required.
>>>>>
>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>> totalling ~440 Mb in size (average object size = ~30Kb) are added to the
>>>>> cache, and are then drained by a processor using a continuous query.
>>>>> Elements are removed from the cache as the processor completes them.
>>>>>
>>>>> Is this kind of out of memory error supposed to be possible when using
>>>>> persistent data regions?
>>>>>
>>>>> Thanks,
>>>>> Raymond.
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> +64-21-2013317 Mobile
>>> raymond_wilson@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> raymond_wilson@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Out of memory error in data region with persistence enabled

Posted by Raymond Wilson <ra...@trimble.com>.
Hi Pavel,

The reproducer is not the actual use case, which is too big to use - it's a
small example using the same mechanisms. I have not used a data streamer
before, so I'll read up on it.

I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
reproducer).

Thanks,
Raymond.


On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <pt...@apache.org>
wrote:

> Hi Raymond,
>
> First, I could not reproduce the issue. The attached program runs to
> completion on my machine.
>
> Second, I see a few issues with the attached code:
> - Cache.PutIfAbsent is used instead of DataStreamer
> - ICacheEntryEventFilter is used to remove cache entries, and is called
> twice - on add and on remove
>
> My recommendation is to use a "classic" combination of Data Streamer,
> Continuous Query, and Expiry Policy.
> Set expiry policy to a few seconds, and you won't keep much data in
> memory. Ignite will handle the removal for you.
> Let me know if I should prepare an example.
>
> Also it is not clear why persistence is needed for such a "buffer" cache -
> items are removed almost immediately, so
> it would be much more efficient to disable persistence.
>
> Thanks,
> Pavel
>
> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
> raymond_wilson@trimble.com> wrote:
>
>> Well, it appears I was wrong. It reappeared. :(
>>
>> I thought I had sent a reply to this thread but cannot find it, so I am
>> resending it now.
>>
>> Attached is a C# reproducer that throws Ignite out of memory errors in
>> the situation I outlined above, where cache operations are performed
>> against a small cache with persistence enabled.
>>
>> Let me know if you're able to reproduce it on your local systems.
>>
>> Thanks,
>> Raymond.
>>
>>
>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <ra...@trimble.com>
>> wrote:
>>
>>> It's possible this is user (me) error.
>>>
>>> I discovered I had set the cache size to be 64Mb in the server, but 65Mb
>>> (typo!) in the client. Making these two values consistent appeared to
>>> prevent the error.
>>>
>>> Raymond.
>>>
>>>
>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>> raymond_wilson@trimble.com> wrote:
>>>
>>>> I'm using Ignite v2.7.5 with C# client.
>>>>
>>>> I have an error where Ignite throws an out of memory exception, like
>>>> this:
>>>>
>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM will
>>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>   ^-- Increase maximum off-heap memory size
>>>> (DataRegionConfiguration.maxSize)
>>>>   ^-- Enable Ignite persistence
>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>   ^-- Enable eviction or expiration policies]]
>>>>
>>>> I don't have an eviction policy set (is this even a valid
>>>> recommendation when using persistence?)
>>>>
>>>> Increasing the off heap memory size for the data region does prevent
>>>> this error, but I want to minimise the in-memory size for this buffer as it
>>>> is essentially just a queue.
>>>>
>>>> The suggestion of enabling data persistence is strange as this data
>>>> region already has persistence enabled.
>>>>
>>>> My assumption is that Ignite manages the memory in this cache by saving
>>>> and loading values as required.
>>>>
>>>> The test workflow in this failure is one where ~14,500 objects
>>>> totalling ~440 Mb in size (average object size = ~30Kb) are added to the
>>>> cache, and are then drained by a processor using a continuous query.
>>>> Elements are removed from the cache as the processor completes them.
>>>>
>>>> Is this kind of out of memory error supposed to be possible when using
>>>> persistent data regions?
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>>
>>
>> --
>> <http://www.trimble.com/>
>> Raymond Wilson
>> Solution Architect, Civil Construction Software Systems (CCSS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> +64-21-2013317 Mobile
>> raymond_wilson@trimble.com
>>
>>
>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>
>

-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wilson@trimble.com

<https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>

Re: Out of memory error in data region with persistence enabled

Posted by Pavel Tupitsyn <pt...@apache.org>.
Hi Raymond,

First, I could not reproduce the issue. The attached program runs to completion
on my machine.

Second, I see a few issues with the attached code:
- Cache.PutIfAbsent is used instead of DataStreamer
- ICacheEntryEventFilter is used to remove cache entries, and is called
twice - on add and on remove

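On the second point, an event filter should only decide whether an event is
delivered, without mutating the cache. A rough sketch of a side-effect-free
filter, assuming the EventType property exposed by recent Ignite.NET
versions (the type parameters here are illustrative, not taken from your
reproducer):

    using Apache.Ignite.Core.Cache.Event;

    // Remote filter: pass only newly created entries through to the
    // local listener; ignore the events raised by updates and removals.
    class CreatedOnlyFilter : ICacheEntryEventFilter<int, byte[]>
    {
        public bool Evaluate(ICacheEntryEvent<int, byte[]> evt)
        {
            return evt.EventType == CacheEntryEventType.Created;
        }
    }

It plugs into the query via the ContinuousQuery<TK, TV>.Filter property.
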
My recommendation is to use a "classic" combination of Data Streamer,
Continuous Query, and Expiry Policy.
Set expiry policy to a few seconds, and you won't keep much data in memory.
Ignite will handle the removal for you.
Let me know if I should prepare an example.

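In the meantime, a rough sketch of what I mean (the cache name, key/value
types, payload sizes and the five-second expiry are placeholders, and the
code is untested):

    using System;
    using System.Collections.Generic;
    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache.Configuration;
    using Apache.Ignite.Core.Cache.Event;
    using Apache.Ignite.Core.Cache.Expiry;
    using Apache.Ignite.Core.Cache.Query.Continuous;
    using Apache.Ignite.Core.Common;

    // Local listener: the continuous query pushes new entries here.
    class Processor : ICacheEntryEventListener<int, byte[]>
    {
        public void OnEvent(IEnumerable<ICacheEntryEvent<int, byte[]>> evts)
        {
            foreach (var e in evts)
            {
                if (e.EventType != CacheEntryEventType.Created)
                    continue;  // skip events raised by expiry and removal

                // ... process e.Value here ...
            }
        }
    }

    // Expiry is set on the cache itself so that entries written by the
    // data streamer are covered as well.
    class ShortExpiry : IFactory<IExpiryPolicy>
    {
        public IExpiryPolicy CreateInstance() =>
            new ExpiryPolicy(TimeSpan.FromSeconds(5), null, null);
    }

    class Program
    {
        static void Main()
        {
            var ignite = Ignition.Start();

            var cache = ignite.GetOrCreateCache<int, byte[]>(
                new CacheConfiguration
                {
                    Name = "buffer",
                    ExpiryPolicyFactory = new ShortExpiry()
                });

            var qry = new ContinuousQuery<int, byte[]>(new Processor());

            using (cache.QueryContinuous(qry))  // start receiving events
            using (var streamer =
                ignite.GetDataStreamer<int, byte[]>(cache.Name))
            {
                for (var i = 0; i < 14500; i++)
                    streamer.AddData(i, new byte[30 * 1024]);  // ~30 KB each
            }  // disposing the streamer flushes buffered entries
        }
    }
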
Also it is not clear why persistence is needed for such a "buffer" cache -
items are removed almost immediately, so
it would be much more efficient to disable persistence.

Thanks,
Pavel

On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <ra...@trimble.com>
wrote:

> Well, it appears I was wrong. It reappeared. :(
>
> I thought I had sent a reply to this thread but cannot find it, so I am
> resending it now.
>
> Attached is a C# reproducer that throws Ignite out of memory errors in the
> situation I outlined above, where cache operations are performed against a
> small cache with persistence enabled.
>
> Let me know if you're able to reproduce it on your local systems.
>
> Thanks,
> Raymond.
>
>
> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <ra...@trimble.com>
> wrote:
>
>> It's possible this is user (me) error.
>>
>> I discovered I had set the cache size to be 64Mb in the server, but 65Mb
>> (typo!) in the client. Making these two values consistent appeared to
>> prevent the error.
>>
>> Raymond.
>>
>>
>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>> raymond_wilson@trimble.com> wrote:
>>
>>> I'm using Ignite v2.7.5 with C# client.
>>>
>>> I have an error where Ignite throws an out of memory exception, like
>>> this:
>>>
>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM will
>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>   ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>>   ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>>   ^-- Enable eviction or expiration policies]]
>>>
>>> I don't have an eviction policy set (is this even a valid recommendation
>>> when using persistence?)
>>>
>>> Increasing the off heap memory size for the data region does prevent
>>> this error, but I want to minimise the in-memory size for this buffer as it
>>> is essentially just a queue.
>>>
>>> The suggestion of enabling data persistence is strange as this data
>>> region already has persistence enabled.
>>>
>>> My assumption is that Ignite manages the memory in this cache by saving
>>> and loading values as required.
>>>
>>> The test workflow in this failure is one where ~14,500 objects totalling
>>> ~440 Mb in size (average object size = ~30Kb) are added to the cache, and are
>>> then drained by a processor using a continuous query. Elements are removed
>>> from the cache as the processor completes them.
>>>
>>> Is this kind of out of memory error supposed to be possible when using
>>> persistent data regions?
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> raymond_wilson@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>