Posted to issues@ignite.apache.org by "Vladimir Steshin (Jira)" <ji...@apache.org> on 2022/10/27 18:35:00 UTC
[jira] [Comment Edited] (IGNITE-17735) Datastreamer may consume heap with allowOverwtire=='false'.

    [ https://issues.apache.org/jira/browse/IGNITE-17735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607891#comment-17607891 ] 

Vladimir Steshin edited comment on IGNITE-17735 at 10/27/22 6:34 PM:
---------------------------------------------------------------------

Datastreamer with 'allowOverwrite==true' and a PRIMARY_SYNC persistent cache may cause heap issues or consume an increased amount of heap.

Depending on the streamer receiver, the 'allowOverwrite' setting and the cache sync mode, the streamer node may not wait for backup updates and keeps sending more and more streamer batches to process. The receiving node accumulates the futures and requests related to the backup updates.
The same happens on the backup node: it accumulates incoming update requests that are stuck at disk writes. See 'DS_heap_consumption.png' for an example.
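
To illustrate the idea of limiting unacknowledged batches, here is a minimal sketch. It is not Ignite code: the class `BoundedBatchSender`, its method and the 2 ms sleep standing in for a disk write are hypothetical. It only shows that when a batch is acknowledged after the slow write has finished, the number of in-flight batches stays within the limit.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model, not Ignite code: a sender that caps unacknowledged batches. */
final class BoundedBatchSender {
    /** Sends {@code batches} batches with at most {@code maxInFlight} unacknowledged. Returns the observed peak. */
    static int run(int batches, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService receiver = Executors.newFixedThreadPool(4);

        try {
            for (int i = 0; i < batches; i++) {
                // The sender blocks here instead of piling more work onto the receiver.
                permits.acquireUninterruptibly();

                int cur = inFlight.incrementAndGet();
                peak.accumulateAndGet(cur, Math::max);

                CompletableFuture.runAsync(() -> {
                    try {
                        Thread.sleep(2); // Stands in for a slow disk or WAL write.
                    }
                    catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    finally {
                        // Acknowledge only after the write is done.
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                }, receiver);
            }
        }
        finally {
            receiver.shutdown();
            try {
                receiver.awaitTermination(10, TimeUnit.SECONDS);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        return peak.get();
    }
}
```

If the permit were released before the write completes (which is roughly what happens when a batch is responded to before its backup updates finish), the peak would no longer be bounded by `maxInFlight`.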

There is a related 'perNodeParallelOperations()' setting, so this is probably not an issue at all. What is discouraging is that I met this problem in a trivial setup: a few servers, a simple cache, and just trying data streaming with various persistence and load settings (like `HeapConsumptionDataStreamerTest.src`). I think a user may meet the same. The default value might be improved for persistence.

Suggestion: introduce a reduced default number of parallel batches for persistent caches, `IgniteDataStreamer#DFLT_PARALLEL_OPS_PERSISTENT_MULTIPLIER` (PR #10343).
Or use a per-internal-receiver setting, `InternalUpdater#perNodeParallelOperations()` (PR #10351).

I ran estimation benchmarks. Even the in-memory benchmarks (like 'bench_inmem_isolated_pc2.txt') show that 2, or maybe 4, batches per thread seem enough.

For persistent caches, `CPUs x 2` seems enough. See `bench_persistent_results_Isolated_pc1.txt` and `bench_persistent_results_Individual_pc1.txt`.
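
As a rough sketch of the `CPUs x 2` arithmetic (the class and method names below are hypothetical and are not taken from either PR), the per-node limit for a persistent cache could be derived like this:

```java
/** Hypothetical helper, not Ignite code: derives a per-node parallel batch limit from the CPU count. */
final class StreamerParallelOps {
    /** Multiplier suggested in this comment for persistent caches. */
    static final int PERSISTENT_MULTIPLIER = 2;

    /** Returns {@code cpus x multiplier}, failing fast on non-positive input or int overflow. */
    static int perNodeParallelOps(int cpus, int multiplier) {
        if (cpus <= 0 || multiplier <= 0)
            throw new IllegalArgumentException("cpus and multiplier must be positive");

        return Math.multiplyExact(cpus, multiplier);
    }

    /** Limit for a persistent cache on a node with the given CPU count. */
    static int forPersistentCache(int cpus) {
        return perNodeParallelOps(cpus, PERSISTENT_MULTIPLIER);
    }
}
```

For example, on an 8-CPU node this gives 16 parallel batches. In a real deployment the CPU count would come from the receiving node, for instance via `Runtime.getRuntime().availableProcessors()` on that node.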


was (Author: vladsz83):
Datastreamer with '_allowOverwrite==true_' and _ATOMIC/PRIMARY_SYNC_ persistent cache may consume heap. 

The streamer was created before persistence. Its default settings are still tuned for in-memory caches. The streamer decides how much data to send to a node based on the number of CPUs. This is probably not the best approach for persistent caches.

There is a related 'perNodeParallelOperations()' setting. But the defaults might be adjusted for persistence.

Suggestion: reduce the default maximum number of unresponded streamer batches for persistent caches. There is no reason to send more than 4-8-16 unresponded batches because they get stuck at disk writes, WAL writes, page replacements, WAL rolling, GCs and so on.

The problem is that a certain streamer receiver might not wait for backup updates on the loading node and keeps sending update batches again and again. The default _Individual_ receiver (when _allowOverwrite_ is _true_) uses _cache.put()_. Every put creates additional backup requests. But the current streamer batch request has already been responded to, so the next batch of updates is accepted. Nodes start accumulating update-related structures in the heap. Some JFR screenshots are attached.

 See `DataStreamProcessorSelfTest.testAtomicPrimarySyncStability()`, `JmhStreamerReceiverBenchmark`.

> Datastreamer may consume heap with allowOverwtire=='false'.
> -----------------------------------------------------------
>
>                 Key: IGNITE-17735
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17735
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Vladimir Steshin
>            Assignee: Vladimir Steshin
>            Priority: Major
>              Labels: ise
>         Attachments: DS_heap_consumption.png, DS_heap_consumption_2.png, HeapConsumptionDataStreamerTest.src, bench_inmem_individual_pc2.txt, bench_inmem_isolated_pc2.txt, bench_persistent_full_Individual_pc1.txt, bench_persistent_results_Individual_pc1.txt, bench_persistent_results_Isolated_pc1.txt
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)