You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/12/09 18:43:00 UTC

[jira] [Commented] (GEODE-9854) Orphaned .drf files causing memory leak

    [ https://issues.apache.org/jira/browse/GEODE-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456669#comment-17456669 ] 

ASF subversion and git services commented on GEODE-9854:
--------------------------------------------------------

Commit 324ed89c3d43a53466cf5aeb614b63e757ba8b23 in geode's branch refs/heads/develop from Jakov Varenina
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=324ed89 ]

GEODE-9854: Orphaned .drf file causing memory leak (#7145)

* GEODE-9854: Orphaned .drf file causing memory leak

Issue:
An OpLog files are compacted, but the .drf file is left because it contains deletes of
entries in previous .crfs. The .crf file is deleted, but the orphaned .drf is not until all
previous .crf files (.crfs with smaller id) are deleted.

The problem is that compacted Oplog object representing orphaned .drf file holds
a structure in memory (Oplog.regionMap) that contains information that is not useful
after the compaction and it takes certain amount of memory. Besides, there is a race
condition in the code when creating .krf files that, depending on the execution order,
could make the problem more severe (it could leave pendingKrfTags structure
on the regionMap and this could take up a significant amount of memory). This
pendingKrfTags HashMap is actually empty, but consumes memory because it was used
previously and the size of the HashMap was not reduced after it is cleared.
This race condition usually happens when new Oplog is rolled out and previous Oplog
is immediately marked as eligible for compaction. Compaction and .krf creation start at
the similar time and compactor cancels creation of .krf if it is executed first.
The pendingKrfTags structure is usually cleared when .krf file is created, but since
compaction canceled creation of .krf, the pendingKrfTags structure remain in memory
until Oplog representing orphaned .drf file is deleted.

Solution:
Clear the regionMap data structure of the Oplog when it is compacted (currently it is
deleted when the Oplog is destroyed).

* introduced inner static class RegionMap in Oplog
* RegionMap.get() will return always empty map if it was closed before
* When closing disk region skip adding only drf oplog to unrecovered
map and also don't try to remove it from regionMap (it was already
removed during compaction).

* Following test cases are introduced:

1. Recovery of single region after cache is closed and then recreated
(testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterCacheClosed)

2. Recovery of single region after region is closed and then recreated
(testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterRegionClose)

Co-authored-by: Alberto Gomez <al...@est.tech>

> Orphaned .drf files causing memory leak
> ---------------------------------------
>
>                 Key: GEODE-9854
>                 URL: https://issues.apache.org/jira/browse/GEODE-9854
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>         Attachments: screenshot-1.png, screenshot-2.png, server1.log
>
>
> Issue:
> An OpLog files are compacted, but the .drf file is left because it contains deletes ofentries in previous .crfs. The .crf file is deleted, but the orphaned .drf is not until all
> previous .crf files (.crfs with smaller id) are deleted.
> The problem is that compacted Oplog object representing orphaned .drf file holds a structure in memory (Oplog.regionMap) that contains information that is not useful
> after the compaction and it takes certain amount of memory. Besides, there is a race condition in the code when creating .krf files that, depending on the execution order,
> could make the problem more severe  (it could leave pendingKrfTags structure on the regionMap and this could take up a significant amount of memory). This
> pendingKrfTags HashMap is actually empty, but consumes memory because it was used previously and the size of the HashMap was not reduced after it is cleared.
> This race condition usually happens when new Oplog is rolled out and previous Oplog is immediately marked as eligible for compaction. Compaction and .krf creation start at
> the similar time and compactor cancels creation of .krf if it is executed first. The pendingKrfTags structure is usually cleared when .krf file is created, but sincecompaction canceled creation of .krf, the pendingKrfTags structure remain in memory until Oplog representing orphaned .drf file is deleted.
> Below it can be see that actually .krf is never created for the orphaned .drf Oplog object that has memory allocated in pendingKrfTags:
> {code:java}
> server1.log:1956:[info 2021/11/25 21:52:26.866 CET server1 <Replicate/Partition Region Garbage Collector> tid=0x34] Created oplog#129 drf for disk store store1.
> server1.log:1958:[info 2021/11/25 21:52:26.867 CET server1 <Replicate/Partition Region Garbage Collector> tid=0x34] Created oplog#129 crf for disk store store1.
> server1.log:1974:[info 2021/11/25 21:52:39.490 CET server1 <OplogCompactor store1 for oplog oplog#129> tid=0x5c] OplogCompactor for store1 compaction oplog id(s): oplog#129
> server1.log:1980:[info 2021/11/25 21:52:39.532 CET server1 <OplogCompactor store1 for oplog oplog#129> tid=0x5c] compaction did 3685 creates and updates in 41 ms
> server1.log:1982:[info 2021/11/25 21:52:39.532 CET server1 <Oplog Delete Task4> tid=0x5d] Deleted oplog#129 crf for disk store store1.
> {code}
> !screenshot-1.png|width=1123,height=268!
> Below you can see the log and heap dump of orphaned .drf Oplg that dont have pendingKrfTags allocated in memory. This is because pendingKrfTags is cleared when .krf is created as can be seen in below logs.
> {code:java}
> server1.log:1976:[info 2021/11/25 21:52:39.491 CET server1 <Replicate/Partition Region Garbage Collector> tid=0x34] Created oplog#130 drf for disk store store1.
> server1.log:1978:[info 2021/11/25 21:52:39.493 CET server1 <Replicate/Partition Region Garbage Collector> tid=0x34] Created oplog#130 crf for disk store store1.
> server1.log:1998:[info 2021/11/25 21:52:41.131 CET server1 <Idle OplogCompactor> tid=0x5c] Created oplog#130 krf for disk store store1.
> server1.log:2000:[info 2021/11/25 21:52:41.893 CET server1 <OplogCompactor store1 for oplog oplog#130> tid=0x5c|#130> tid=0x5c] OplogCompactor for store1 compaction oplog id(s): oplog#130
> server1.log:2002:[info 2021/11/25 21:52:41.958 CET server1 <OplogCompactor store1 for oplog oplog#130> tid=0x5c|#130> tid=0x5c] compaction did 9918 creates and updates in 64 ms
> server1.log:2004:[info 2021/11/25 21:52:41.958 CET server1 <Oplog Delete Task4> tid=0x5d] Deleted oplog#130 crf for disk store store1.
> server1.log:2006:[info 2021/11/25 21:52:41.958 CET server1 <Oplog Delete Task4> tid=0x5d] Deleted oplog#130 krf for disk store store1.
> {code}
> !screenshot-2.png|width=1123,height=268!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)