Posted to issues@geode.apache.org by "Owen Nichols (Jira)" <ji...@apache.org> on 2022/06/22 20:47:05 UTC

[jira] [Closed] (GEODE-9881) Fully recovered Oplogs object indicating unrecoveredRegionCount>0 preventing compaction

     [ https://issues.apache.org/jira/browse/GEODE-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen Nichols closed GEODE-9881.
-------------------------------

> Fully recovered Oplogs object indicating unrecoveredRegionCount>0 preventing compaction
> ---------------------------------------------------------------------------------------
>
>                 Key: GEODE-9881
>                 URL: https://issues.apache.org/jira/browse/GEODE-9881
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> We found a problem that occurs when a region is closed with Region.close() and then recreated to start recovery. If you inspect the following close() function, you will notice that its handling of the DiskRegionInfo object does not make sense:
> {code:java}
>   void close(DiskRegion dr) {
>     // while a krf is being created can not close a region
>     lockCompactor();
>     try {
>       if (!isDrfOnly()) {
>         DiskRegionInfo dri = getDRI(dr);
>         if (dri != null) {
>           long clearCount = dri.clear(null);
>           if (clearCount != 0) {
>             totalLiveCount.addAndGet(-clearCount);
>             // no need to call handleNoLiveValues because we now have an
>             // unrecovered region.
>           }
>           regionMap.get().remove(dr.getId(), dri);
>         }
>         addUnrecoveredRegion(dr.getId());
>       }
>     } finally {
>       unlockCompactor();
>     }
>   }
> {code}
> Note that addUnrecoveredRegion() marks the DiskRegionInfo object as unrecovered and increments the unrecoveredRegionCount counter. This DiskRegionInfo object is stored in the regionMap structure. close() then removes that same DiskRegionInfo object (the one just marked as unrecovered) from regionMap. This makes no sense: the object is updated and then dropped from the map, leaving it to be garbage collected. As described below, this causes problems when the region is recovered.
> The following code runs at recovery:
> {code:java}
> /**
>  * For each dri that this oplog has that is currently unrecoverable check to see if a DiskRegion
>  * that is recoverable now exists.
>  */
> void checkForRecoverableRegion(DiskRegionView dr) {
>   if (unrecoveredRegionCount.get() > 0) {
>     DiskRegionInfo dri = getDRI(dr);
>     if (dri != null) {
>       if (dri.testAndSetRecovered(dr)) {
>         unrecoveredRegionCount.decrementAndGet();
>       }
>     }
>   }
> }
> {code}
> The problem is that Geode does not reset the unrecoveredRegionCount counter in Oplog objects once recovery is done. checkForRecoverableRegion() checks unrecoveredRegionCount and then calls testAndSetRecovered(), which always returns false because none of the DiskRegionInfo objects in regionMap has its unrecovered flag set to true: every object marked as unrecovered was deleted by close() and then recreated during recovery (see the note below). As a result, all Oplogs are fully recovered while the counter still incorrectly indicates unrecoveredRegionCount>0. This later prevents compaction of the recovered Oplogs (the files that have .crf, .drf and .krf) when they reach the compaction threshold.
> Note: During recovery, regionMap is rebuilt from the Oplog files. Because close() deletes all DiskRegionInfo objects from regionMap, they are recreated by the initRecoveredEntry function during recovery, and each one is created with its unrecovered flag set to false.
>  
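
The sequence described in the report can be reproduced with a small standalone model. This is only a sketch under stated assumptions, not Geode code: OplogModel, RegionInfo, recoverEntry() and eligibleForCompaction() are hypothetical names, RegionInfo keeps nothing but the unrecovered flag, and the compaction gate is reduced to a single counter check. close() follows the order given in the description (mark the entry, bump the counter, then remove the entry).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for an oplog; not the real Geode class.
class OplogModel {

  // Hypothetical stand-in for DiskRegionInfo: it only tracks the unrecovered flag.
  static final class RegionInfo {
    private boolean unrecovered;

    // Returns true only when the flag actually changes from false to true.
    boolean testAndSetUnrecovered() {
      if (unrecovered) {
        return false;
      }
      unrecovered = true;
      return true;
    }

    // Returns true only when the flag actually changes from true to false.
    boolean testAndSetRecovered() {
      if (!unrecovered) {
        return false;
      }
      unrecovered = false;
      return true;
    }
  }

  final Map<Long, RegionInfo> regionMap = new HashMap<>();
  final AtomicInteger unrecoveredRegionCount = new AtomicInteger();

  // close() as the report describes it: flag the entry and bump the counter,
  // then drop the flagged entry from the map.
  void close(long regionId) {
    RegionInfo info = regionMap.computeIfAbsent(regionId, id -> new RegionInfo());
    if (info.testAndSetUnrecovered()) {
      unrecoveredRegionCount.incrementAndGet();
    }
    regionMap.remove(regionId);
  }

  // Recovery rebuilds the entry from the oplog files; the new entry starts
  // with unrecovered == false.
  void recoverEntry(long regionId) {
    regionMap.computeIfAbsent(regionId, id -> new RegionInfo());
  }

  // Same shape as checkForRecoverableRegion() in the report.
  void checkForRecoverableRegion(long regionId) {
    if (unrecoveredRegionCount.get() > 0) {
      RegionInfo info = regionMap.get(regionId);
      if (info != null && info.testAndSetRecovered()) {
        unrecoveredRegionCount.decrementAndGet();
      }
    }
  }

  // Assumption: compaction is skipped while any region is counted as unrecovered.
  boolean eligibleForCompaction() {
    return unrecoveredRegionCount.get() == 0;
  }

  public static void main(String[] args) {
    OplogModel oplog = new OplogModel();
    oplog.recoverEntry(1L);              // region initially present
    oplog.close(1L);                     // counter becomes 1, flagged entry is dropped
    oplog.recoverEntry(1L);              // fresh entry, flag is false
    oplog.checkForRecoverableRegion(1L); // testAndSetRecovered() returns false
    System.out.println("unrecoveredRegionCount = " + oplog.unrecoveredRegionCount.get());
    System.out.println("eligibleForCompaction = " + oplog.eligibleForCompaction());
  }
}
```

In this model the counter stays at 1 after the region is fully recovered, because the only entry that carried the unrecovered flag was discarded by close(). The model never becomes eligible for compaction, which mirrors the behaviour reported above.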


