Posted to issues@geode.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/10/10 20:04:00 UTC

[jira] [Commented] (GEODE-10401) Oplog recovery takes too long due to fault in fastutil library

    [ https://issues.apache.org/jira/browse/GEODE-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17615291#comment-17615291 ] 

ASF subversion and git services commented on GEODE-10401:
---------------------------------------------------------

Commit 7cae66172f950c779625c241553edcaecd424d62 in geode-native's branch refs/heads/support/1.15 from Owen Nichols
[ https://gitbox.apache.org/repos/asf?p=geode-native.git;h=7cae66172 ]

GEODE-10401: Update Dockerfile and vars

The native client hardcodes the Geode version to test with in several places.
Update native Dockerfile and other variables to apache-geode 1.15.1


> Oplog recovery takes too long due to fault in fastutil library
> --------------------------------------------------------------
>
>                 Key: GEODE-10401
>                 URL: https://issues.apache.org/jira/browse/GEODE-10401
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.1, 1.16.0
>
>
> As we already know, delete operations in a .drf file contain only the OplogEntryID. During recovery, the server reads each OplogEntryID (byte by byte) and stores it in a hash set, to be consulted later when recovering the .crf files. Two set types are used: IntOpenHashSet and LongOpenHashSet. An OplogEntryID that fits in an _integer_ is stored in an IntOpenHashSet, and a _long integer_ in a LongOpenHashSet, presumably for memory and performance reasons. The OplogEntryID starts at zero and increments over time.
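> A minimal sketch of that selection logic (class and method names here are hypothetical; the actual Geode recovery code is structured differently):
> {code:java}
> import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
> import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
>
> // Deleted-entry IDs recovered from a .drf are kept in the smallest set type
> // that can hold them: 4-byte ints while the ID fits, 8-byte longs afterwards.
> class OplogEntryIdSetSketch {
>   private final IntOpenHashSet intIds = new IntOpenHashSet();
>   private final LongOpenHashSet longIds = new LongOpenHashSet();
>
>   void add(long oplogEntryId) {
>     if (oplogEntryId <= Integer.MAX_VALUE) {
>       intIds.add((int) oplogEntryId); // IDs start at zero, so early IDs all land here
>     } else {
>       longIds.add(oplogEntryId);
>     }
>   }
>
>   boolean contains(long oplogEntryId) {
>     return oplogEntryId <= Integer.MAX_VALUE
>         ? intIds.contains((int) oplogEntryId)
>         : longIds.contains(oplogEntryId);
>   }
> }
> {code}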
> We have observed in the logs that more than 4 minutes (sometimes considerably more) pass between the warning "There is a large number of deleted entries" and the log message preceding it.
> {code:java}
> {"timestamp":"2022-06-14T21:41:43.772+08:00","severity":"info","message":"Recovering oplog#271 /opt/dbservice/data/datastore/BACKUPdataDiskStore_271.drf for disk store dataDiskStore.","metadata":
> {"timestamp":"2022-06-14T21:46:02.152+08:00","severity":"warning","message":"There is a large number of deleted entries within the disk-store, please execute an offline
> compaction.","metadata":
> {code}
> When the above warning occurs, it means that the limit of _805306401_ entries in the IntOpenHashSet has been reached. In that case, the server rolls over to a new IntOpenHashSet, where the warning and the delay can occur again.
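> A sketch of that rollover, assuming the recovery code keeps a list of sets and opens a new one at the limit (names hypothetical; simplified from the behaviour described above):
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import it.unimi.dsi.fastutil.ints.IntOpenHashSet;
>
> class RollingIdSetSketch {
>   // Largest entry count a single IntOpenHashSet holds, per this report.
>   private static final int MAX_SET_SIZE = 805306401;
>   private final List<IntOpenHashSet> sets =
>       new ArrayList<>(List.of(new IntOpenHashSet()));
>
>   void recordDeletedEntry(int oplogEntryId) {
>     IntOpenHashSet current = sets.get(sets.size() - 1);
>     if (current.size() >= MAX_SET_SIZE) {
>       System.err.println("There is a large number of deleted entries within the"
>           + " disk-store, please execute an offline compaction.");
>       current = new IntOpenHashSet();
>       sets.add(current); // roll to a fresh set; the slow rehashing can repeat here
>     }
>     current.add(oplogEntryId);
>   }
> }
> {code}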
> The problem is that, due to a fault in the fastutil dependency (affecting both IntOpenHashSet and LongOpenHashSet), unnecessary rehashing happens many times before the maximum size is reached: a rehash is triggered for every new entry from 805306368 onward, up to the maximum size. This rehashing adds several minutes to .drf oplog recovery but accomplishes nothing, since the set is already at its maximum size.
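> The numbers line up with fastutil's sizing arithmetic: the backing array of an open hash set tops out at 2^30 buckets, and with the default load factor of 0.75 the rehash threshold is 0.75 * 2^30 = 805306368. A sketch of that arithmetic (the fastutil constants are standard; the 805306401 limit is the figure reported above):
> {code:java}
> int maxArraySize = 1 << 30; // fastutil's largest backing array: 1073741824 buckets
> double loadFactor = 0.75;   // fastutil's default load factor
>
> // Threshold past which every add() attempts a (futile) rehash.
> long rehashThreshold = (long) Math.ceil(maxArraySize * loadFactor); // 805306368
>
> // Every entry between the threshold and the hard limit pays for a full
> // pass over the 2^30-bucket array, which cannot grow any further.
> long futileRehashes = 805306401L - rehashThreshold; // 33 rehashes of ~1e9 buckets
> {code}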



--
This message was sent by Atlassian Jira
(v8.20.10#820010)