You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@geode.apache.org by GitBox <gi...@apache.org> on 2020/10/29 22:10:50 UTC

[GitHub] [geode] jchen21 opened a new pull request #5689: [DO NOT REVIEW] GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

jchen21 opened a new pull request #5689:
URL: https://github.com/apache/geode/pull/5689


   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Please ensure that once the PR is submitted, check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-730682983


   > add unit test coverage of needsCompaction to OplogTest.
   > 
   > It seems like the only change you made was to say we don't need to compact if an Oplog has a totalCount == 0. I think totalCount == 0 means that the oplog is empty. This seems like something we should compact. Do you understand why offline compaction gave us an empty oplog? I'm concerned this change might cause us not to remove empty oplogs. Or maybe we say it has no need to compact but still do a check and remove it for being empty somewhere else?
   
   I have added a unit test in `OplogTest`. 
   
   `totalCount` == 0 does not necessarily mean the oplog is empty. You remind me of the case of an empty oplog. There is another field `totalLiveCount` to check. What I see in the persistent recovery immediately after an offline compaction is, the offline compacted oplog is non-empty. During the recovery, `totalCount` is `0`. However, `totalLiveCount` has the actual number of live entries in the oplog. Therefore I add one more condition to check `totalLiveCount`. 
   
   If `totalCount` == 0 and `totalLiveCount` > 0, such as the case of recovery after offline compaction, we don't need compaction again. Are there any other cases that need compaction, even though `totalCount` == 0 and `totalLiveCount` > 0?
   
   If `totalCount` == 0 and `totalLiveCount` <= 0, then the oplog is empty, which needs compaction. Please correct me if I am wrong.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] upthewaterspout commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
upthewaterspout commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-723359120


   I agree with Darrel.  Maybe the problem this test is finding is that the compaction is leaving behind an empty oplog (perhaps it generates an empty oplog to write to when it starts or something)?
   
   If that's the case though - it's not much of a bug, especially since it gets cleaned up on startup the next time. 
   
   If I remember what you were showing me earlier, I thought the compaction on startup after the offline compact was taking a long time. I wouldn't expect that to be the case if all it's doing is removing empty oplogs?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 commented on a change in pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 commented on a change in pull request #5689:
URL: https://github.com/apache/geode/pull/5689#discussion_r535734226



##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -5789,9 +5789,8 @@ boolean needsCompaction() {
       if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
         return true;
       }
-    } else {

Review comment:
       I have changed `readKrf()` function. So that at the end of this function, when `totalCount` and `totalLiveCount` is updated after reading krf file, it will check if `totalCount` is `0` and `totalLiveCount` is greater than `0`, and update `totalCount` accordingly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] gesterzhou commented on a change in pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
gesterzhou commented on a change in pull request #5689:
URL: https://github.com/apache/geode/pull/5689#discussion_r519074167



##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -5789,9 +5789,8 @@ boolean needsCompaction() {
       if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
         return true;
       }
-    } else {

Review comment:
       This fix is incorrect. It only caused the empty oplog (created by the offline compaction) is not deleted by the recovery's auto compaction. 
   
   I modified the dunit test a little bit and proved it. 
   
   Unfortunately, there's no flag or status to show that a compaction is just finished. Even we want to use the empty oplog created by offline compaction, we need to careful arrange the logic. Such as:
   
   change  needsCompaction to return int. 1: need to compact, -1: no need to compact, 0: there's one empty file,  only need to delete it.
   
   But maybe there's better ideas. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] gesterzhou commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
gesterzhou commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-739465537


   I read your new fix and have different opinion. The root cause is the new oplog (i.e. #2) created during offline compaction wrote a krf with totalCount==0. 
   
   You fix is to still read the krf and when found it's totalCount is 0, change it to be totalLiveCount. This is just a patch fix. The real issue is: we should NOT create such a krf with totalCount==0. Actually, when offline compaction, it is not allowed to create krf. 
   
   Here is the better fix for this issue:
   
   diff --git a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   index f6daa3a1d6..0a31af9386 100755
   --- a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   +++ b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   @@ -582,7 +582,7 @@ public class Oplog implements CompactableOplog, Flushable {
          createDrf(null);
          createCrf(null);
          // open krf for offline compaction
   -      if (getParent().isOfflineCompacting()) {
   +      if (getParent().isOfflineCompacting() && couldHaveKrf()) {
            krfFileCreate();
          }
        } catch (Exception ex) {
   @@ -3980,7 +3980,7 @@ public class Oplog implements CompactableOplog, Flushable {
        } catch (IOException ignore) {
        }
    
   -    if (this.krf.f.exists()) {
   +    if (this.krf.f != null && this.krf.f.exists()) {
          this.krf.f.delete();
        }
      }
   @@ -5789,7 +5789,7 @@ public class Oplog implements CompactableOplog, Flushable {
          if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
            return true;
          }
   -    } else if (this.totalLiveCount.get() <= 0) {
   +    } else {
          return true;
        }
   
   
   I don't know who introduced "if (getParent().isOfflineCompacting()" here. To be conservative, I still keep this check. Maybe we can remove the 3 lines. But I want to be conservative for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] gesterzhou commented on a change in pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
gesterzhou commented on a change in pull request #5689:
URL: https://github.com/apache/geode/pull/5689#discussion_r533822731



##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -5789,9 +5789,8 @@ boolean needsCompaction() {
       if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
         return true;
       }
-    } else {

Review comment:
       Darrel and I discussed again. We might need to re-fix it the other way: 
   
   Since the trouble is caused by offline compaction wrote totalCount=0 into krf, there's a potential bug that we restart with totalCount=0, but totalLiveCount=1, then with your current code change this oplog will never get a chance to compact until the only live entry is destroyed. 
   
   So a better fix maybe: when restart and found totalCount=0, then update totalCount to be totalLiveCount. With this fix, you don't need to change needsCompaction() any more. 
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-731472883


   When compacting oplogs offline, it creates a new krf and persists `totalCount` with `0` in Oplog constructor. Then live entries are copied from old oplogs to the new oplog. `totalCount` of the new oplog is updated during this process. However, the updated `totalCount` is not persisted. Its persisted value is still `0`, although the oplog is non-empty after compaction. 
   
   During recovery, the value of `totalCount` is recovered from krf as `0`.  `totalLiveCount` is incremented as the oplog is recovered. `totalCount` remains `0`. Therefore, if `totalCount` == 0 and `totalLiveCount` > 0, no compaction is needed. Since this condition shows that offline compaction is just done before recovery.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] gesterzhou commented on a change in pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
gesterzhou commented on a change in pull request #5689:
URL: https://github.com/apache/geode/pull/5689#discussion_r533588541



##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -5789,9 +5789,8 @@ boolean needsCompaction() {
       if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
         return true;
       }
-    } else {

Review comment:
       The new fix is good. 
   
   Should use "} else if (hasNoLiveValues()) {" instead. 
   
   And it's better to add a few comments here. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-744970105


   @dschneider-pivotal So far, I still consider the current fix is the best option we can have. We need a convenient and reliable way to tell that the Oplog is offline compacted. The persisted `0` value of `totalCount` in krf serves the purpose. And during persistence recovery, if `totalCount` is `0`, then it is updated with the value of `totalLiveCount`, which is the most accurate value of the number of live entries in the Oplog.
   
   I have been thinking about whether we should use the local variable `krfEntryCount`, instead of `totalLiveCount`. At the end of `readKrf()`, the value of `totalLiveCount` and `krfEntryCount` are the same. Because at the beginning of `readKrf()`, `totalLiveCount` and `krfEntryCount` are both 0. And in the `while` loop inside `readKrf()`, both variables are incremented together: 
   
   `totalLiveCount` is incremented here: https://github.com/apache/geode/blob/develop/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java#L1765
   
   `krfEntryCount` is incremented here: https://github.com/apache/geode/blob/develop/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java#L1768
   
   I'd prefer using `totalLiveCount` instead of `krfEntryCount` which is a local variable.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] gesterzhou edited a comment on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
gesterzhou edited a comment on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-739465537


   ```
   I read your new fix and have different opinion. The root cause is the new oplog (i.e. #2) created during offline compaction wrote a krf with totalCount==0. 
   
   You fix is to still read the krf and when found it's totalCount is 0, change it to be totalLiveCount. This is just a patch fix. The real issue is: we should NOT create such a krf with totalCount==0. Actually, when offline compaction, it is not allowed to create krf. 
   
   Here is the better fix for this issue:
   
   diff --git a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   index f6daa3a1d6..0a31af9386 100755
   --- a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   +++ b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   @@ -582,7 +582,7 @@ public class Oplog implements CompactableOplog, Flushable {
          createDrf(null);
          createCrf(null);
          // open krf for offline compaction
   -      if (getParent().isOfflineCompacting()) {
   +      if (getParent().isOfflineCompacting() && couldHaveKrf()) {
            krfFileCreate();
          }
        } catch (Exception ex) {
   @@ -3980,7 +3980,7 @@ public class Oplog implements CompactableOplog, Flushable {
        } catch (IOException ignore) {
        }
    
   -    if (this.krf.f.exists()) {
   +    if (this.krf.f != null && this.krf.f.exists()) {
          this.krf.f.delete();
        }
      }
   @@ -5789,7 +5789,7 @@ public class Oplog implements CompactableOplog, Flushable {
          if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
            return true;
          }
   -    } else if (this.totalLiveCount.get() <= 0) {
   +    } else {
          return true;
        }
   
   
   I don't know who introduced "if (getParent().isOfflineCompacting()" here. To be conservative, I still keep this check. Maybe we can remove the 3 lines. But I want to be conservative for now.
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 merged pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 merged pull request #5689:
URL: https://github.com/apache/geode/pull/5689


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [geode] jchen21 commented on pull request #5689: GEODE-8667: Duplicate online Oplog compaction after offline Oplog compaction

Posted by GitBox <gi...@apache.org>.
jchen21 commented on pull request #5689:
URL: https://github.com/apache/geode/pull/5689#issuecomment-740265251


   > ```
   > I read your new fix and have different opinion. The root cause is the new oplog (i.e. #2) created during offline compaction wrote a krf with totalCount==0. 
   > 
   > You fix is to still read the krf and when found it's totalCount is 0, change it to be totalLiveCount. This is just a patch fix. The real issue is: we should NOT create such a krf with totalCount==0. Actually, when offline compaction, it is not allowed to create krf. 
   > 
   > Here is the better fix for this issue:
   > 
   > diff --git a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   > index f6daa3a1d6..0a31af9386 100755
   > --- a/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   > +++ b/geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
   > @@ -582,7 +582,7 @@ public class Oplog implements CompactableOplog, Flushable {
   >        createDrf(null);
   >        createCrf(null);
   >        // open krf for offline compaction
   > -      if (getParent().isOfflineCompacting()) {
   > +      if (getParent().isOfflineCompacting() && couldHaveKrf()) {
   >          krfFileCreate();
   >        }
   >      } catch (Exception ex) {
   > @@ -3980,7 +3980,7 @@ public class Oplog implements CompactableOplog, Flushable {
   >      } catch (IOException ignore) {
   >      }
   >  
   > -    if (this.krf.f.exists()) {
   > +    if (this.krf.f != null && this.krf.f.exists()) {
   >        this.krf.f.delete();
   >      }
   >    }
   > @@ -5789,7 +5789,7 @@ public class Oplog implements CompactableOplog, Flushable {
   >        if (((rv / (double) rvHWMtmp) * 100) <= parent.getCompactionThreshold()) {
   >          return true;
   >        }
   > -    } else if (this.totalLiveCount.get() <= 0) {
   > +    } else {
   >        return true;
   >      }
   > 
   > 
   > I don't know who introduced "if (getParent().isOfflineCompacting()" here. To be conservative, I still keep this check. Maybe we can remove the 3 lines. But I want to be conservative for now.
   > ```
   
   I don't think we should avoid creating krf during offline compaction. Your proposed fix avoid creating the krf file during offline compaction. After the compaction, the original non-compacted krf is also deleted. Without a krf, there is a significant impact on the performance of persistent recovery. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org