You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ratis.apache.org by GitBox <gi...@apache.org> on 2020/08/07 10:27:17 UTC

[GitHub] [incubator-ratis] lokeshj1703 opened a new pull request #170: RATIS-1025. Ratis logs may not be purged completely

lokeshj1703 opened a new pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170


   ## What changes were proposed in this pull request?
   
   Ozone Manager Ratis server tries to purge logs up to the snapshotIndex after a snapshot is taken. But it only purges the logs which have been cached in memory. This could lead to older logs not being purged and consuming disk space.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/RATIS-1025
   
   ## How was this patch tested?
   
   Existing UT


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-682353467


   @hanishakoneru Thanks for review! I have added a commit which fixes the logic for loading log segments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] amaliujia edited a comment on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
amaliujia edited a comment on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-670644698


   Newbie question:
   
   With/without this change, UT still work means that the complete purge logic is not properly tested, right? Can you point me to which UT is covering purge logic? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 closed pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 closed pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-688660817


   @hanishakoneru Thanks for review! I have committed the PR to master branch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] hanishakoneru commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-671459251


   @lokeshj1703, if I understand correctly, this fix will purge the logs instead of evicting them from cache.
   
   This might lead to more frequent install snapshots requests if a follower is lagging behind a few log segments. Currently in Ozone, the maximum number of cached segments is set to 2 in both Datanodes and Ozone. 
   
   1. Instead of purging logs when they need to be evicted from cache, can the purge implementation be fixed to also consider the logs  on disk and not in cache. With this approach, we would have  to take care of tracking on disk logs for other cases as well (for example {{RaftLog#getStartIndex}}).
   2. Another option would be to align the cache eviction logic with the log purge logic but this might not be optimal if the purge limit is set very high. 
   
   I think the first approach would be the clean solution. What are your thoughts?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] hanishakoneru commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-685008260


   @lokeshj1703 are the CI failures related?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-688220271


   @hanishakoneru The failure does not seem to be related. Retriggered CI.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] hanishakoneru commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-683131449


   Thanks @lokeshj1703.
   LGTM. +1 pending CI.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] hanishakoneru commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-681168526


   >   I think we can avoid directory scan during purge with this precondition.
   
   Agreed. Its best if directory scan can be avoided.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] amaliujia commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
amaliujia commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-670644698


   Newbie question:
   
   With/without this change, UT still work means that the complete purge logically is not properly tested, right? Can you point me to which UT is covering purge logic? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] hanishakoneru commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
hanishakoneru commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-673673687


   Consider the following scenario.
   Snapshot is taken every 25 transactions. PurgeUptoSnapshotIndex is set to false. This implies when snapshot is taken, only logs up to min commit index on all nodes will be purged.
   Now let's say we have log segments _log_11_20_, _log_21_30_ and _log_31_40_ on disk. Snapshot was taken at index 25 but one of the nodes had commit index 12. So only _log_0_10_ is purged. When this node is restarted, it will not add _log_11_20_ to its closedSegments because of the following logic in _SegmentedRaftLogCache_. When purge is called next, _log_11_20_ will not be purged as it is not in the cached closedSegments list. 
   
   ```
   void loadSegment(LogPathAndIndex pi, boolean keepEntryInCache,
         Consumer<LogEntryProto> logConsumer, long lastIndexInSnapshot) throws IOException {
       LogSegment logSegment = LogSegment.loadSegment(storage, pi.getPath().toFile(),
           pi.getStartIndex(), pi.getEndIndex(), pi.isOpen(), keepEntryInCache, logConsumer, raftLogMetrics);
       if (logSegment != null && logSegment.getEndIndex() > lastIndexInSnapshot) {
         addSegment(logSegment);
       }
     }
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] runzhiwang commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
runzhiwang commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-670494227


   +1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-671261126


   > With/without this change, UT still work means that the complete purge logic is not properly tested, right? Can you point me to which UT is covering purge logic?
   
   @amaliujia Please check TestSegmentedRaftLog. Purge currently considers cached segments for purging and currently in code we are evicting the cache. This leads to scenarios where segments are never purged.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-673994822


   > When purge is called next, log_11_20 will not be purged as it is not in the cached closedSegments list.
   
   Good point! I think this can be fixed by the following change. 
   ```
   void loadSegment(LogPathAndIndex pi, boolean keepEntryInCache,
         Consumer<LogEntryProto> logConsumer, long lastIndexInSnapshot) throws IOException {
       LogSegment logSegment = LogSegment.loadSegment(storage, pi.getPath().toFile(),
           pi.getStartIndex(), pi.getEndIndex(), pi.isOpen(), keepEntryInCache, logConsumer, raftLogMetrics);
       if (logSegment != null) {
         addSegment(logSegment);
       }
     }
   ```
   The point is if we have a list of all log segments during startup the purge works correctly. The log segment references should be light-weight if the entry cache is evicted from the log segment. I think we can avoid directory scan during purge with this precondition.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ratis] lokeshj1703 commented on pull request #170: RATIS-1025. Ratis logs may not be purged completely

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on pull request #170:
URL: https://github.com/apache/incubator-ratis/pull/170#issuecomment-672642449


   > @lokeshj1703, if I understand correctly, this fix will purge the logs instead of evicting them from cache.
   > 
   > This might lead to more frequent install snapshots requests if a follower is lagging behind a few log segments. Currently in Ozone, the maximum number of cached segments is set to 2 in both Datanodes and Ozone.
   > 
   
   I think you are referring to the change in TestSegmentedRaftLog#syncWithSnapshot. So this function is called in follower after install snapshot or notify install snapshot. Since the snapshot has already been installed in the follower I think it is ok to purge the logs. Leader would have the snapshot too. 
   Currently purge is called in StateMachineUpdater#takeSnapshot. I think follower where snapshot is getting installed would not call takeSnapshot until later and thus may not be able to purge logs.
   
   > 1. Instead of purging logs when they need to be evicted from cache, can the purge implementation be fixed to also consider the logs  on disk and not in cache. With this approach, we would have  to take care of tracking on disk logs for other cases as well (for example {{RaftLog#getStartIndex}}).
   
   The change will not purge logs when they need to be evicted. If you look at SegmentedRaftLogCache#evictCache it only evicts the entry cache. It still tracks the log segment in the SegmentedRaftLogCache#closedSegments list. Log segments are removed from closedSegments during purge or truncate.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org