Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2018/12/10 10:53:06 UTC

[GitHub] clintropolis commented on a change in pull request #6699: release mmap immediately after merge indexes

clintropolis commented on a change in pull request #6699: release mmap immediately after merge indexes
URL: https://github.com/apache/incubator-druid/pull/6699#discussion_r240164301
 
 

 ##########
 File path: server/src/main/java/org/apache/druid/segment/realtime/appenderator/AppenderatorImpl.java
 ##########
 @@ -716,8 +717,20 @@ private DataSegment mergeAndPush(final SegmentIdentifier identifier, final Sink
           closer.register(segmentAndCloseable.rhs);
         }
 
-        mergedFile = indexMerger.mergeQueryableIndex(
-            indexes,
 
 Review comment:
   Ah, I think I am maybe starting to understand what is going on. I was not aware of how the YARN oom killer functions - it turns out it behaves in a manner I find very unintuitive, where ALL physical memory associated with the process is counted by this killer, _including_ the physical memory occupied by any page cache from our mapped intermediary files that Linux memory accounting has attributed to the process. This is counter to the behavior of the oom killer in the Linux kernel or via cgroups, where reclaimable page cache would be freed by the kernel when memory gets tight before the oom killer is invoked. 
   
   See this discussion for details: https://issues.apache.org/jira/browse/HDFS-5957?focusedCommentId=13907875&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13907875 which spawned this YARN issue https://issues.apache.org/jira/browse/YARN-1747. Is this maybe the issue you are running into? [This solution](https://jira.apache.org/jira/browse/YARN-1775) looks like it added a setting, has been around for a while, and might be a possible fix for your problem. If you're running a newer version of YARN, is it maybe possible to use the cgroups solution to limit your YARN processes? I believe that cgroup memory limits will correctly steal from page cache when a process reaches its memory limits. If that isn't possible, maybe you could disable the physical memory check for containers, or increase the limits on the container size?
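   To see what I mean about the kernel accounting page cache separately from anonymous memory, here is a rough sketch (untested; it assumes a Linux box with the cgroup v1 memory controller mounted at the usual path, and the exact cgroup directory for a YARN container depends on how the hierarchy is set up):

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class CgroupMemoryStat
{
  public static void main(String[] args) throws Exception
  {
    // root of the cgroup v1 memory controller; a YARN container would live in a
    // sub-directory of this hierarchy, and cgroup v2 lays things out differently
    String statFile = args.length > 0 ? args[0] : "/sys/fs/cgroup/memory/memory.stat";
    for (String line : Files.readAllLines(Paths.get(statFile))) {
      // "cache" is page cache charged to the cgroup, "rss" is anonymous memory;
      // the kernel can reclaim the former when the limit is hit, which is why a
      // cgroup limit behaves more forgivingly than the YARN physical memory check
      if (line.startsWith("cache ")
          || line.startsWith("rss ")
          || line.startsWith("hierarchical_memory_limit ")) {
        System.out.println(line);
      }
    }
  }
}
```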
   
   I think the behavior of the YARN physical memory check is sort of nonsensical, at least for the nature of Druid realtime tasks, but maybe you are stuck with it being on for whatever circumstance, so back to this PR.
   
   I naively expected mmaps of the same file from the same process to be counted as the same memory usage (I will look into this further), but if they are not (as the results of your testing seem to indicate), and accounting is done per call to mmap, then - knowing that a call to mmap a file doesn't actually take physical space until you read data from the file - I could see how this could maybe temporarily trick YARN into not killing your process. This would probably be more likely to hold true if, additionally, either not all columns of all on-disk segments had been queried yet (likely), or the data was primarily still on heap in incremental indexes and was flushed to disk to do the merge. 
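   For my own sanity I sketched roughly what I mean (completely untested, Linux-only since it reads `/proc/self/status`, the file path argument and the 4KB stride are just placeholder assumptions) - mapping a file barely moves RSS, but paging the data in does:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MmapRssDemo
{
  public static void main(String[] args) throws Exception
  {
    System.out.println("before map: " + vmRss());
    try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
         FileChannel channel = raf.getChannel()) {
      // a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes
      int length = (int) Math.min(channel.size(), Integer.MAX_VALUE);
      MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, length);
      // the mapping alone reserves virtual address space, not physical pages
      System.out.println("after map:  " + vmRss());
      long sum = 0;
      for (int i = 0; i < length; i += 4096) {
        sum += buf.get(i); // touching each page faults it into the page cache...
      }
      // ...and those resident file-backed pages now count toward this process's RSS,
      // which appears to be what YARN's physical memory check is measuring
      System.out.println("after read: " + vmRss() + " (checksum " + sum + ")");
    }
  }

  // Linux-specific: report the VmRSS line from /proc/self/status
  private static String vmRss() throws Exception
  {
    return Files.readAllLines(Paths.get("/proc/self/status"))
                .stream()
                .filter(line -> line.startsWith("VmRSS"))
                .findFirst()
                .orElse("VmRSS: unknown");
  }
}
```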
   
   However, to clarify something, I think the mmap usage should not be increasing _after_ the merge, but rather _as a result of the merge_, where all columns of all persisted files from the hydrants are fully read to construct the new segment. Presumably there is free memory available while the merge is happening, which the kernel will allow to be used as page cache for the mapped files since it doesn't know or care about the YARN limits, and that is why this is getting triggered.
   
   I would also think this means that if queries were to occur after you close the merge-only queryable indexes, and those queries ended up covering all columns of all of the segments of the queryable indexes that you _didn't_ close, then you would trigger the same oom killer condition with YARN (assuming there was memory available to be used for page cache), so if anything this strategy is a trick to avoid the inevitable if the data is fully queried. I also suspect that depending on the timing of this YARN check, or if the merge were slightly larger or took longer for any other reason, the oom killer could still kill a process during merge that would have otherwise been fine, were it not being run with YARN. 
   
   I'm not certain what the best way to proceed here is if you are unable to adjust settings, since I think this fix is for a YARN-specific problem but isn't guaranteed to work in all cases if what I think is going on is correct. That is because it doesn't actually do anything to reduce the peak memory usage that this realtime task could reach if that data were to be fully queried and the OS itself had free space beyond the YARN limits. 
   
   Slow handoff is a bad place for any realtime indexing task to be in, but I can see how this is definitely worse for YARN if it's counting this page cache usage against container limits. The realtime indexing tuning config `handoffConditionTimeout` is designed to help mitigate this in the normal case, by allowing a realtime indexer to just drop the segments after a period if handoff doesn't happen, but I don't think that would help you in this case. It also has the downside that it can leave data that has been indexed unavailable for query if handoff has completely stalled, and I believe it's more oriented toward ensuring that tasks do not run forever. 
   
   Your proposed alternative fix is potentially interesting, and if the merged segment were smaller than the intermediary files, then it could indeed have a smaller footprint, though it would definitely be a more complicated fix to ensure everything happens correctly. Ultimately though, in your case I would expect the merged segment would have to be dramatically smaller to avoid the peak usage problem, since if all columns of the merged segment were queried and the OS had the space to cache them, then YARN would probably still kill the process.
   
   Does this sound correct? (Apologies for such a long response!)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org