Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/06/18 19:59:48 UTC

[GitHub] [incubator-druid] clintropolis edited a comment on issue #7919: disable all compression in intermediate segment persists while ingestion

URL: https://github.com/apache/incubator-druid/pull/7919#issuecomment-503283496
 
 
   Hmm, I _really_ like the idea of being able to separately control this stuff for the intermediary segments, so mega +1 on that. However, I'm not sure how I feel about straight `UNCOMPRESSED` being the default behavior (if I understand this PR correctly). I think we should consider whether this is the best way to avoid unbounded usage of the 64k processing buffers used for decompression, and maybe we should measure some things first? My fear is that this default could change the dynamics of realtime indexing tasks quite a lot, namely how much impact they have on the page cache, potentially exacerbating issues like those described in #6699 (though I suspect running realtime tasks via YARN is rare-ish). There's a rough sketch of the buffer math below.
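   Here's a minimal back-of-the-envelope sketch (not Druid code, and the segment/column counts are made up) of why merge-time decompression buffer usage is effectively unbounded: every compressed column of every intermediary segment holds its own ~64k buffer while the final merge is running.
   
   ```java
   // Rough estimate of buffer usage during the final merge: each compressed
   // column of each intermediary segment needs a ~64k decompression buffer,
   // and the merge opens all of them at once.
   public class MergeBufferEstimate
   {
     static final long BUFFER_BYTES = 64 * 1024;
   
     static long estimateBytes(int intermediarySegments, int columnsPerSegment)
     {
       return (long) intermediarySegments * columnsPerSegment * BUFFER_BYTES;
     }
   
     public static void main(String[] args)
     {
       // e.g. 500 persisted spills of a 100-column datasource works out to
       // over 3GB of decompression buffers for a single merge.
       System.out.println(estimateBytes(500, 100) / (1024 * 1024) + " MB");
     }
   }
   ```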
   
   Experimentation I did related to #6016 showed an often _very dramatic_ size difference between compressed and uncompressed data, particularly for int- and long-typed columns, which I don't think can be ignored. Even the size difference between the `CompressedVSizeByte` and `VSizeByte` versions of int columns could be very large. I will see if I can dig up the measurements where I collected uncompressed and/or vsize byte-packed sizes and share them here; in the meantime the sketch below shows why the sizes diverge.
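   To illustrate (this is the general byte-packing idea, not the actual `VSizeByte` implementation): packing each int into only as many bytes as the column's max value needs already shrinks a low-cardinality dimension a lot, before any block compression is applied on top.
   
   ```java
   // Sketch of vsize byte packing: store each int in just enough bytes
   // for the column's max value instead of a fixed 4 bytes.
   public class VSizeSketch
   {
     static int bytesNeeded(int maxValue)
     {
       if (maxValue <= 0xFF) return 1;
       if (maxValue <= 0xFFFF) return 2;
       if (maxValue <= 0xFFFFFF) return 3;
       return 4;
     }
   
     public static void main(String[] args)
     {
       int rows = 10_000_000;
       int dictionarySize = 200; // a low-cardinality dimension
       long packed = (long) rows * bytesNeeded(dictionarySize - 1);
       long fixed = (long) rows * Integer.BYTES;
       // packed=10MB vs fixed=40MB, before LZ4 etc. shrinks either further
       System.out.printf("packed=%dMB fixed=%dMB%n", packed / 1_000_000, fixed / 1_000_000);
     }
   }
   ```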
   
   Some other ideas have already been suggested that would help mitigate merging issues like this, I think. #5526 proposes adjustments to the merging algorithm, such as producing the dictionary out of band and cutting down on duplicated operations, which I would expect to reduce overall memory usage (a sketch of the out-of-band dictionary idea follows).
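   For anyone unfamiliar with the out-of-band dictionary idea, here's an illustrative sketch (my reading of it, not the actual proposal in #5526): k-way merge the sorted per-segment value dictionaries into one global dictionary up front, rather than re-resolving values while merging rows.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.PriorityQueue;
   
   // K-way merge of sorted per-segment dictionaries into one sorted,
   // de-duplicated global dictionary, built before any rows are merged.
   public class DictionaryMergeSketch
   {
     static List<String> mergeDictionaries(List<List<String>> sortedDicts)
     {
       // each heap entry is a {dictIndex, offset} cursor, ordered by its current value
       PriorityQueue<int[]> heap = new PriorityQueue<>(
           (a, b) -> sortedDicts.get(a[0]).get(a[1]).compareTo(sortedDicts.get(b[0]).get(b[1]))
       );
       for (int i = 0; i < sortedDicts.size(); i++) {
         if (!sortedDicts.get(i).isEmpty()) {
           heap.add(new int[]{i, 0});
         }
       }
       List<String> merged = new ArrayList<>();
       while (!heap.isEmpty()) {
         int[] cursor = heap.poll();
         String value = sortedDicts.get(cursor[0]).get(cursor[1]);
         if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(value)) {
           merged.add(value); // skip values shared across segments
         }
         if (++cursor[1] < sortedDicts.get(cursor[0]).size()) {
           heap.add(cursor); // advance this segment's cursor
         }
       }
       return merged;
     }
   }
   ```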
   
   #7900 additionally suggests another thing we could do that would specifically help with the unbounded decompression buffers on merge, in the form of a sort of hierarchical merge. Since we can count the intermediary segments and the columns in each, we could calculate how much buffer space is required and divide the merge work up as necessary to keep total usage at a reasonable size (rough sketch below). If I understand correctly, it also suggests some reworking of the merge algorithm similar to what is mentioned in #5526.
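   A minimal sketch of that budgeting idea (my own illustration with a made-up 512MB budget, not what #7900 actually implements): cap each merge pass at however many segments fit the buffer budget, then merge the intermediate results together.
   
   ```java
   // Plan hierarchical merges so that segments * columns * 64k per pass
   // stays under a fixed buffer budget.
   public class HierarchicalMergePlan
   {
     static final long BUFFER_BYTES = 64 * 1024;
   
     static int maxSegmentsPerPass(long budgetBytes, int columnsPerSegment)
     {
       // always merge at least 2 at a time so the merge makes progress
       return Math.max(2, (int) (budgetBytes / (columnsPerSegment * BUFFER_BYTES)));
     }
   
     public static void main(String[] args)
     {
       long budget = 512L * 1024 * 1024;
       int columns = 100;
       // 512MB / (100 * 64k) = 81 segments per pass, so 500 spills become
       // ~7 first-pass merges, whose outputs are then merged together.
       System.out.println(maxSegmentsPerPass(budget, columns));
     }
   }
   ```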
   
   The other thing that makes me think this might not be the best _default_ behavior, at least, is that to simplify things for new users, the getting-started and smaller-cluster tuning documentation suggests running co-tenant MiddleManager and Historical processes. If the uncompressed column size differences are noticeable, I suspect this will greatly increase contention for the OS page cache between those processes, especially at merge time, when all columns of all intermediary segments will be paged in (#6699 again; there is some related discussion of this in its comments). This is already a thing that bothers me about that setup, and I think this change could make it a lot worse.
   
   It's also totally possible I'm being overly cautious, but I think we need more data before going with these defaults.
