Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/07/30 22:52:16 UTC

[GitHub] [accumulo] EdColeman commented on issue #1664: Investigate optimal / configurable batch sizes for accumulo-gc deletion candidates

EdColeman commented on issue #1664:
URL: https://github.com/apache/accumulo/issues/1664#issuecomment-666759157


   I may just be adding more words, and Christopher's comment may already have it covered, but in case this helps:
   It would be interesting to know some of the ratios when you run different timing scenarios, such as the number of candidates vs. the number of deletes.
   I think what we are trying to gauge is whether there is a bottleneck in the Accumulo gc process, and whether different batch sizes have any impact on where or when a bottleneck could be triggered. One thing that could help determine that: is there a timing difference between a large number of candidates with few deletes and a large number of candidates with lots of deletes? The first case may provide insight into the Accumulo overhead, while the second could be dominated by hdfs.
   The overall goal is to determine whether there are situations where Accumulo could get into a state where it just cannot keep up with deletes and would fall further and further behind. If that can be shown, then what are the triggering conditions, where are the bottlenecks, and does batch size have any impact?
   Some of this could be simulated, but there also need to be measurements that include hdfs interactions. If it can be shown that the Accumulo gc process is never a bottleneck regardless of batch size, then we know to focus on hdfs interactions for additional performance improvements. If there are times when the gc process dominates the gc cycle, then what is the performance difference with different batch sizes?
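
   As a rough illustration of the kind of simulation meant here, the sketch below is a toy model only, not Accumulo's gc code. Every name and number in it is hypothetical: `batch_overhead_s` stands in for per-batch Accumulo-side cost, `per_delete_s` for per-file hdfs cost, and `delete_ratio` for the fraction of candidates that end up as actual deletes. It tracks whether the candidate backlog grows from cycle to cycle under a fixed time budget.

```python
def simulate_backlog(cycles, arrivals_per_cycle, batch_size,
                     batch_overhead_s, per_delete_s, delete_ratio,
                     cycle_budget_s):
    """Toy model: return the candidate backlog remaining after each gc cycle.

    All parameters are hypothetical knobs for illustration; none of them
    correspond to real Accumulo properties or measured costs.
    """
    backlog = 0
    history = []
    for _ in range(cycles):
        # New deletion candidates show up at the start of each cycle.
        backlog += arrivals_per_cycle
        time_left = cycle_budget_s
        while backlog > 0:
            batch = min(batch_size, backlog)
            # Fixed per-batch overhead plus a per-file cost for real deletes.
            cost = batch_overhead_s + batch * delete_ratio * per_delete_s
            if cost > time_left:
                break  # out of time; the rest carries over to the next cycle
            time_left -= cost
            backlog -= batch
        history.append(backlog)
    return history


# Same load and time budget, two batch sizes.
keeps_up = simulate_backlog(3, 100, 50, 1.0, 0.25, 1.0, 30.0)
falls_behind = simulate_backlog(3, 100, 10, 1.0, 0.25, 1.0, 30.0)
print(keeps_up, falls_behind)
```

   With these made-up numbers the smaller batch size pays the per-batch overhead more often and the backlog grows every cycle, while the larger one keeps up. Setting `delete_ratio` near 0 approximates the "many candidates, few deletes" case, where only the overhead term matters. Real measurements would still be needed to know what the actual costs look like.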


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org