Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2019/12/06 22:23:32 UTC

[GitHub] [accumulo] keith-turner opened a new issue #1451: Support external compactions in containers

keith-turner opened a new issue #1451: Support external compactions in containers
URL: https://github.com/apache/accumulo/issues/1451
 
 
   For use cases like large-scale filtering of data in an Accumulo table, it may be useful to support running compactions outside of a tserver, in a system like Kubernetes. This feature could support the following operations and behaviors.
   
    * A client-side Accumulo API that selects tablets and files to compact and returns a serializable, runnable object for each external compaction. The objects can be serialized and run anywhere that has access to DFS.
    * A lease for every file that has been selected for external compaction. This lease prevents other compactions from processing the files.
    * A client-side API for listing, committing, and canceling outstanding external compactions.
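
To make the lease behavior concrete, here is a minimal in-memory sketch of per-file leases. It is only an illustration under stated assumptions: `FileLeaseTable`, `tryAcquire`, `release`, and `outstanding` are hypothetical names and not part of any Accumulo API, and a real implementation would presumably persist leases in tablet metadata rather than in a map.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of per-file leases; not an Accumulo API. */
class FileLeaseTable {
  // file path -> id of the external compaction that holds the lease
  private final Map<String, String> leaseHolder = new HashMap<>();

  /** Leases every file for compactionId, or none if any file is already leased. */
  synchronized boolean tryAcquire(String compactionId, Collection<String> files) {
    for (String file : files) {
      if (leaseHolder.containsKey(file)) {
        return false; // another compaction is processing this file
      }
    }
    for (String file : files) {
      leaseHolder.put(file, compactionId);
    }
    return true;
  }

  /** Drops all leases held by compactionId, as a commit or cancel would. */
  synchronized List<String> release(String compactionId) {
    List<String> released = new ArrayList<>();
    leaseHolder.entrySet().removeIf(e -> {
      boolean held = e.getValue().equals(compactionId);
      if (held) {
        released.add(e.getKey());
      }
      return held;
    });
    Collections.sort(released); // deterministic order for callers
    return released;
  }

  /** Lists the ids of external compactions that still hold leases. */
  synchronized Set<String> outstanding() {
    return new TreeSet<>(leaseHolder.values());
  }
}
```

Acquiring all leases or none keeps a failed selection from leaving stray leases behind, which is one simple way to let the caller retry with a different set of files.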
   
   If all of the selection decisions for tablets and files to compact are made on the client side, then a user could pass a lambda to Accumulo to make these decisions. This approach would avoid having to put that code on tservers. The Accumulo client code could make RPCs to bring all needed information to the client side. Accumulo could also automatically detect when the underlying tablets or files have changed and call the lambda again.
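
The "call the lambda again" idea can be sketched as a small retry loop. This is a hypothetical shape, not an existing Accumulo API: `ClientSideSelection`, `selectAndReserve`, and the three functional parameters are assumptions made for illustration. `fetchFiles` stands in for the RPCs that bring file information to the client, `selector` is the user's lambda, and `reserve` stands in for an operation that fails when the files changed since the snapshot was taken.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical client-side selection loop; all names are illustrative. */
class ClientSideSelection {
  /**
   * Runs the user's selector on fresh snapshots until the chosen files can be
   * reserved or maxAttempts is reached. Returns the reserved files, or an
   * empty set if nothing was reserved.
   */
  static Set<String> selectAndReserve(
      Supplier<List<String>> fetchFiles,
      Function<List<String>, Set<String>> selector,
      Predicate<Set<String>> reserve,
      int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      List<String> snapshot = fetchFiles.get();      // bring current state to the client
      Set<String> chosen = selector.apply(snapshot); // user-supplied decision
      if (chosen.isEmpty() || reserve.test(chosen)) {
        return chosen;                               // nothing to do, or reserved
      }
      // Reservation failed because things changed; loop and ask the lambda again.
    }
    return Set.of();
  }
}
```

Because the selector only ever sees a snapshot and returns a set, it can stay a pure function, which makes re-invoking it after a change safe.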
   
   External compactions could use early or late binding for the set of files to compact. Early binding is much easier to implement and to run idempotently on a cluster. It has one problem: leases could be held on many small files for a long time, negatively impacting scans. One possible way to avoid this would be to compact all files smaller than size X on the tserver before starting an external compaction, and then select only files larger than size X for the external compaction.
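
The size-X split described above could look roughly like the following sketch. `SizeSplit` and its fields are hypothetical names, and the treatment of a file of exactly size X (sent to the external compaction here) is an assumption, since the text only covers files smaller and larger than X.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical partition of a tablet's files around a size threshold X. */
class SizeSplit {
  final List<String> compactOnTserver = new ArrayList<>();  // small files, compacted first
  final List<String> compactExternally = new ArrayList<>(); // large files, leased for external work

  static SizeSplit split(Map<String, Long> fileSizes, long thresholdBytes) {
    SizeSplit result = new SizeSplit();
    // TreeMap gives a stable, name-sorted iteration order.
    for (Map.Entry<String, Long> e : new TreeMap<>(fileSizes).entrySet()) {
      if (e.getValue() < thresholdBytes) {
        result.compactOnTserver.add(e.getKey());
      } else {
        // Assumption: files of exactly the threshold size go external.
        result.compactExternally.add(e.getKey());
      }
    }
    return result;
  }
}
```

With this split, long-lived leases are held only on the large files, so the number of small files a scan has to read while the external compaction runs stays low.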
