Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2020/02/28 00:56:46 UTC

[GitHub] [accumulo-website] keith-turner opened a new pull request #223: Adds a design document for external compactions

keith-turner opened a new pull request #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223
 
 
   In a conversation with @ctubbsii one day he asked me if a distributed queue
   could be used instead of the approach proposed in #1451.  A few weeks later I
   brought this up in a discussion with @billierinaldi and we worked out a
   possible way to do this on a whiteboard.  This document captures that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [accumulo-website] milleruntime commented on issue #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
milleruntime commented on issue #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#issuecomment-592671961
 
 
   I was wondering what your thoughts were with this design and merging minor compactions?  I would think their benefits would be minimized and removing them would greatly reduce complexity.

[GitHub] [accumulo-website] milleruntime commented on a change in pull request #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#discussion_r385845996
 
 

 ##########
 File path: design/external-compactions.md
 ##########
 @@ -0,0 +1,245 @@
+---
+title: External Compaction Design
+---
+
+## Definitions
+
+ * *External compaction:* compactions of a tablet execute in a process other than
+   the tablet server hosting that tablet.
+
+ * *Internal compaction:* compactions of a tablet execute in the tablet server
+   process hosting the tablet.
+
+ * *Compactor:* Accumulo process that runs external compactions.
+
+## Introduction
+
+Currently, Accumulo only supports internal compactions.  This can lead to
+uneven load on a cluster.  For example, a few tablet servers could have many
+tablets to compact while many tablet servers are idle.  If Accumulo supported
+external compactions, then compaction work could be spread evenly across a
+cluster.
+
+Compactors could start with a command like:
+
+```
+  accumulo compactor <queue>
+```
+
+This would start a process that looks for compactions on the specified
+distributed queue and executes them.  The command could easily run in a Docker
+container in something like Kubernetes.  A compactor would need to continually
+do the following:
+
+ * Find tablets with files to compact in the queue
+ * Reserve files/work unit
+ * Compact files
+ * Commit compaction
+
+This document describes an alternative to the design outlined in [#1451].
+[#1451] proposes a pull+polling approach, early binding, leases, and client-side
+selection.  This proposal instead uses a distributed queue that supports late
+binding, runs selection in the tablet server rather than in an Accumulo client,
+and uses ZooKeeper ephemeral nodes in place of leases.
+
+## Selection
+
+Accumulo needs a mechanism to select files for compaction.  Currently this is
+done in two ways.  The first way is by looking for tablets with too many files
+according to a logarithmic size ratio.  The second way is when a user initiates
+a compaction with a specified file selection criterion.  User-initiated
+compactions may also specify custom iterators that do things like filter out
+unwanted data.
+
+In both cases above, this selection code currently runs in the tablet server.
+Users can optionally pass configuration to the selection code running in the
+tablet servers.  Going forward the selection code could run in three places.
+
+ * In the tablet server.
+ * In the compactor processes.
+ * In a user process that initiates a user compaction.
+
+Determining where selection code runs is an important consideration for the
+overall user experience.  However, for this document it is assumed that selection
+code runs somewhere and queues compaction work.  One possibility for selection
+is to use the approach outlined in [#564] with the additional capability of
+compaction managers to submit jobs to distributed external compaction queues
+(in addition to internal queues).
+
+## Queues
+
+A distributed queue for external compactions needs to support the following operations.
+
+ * Adding and removing compaction work for a tablet
+ * Prioritizing compaction work for a tablet
+ * Efficiently finding work
+
+One possibility to implement a distributed queue is a section in Accumulo’s
+metadata table for each queue.  Giving each queue a unique row prefix within
 
 Review comment:
   Do we want external processes (possibly many) to constantly be scanning the metadata table?  Would it be better to have a separate table for the compaction queue?
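
   To make the trade-off in this question concrete, here is a rough sketch of what a per-queue row prefix in a sorted table buys: a compactor only needs a bounded range scan over its own queue's rows, not a scan of the whole table. Everything in it is an assumption for illustration. The `~cq:` prefix, the row layout, the 0 to 9999 priority range, and the `PrefixQueueSketch` class are hypothetical and do not come from the design document, whose quoted text is cut off before it describes the actual layout.

```python
import bisect


class PrefixQueueSketch:
    """Toy stand-in for a sorted table; rows stay in sorted order like table keys."""

    def __init__(self):
        self._rows = []  # sorted list of row strings

    @staticmethod
    def _row(queue, priority, tablet):
        # Hypothetical layout: invert and zero-pad the priority (assumed 0..9999)
        # so that higher priorities sort first within a queue's prefix.
        return f"~cq:{queue}:{9999 - priority:04d}:{tablet}"

    def add(self, queue, priority, tablet):
        """Add compaction work for a tablet to the named queue."""
        bisect.insort(self._rows, self._row(queue, priority, tablet))

    def remove(self, queue, priority, tablet):
        """Remove compaction work for a tablet, if it is present."""
        row = self._row(queue, priority, tablet)
        i = bisect.bisect_left(self._rows, row)
        if i < len(self._rows) and self._rows[i] == row:
            del self._rows[i]

    def scan(self, queue, limit):
        """Return up to `limit` rows for one queue, highest priority first."""
        prefix = f"~cq:{queue}:"
        start = bisect.bisect_left(self._rows, prefix)
        found = []
        for row in self._rows[start:start + limit]:
            if not row.startswith(prefix):
                break  # left this queue's section of the sorted rows
            found.append(row)
        return found
```

   The scan touches at most `limit` rows of one queue's section, so its cost does not depend on how many other queues or tablets exist. Whether those rows live in the metadata table or in a separate table is exactly the open question above; the sketch works the same either way.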

[GitHub] [accumulo-website] ctubbsii commented on issue #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on issue #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#issuecomment-592761274
 
 
   I agree with the above statements about smart selection, multiple queues, and/or external compactions, removing the need for merging minor compactions. I think merging minor compactions provide too little value to justify their complexity, especially in light of these better alternatives. I'd be happy if they went away.

[GitHub] [accumulo-website] keith-turner commented on a change in pull request #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
keith-turner commented on a change in pull request #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#discussion_r385888951
 
 

 ##########
 File path: design/external-compactions.md
 ##########
 @@ -0,0 +1,245 @@
+One possibility to implement a distributed queue is a section in Accumulo’s
+metadata table for each queue.  Giving each queue a unique row prefix within
 
 Review comment:
   A goal of the design is to avoid that.  When a compactor is running a compaction, it will not be scanning the metadata table.  When it is not running a compaction and is searching for work, a few things should help reduce the load on the metadata table.  First, whenever a compactor finds nothing or has a reservation collision, it will do exponential backoff before looking again.  Second, one purpose of the bins is to avoid reservation collisions (those could cause a lot of metadata activity).
   
   Do you think it would be useful to add a section about this?
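
   As a minimal sketch of the backoff behavior described above, assuming hypothetical `find_work` and `try_reserve` callables that stand in for the queue lookup and the reservation step, the search loop might look something like this. The initial delay, growth factor, cap, and attempt limit are illustrative values, not ones proposed in the design.

```python
def backoff_delays(initial=0.1, factor=2.0, maximum=30.0):
    """Yield an endless series of delays that grow by `factor` up to `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def poll_for_work(find_work, try_reserve, sleep, max_attempts=10):
    """Look for a job and reserve it, backing off after each miss or collision.

    find_work() returns a job or None (hypothetical queue lookup).
    try_reserve(job) returns True only if this compactor won the reservation.
    sleep(seconds) is injected so the loop can be exercised without waiting.
    """
    delays = backoff_delays()
    for _ in range(max_attempts):
        job = find_work()
        if job is not None and try_reserve(job):
            return job
        # Nothing was found, or another compactor reserved the job first:
        # wait longer before touching the queue again.
        sleep(next(delays))
    return None
```

   In a real compactor `sleep` would be `time.sleep`, and some random jitter on each delay would probably be worth adding so that many idle compactors do not all retry at the same moment.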

[GitHub] [accumulo-website] milleruntime commented on a change in pull request #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
milleruntime commented on a change in pull request #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#discussion_r385903753
 
 

 ##########
 File path: design/external-compactions.md
 ##########
 @@ -0,0 +1,245 @@
+One possibility to implement a distributed queue is a section in Accumulo’s
+metadata table for each queue.  Giving each queue a unique row prefix within
 
 Review comment:
   No, your write-up is solid, and I did see your comments in the other sections about collisions.  I was thinking more of scale or performance, and of preventing users from shooting themselves in the foot by crippling the metadata table with too many Compactors or poorly configured Compactors.

[GitHub] [accumulo-website] keith-turner commented on issue #223: Adds a design document for external compactions

Posted by GitBox <gi...@apache.org>.
keith-turner commented on issue #223: Adds a design document for external compactions
URL: https://github.com/apache/accumulo-website/pull/223#issuecomment-592698663
 
 
   > I was wondering what your thoughts were with this design and merging minor compactions? I would think their benefits would be minimized and removing them would greatly reduce complexity.
   
   I had not considered that, but I think removing them may be a good idea.  A smart selection algorithm plus multiple compaction queues (like this [proposal](https://gist.github.com/keith-turner/16125790c6ff0d86c67795a08d2c057f)) could remove the need for merging minor compactions.  I think one would want compactions for small files to run on an internal queue for responsiveness.  Compactions of larger files could run on an external queue.
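
   A rough illustration of that split, assuming a simple size cutoff: the `choose_queue` helper and the 1 GiB threshold below are hypothetical, and a real selection algorithm would likely weigh more than total input size.

```python
def choose_queue(file_sizes, external_threshold_bytes=1 << 30):
    """Route a compaction to a queue based on the total size of its input files.

    Small compactions stay on the tablet server's internal queue for
    responsiveness; larger ones go to an external queue served by compactors.
    The threshold is an illustrative assumption.
    """
    total = sum(file_sizes)
    return "external" if total >= external_threshold_bytes else "internal"
```

   For example, `choose_queue([10, 20])` picks the internal queue, while a single 1 GiB file crosses the assumed cutoff and is routed to the external queue.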
