Posted to dev@hbase.apache.org by rahul gidwani <ch...@apache.org> on 2017/12/15 22:01:56 UTC

Major Compaction Tool

Hi,

I was wondering if anyone was interested in a manual major compactor tool.

The basic overview of how this tool works is:

Parameters:

   - Table
   - Stores
   - ClusterConcurrency
   - Timestamp


So you input a table, the desired concurrency, and the list of stores you
wish to major compact.  The tool first checks the filesystem to see which
stores need compaction based on the timestamp you provide (default is current
time).  It takes the list of stores that require compaction and executes
those requests concurrently, with at most N distinct RegionServers
compacting at a given time.  Each thread waits for its compaction to
complete before moving to the next queue.  If a region split, merge, or move
happens, the tool ensures those regions get major compacted as well.
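
As a rough illustration of the timestamp check described above, here is a
minimal Python sketch. It is not the tool's actual code. It assumes that a
store "needs compaction" when at least one of its files was last modified
before the supplied timestamp; the StoreFile type, the
stores_needing_compaction function, and the sample listing are all
hypothetical names made up for this example.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for one file found under a store directory."""
    path: str
    modified_ms: int  # last-modified time, in epoch milliseconds


def stores_needing_compaction(
    store_files: Dict[str, List[StoreFile]],
    cutoff_ms: Optional[int] = None,
) -> List[str]:
    """Return the stores that hold at least one file older than the cutoff.

    Assumption: "older than the cutoff" is what marks a store as needing a
    major compaction. The cutoff defaults to the current time.
    """
    if cutoff_ms is None:
        cutoff_ms = int(time.time() * 1000)
    return sorted(
        store
        for store, files in store_files.items()
        if any(f.modified_ms < cutoff_ms for f in files)
    )


# Made-up filesystem listing: store name -> files found in that store.
listing = {
    "region-a/cf1": [StoreFile("hfile-1", 1_000), StoreFile("hfile-2", 5_000)],
    "region-a/cf2": [StoreFile("hfile-3", 9_000)],
    "region-b/cf1": [],
}

print(stores_needing_compaction(listing, cutoff_ms=6_000))
# prints ['region-a/cf1']
```

With the default cutoff (the current time), every store that contains any
file at all is selected, which matches the "compact everything requested"
behaviour implied by the default.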

We have started using this tool in production but were wondering if there
is any interest from you guys in getting this upstream.

This helps us in two ways: we can limit how much I/O bandwidth we use for
major compaction cluster-wide, and we are guaranteed that once the tool
completes, all requested compactions have completed regardless of moves,
merges and splits.

Re: Major Compaction Tool

Posted by rahul gidwani <ra...@gmail.com>.
Thanks for all the great feedback!

I opened a ticket here:

https://issues.apache.org/jira/browse/HBASE-19528

Let's continue the discussion there.


On Fri, Dec 15, 2017 at 11:34 PM, sahil aggarwal <sa...@gmail.com>
wrote:

> Hi,
>
> We wrote something similar. It just triggers major compactions with a given
> parallelism and distributes them across the cluster.
>
> https://github.com/flipkart-incubator/hbase-compactor
>
>
> On Dec 16, 2017 10:01 AM, "Jean-Marc Spaggiari" <je...@spaggiari.org>
> wrote:
>
> Rahul,
>
> I've had something like this in mind for months/years! It's a must-have!
> Thanks for taking on the task! I'll register on the JIRA and come back very
> soon with tons of ideas and recommendations. You can count on me to test it
> too!
>
> JMS
>
> 2017-12-15 17:44 GMT-05:00 rahul gidwani <ra...@gmail.com>:
>
> > The tool creates a Map of servers to CompactionRequests needing to be
> > performed.  You always select the server with the largest queue (which
> > is not currently compacting) to compact next.
> >
> > I created a JIRA: HBASE-19528 for this tool.
> >
> > On Fri, Dec 15, 2017 at 2:35 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > bq. with at most N distinct RegionServers compacting at a given time
> > >
> > > If per-table balancing is not on, the regions for the underlying table
> > > may not be evenly distributed across the cluster.
> > > In that case, how would the tool decide which servers should perform
> > > compaction?
> > >
> > > I think you can log a JIRA for upstreaming this tool.
> > >
> > > Thanks
> > >

Re: Major Compaction Tool

Posted by sahil aggarwal <sa...@gmail.com>.
Hi,

We wrote something similar. It just triggers major compactions with a given
parallelism and distributes them across the cluster.

https://github.com/flipkart-incubator/hbase-compactor



Re: Major Compaction Tool

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Rahul,

I've had something like this in mind for months/years! It's a must-have!
Thanks for taking on the task! I'll register on the JIRA and come back very
soon with tons of ideas and recommendations. You can count on me to test it
too!

JMS


Re: Major Compaction Tool

Posted by rahul gidwani <ra...@gmail.com>.
The tool creates a Map of servers to CompactionRequests needing to be
performed.  You always select the server with the largest queue (which is
not currently compacting) to compact next.
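
The selection rule above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the tool's
implementation: pick_next_server, schedule_rounds, and the sample queues are
hypothetical names, ties are broken by server name purely to keep the output
deterministic, and the real tool's per-thread waiting is simplified here
into synchronous rounds in which each busy server compacts one store.

```python
from typing import Dict, List, Optional, Set, Tuple


def pick_next_server(queues: Dict[str, List[str]], busy: Set[str]) -> Optional[str]:
    """Pick the idle server with the most pending requests, or None."""
    candidates = [s for s, q in queues.items() if q and s not in busy]
    if not candidates:
        return None
    # Largest queue first; ties go to the alphabetically first server name.
    return min(candidates, key=lambda s: (-len(queues[s]), s))


def schedule_rounds(
    queues: Dict[str, List[str]], concurrency: int
) -> List[List[Tuple[str, str]]]:
    """Simulate rounds with at most `concurrency` distinct servers compacting."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    pending = {s: list(q) for s, q in queues.items()}  # do not mutate the input
    rounds = []
    while any(pending.values()):
        busy: Set[str] = set()
        current = []
        while len(busy) < concurrency:
            server = pick_next_server(pending, busy)
            if server is None:
                break
            busy.add(server)
            current.append((server, pending[server].pop(0)))
        rounds.append(current)
    return rounds


# Made-up, unevenly distributed queues: server -> stores awaiting compaction.
queues = {
    "rs1": ["r1/cf", "r2/cf", "r3/cf"],
    "rs2": ["r4/cf"],
    "rs3": ["r5/cf", "r6/cf"],
}

for i, batch in enumerate(schedule_rounds(queues, concurrency=2), start=1):
    print(i, batch)
# 1 [('rs1', 'r1/cf'), ('rs3', 'r5/cf')]
# 2 [('rs1', 'r2/cf'), ('rs2', 'r4/cf')]
# 3 [('rs1', 'r3/cf'), ('rs3', 'r6/cf')]
```

Because the longest queue is always served first, a server that hosts a
disproportionate share of the table's regions is kept busy from the start
instead of becoming the tail of the run.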

I created a JIRA: HBASE-19528 for this tool.


Re: Major Compaction Tool

Posted by Ted Yu <yu...@gmail.com>.
bq. with at most N distinct RegionServers compacting at a given time

If per-table balancing is not on, the regions for the underlying table may
not be evenly distributed across the cluster.
In that case, how would the tool decide which servers should perform
compaction?

I think you can log a JIRA for upstreaming this tool.

Thanks
