Posted to user@hbase.apache.org by Sam Seigal <se...@yahoo.com> on 2011/08/17 03:27:41 UTC

operational overhead for HBase

Hi All,

I had a question about the operational overhead of maintaining HBase in
production. Would someone care to share their experiences? We have a team
of 3 DBAs dedicated to maintaining our Oracle cluster. I am curious to know
if we would need the same for HBase.

I am talking of a small cluster of 7-8 machines handling around 150 million
transactions per hour for the initial rollout.

What are some of the common operational/maintenance tasks associated
with maintaining a cluster of that size? How much developer time goes into
this once the cluster is up and running?

It would be extremely beneficial to hear some thoughts/experiences.

Thank you,

Sam

Re: operational overhead for HBase

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I'm obviously not in the best position to answer, since I've been a
committer since 2008, but here is my experience, in case you can
relate to it:

At StumbleUpon we have 2 committers on staff (including me, oh and
we're looking to hire a third one if anyone is interested). We've been
using HBase in production since early 2009 so our clusters have been
through multiple upgrades of both software and hardware.

It used to be that our team was responsible for maintaining almost
everything related to HBase, except for actual hardware maintenance
and OS upgrades. It's hard for me to say exactly how much time we
spent maintaining HBase compared to developing it, as often both
overlapped.

Right now our situation has evolved a bit. Our ops team is mostly composed
of SREs[1], and they have varying levels of experience with HBase; one of
them has been trained more carefully than the others so that we have
a go-to guy.

Once the cluster is running, there's usually nothing to do except
keep a dashboard with the important metrics on one screen. Mine has
the requests/second, GC activity, and compaction queues for the whole
prod cluster.

To make your life easier you really need to:

 - Have tools to automate cluster maintenance, such as doing rolling
upgrades. We use Puppet and Fabric[2].
 - Have good metrics, we use OpenTSDB[3]. It also helps that the
author works for us.
 - Have a good alerting system, we use Nagios.
 - Have at least one ops guy who understands/codes in Java. HBase and
Hadoop have a lot of Java-isms, so it helps with finding your way around.
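
As a rough illustration of the metrics point above, here is a minimal
Python sketch that formats one data point in OpenTSDB's text "put"
protocol (put <metric> <timestamp> <value> <tag=value> ...). The metric
name and tag below are hypothetical, not names that HBase or OpenTSDB
ship with, and this is not a description of the StumbleUpon setup.

```python
import time


def opentsdb_put_line(metric, value, tags, timestamp=None):
    """Format one data point as an OpenTSDB text-protocol 'put' line."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Sort tags so the same input always produces the same line.
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_str}"


# Hypothetical metric name and host tag, for illustration only.
line = opentsdb_put_line("hbase.requests", 1200, {"host": "rs1"}, 1313544461)
```

A collector would write such lines, newline-terminated, to the socket of
a TSD instance; alerting (Nagios in this case) stays a separate concern.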

Do allocate time for your teams to understand how HBase works (both
data model and architecture) as it will make everything much easier.
You don't want to end up in the middle of an outage with no
understanding of what's going on at all. Distributed systems have
different failure modes than systems with a single-machine architecture
(even if those are in a cluster, like a master-slave MySQL setup).

Hope this helps,

J-D


1. Site Reliability Engineer, a term that I believe comes from the
goog; use their search engine to learn more about what that position
involves. I think you can say it's close to DevOps.
2. Fabric: http://docs.fabfile.org/en/1.2.0/index.html
3. OpenTSDB: http://opentsdb.net/


Re: operational overhead for HBase

Posted by Friso van Vollenhoven <fv...@xebia.com>.
I worked with a cluster of about that size. Once everything is spinning, it requires little attention in my experience. Just have sensible checks (Nagios or the like) on things like disks filling up, especially on the namenode, and have an alert on swap usage (swapping is usually the first sign of something crashing later on).
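
A minimal Python sketch of such checks, assuming Linux hosts (swap is read from /proc/meminfo) and the usual Nagios plugin convention of exit codes 0 = OK, 1 = WARNING, 2 = CRITICAL. The function names and the thresholds shown in the comments are illustrative assumptions, not values anyone in this thread recommended.

```python
import shutil

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL = 0, 1, 2


def classify(used_fraction, warn, crit):
    """Map a used fraction (0.0 to 1.0) to a Nagios status code."""
    if used_fraction >= crit:
        return CRITICAL
    if used_fraction >= warn:
        return WARNING
    return OK


def disk_used_fraction(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def swap_used_fraction(meminfo_text):
    """Parse the text of /proc/meminfo and return the fraction of swap in use."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured, so nothing can be swapped out
    return (total - fields.get("SwapFree", 0)) / total


# Example thresholds (assumptions): warn at 80% disk, alert on almost any swap.
# disk_status = classify(disk_used_fraction("/"), warn=0.80, crit=0.90)
# swap_status = classify(swap_used_fraction(open("/proc/meminfo").read()), 0.01, 0.10)
```

A wrapper script would print a one-line summary and exit with the worse of the two status codes so that Nagios can pick it up.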

Tuning (JVM garbage collection options and HBase configuration) is the hardest part, but possibly your development team does that (or possibly you are the dev team?). It can become a problem when the types of workload you handle change all the time and you need to re-tune every now and then. Also, you can buy a little peace of mind by having overcapacity. An important part of the work is in development: making sure that the workload is nicely distributed and not hitting a single region server all the time.
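
One common way to spread a workload is to salt row keys with a hash-derived prefix, so that sequential keys (timestamps, counters) do not all sort into the same region. The Python sketch below illustrates that general technique only; it is not necessarily what was done on the cluster described here, and the bucket count and key format are hypothetical.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical: roughly one bucket per region server


def salted_key(natural_key, buckets=NUM_BUCKETS):
    """Prefix the key with a stable, hash-derived bucket number."""
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{natural_key}"


# Sequential, timestamp-like keys would otherwise sort next to each other.
keys = [f"20110817-{i:06d}" for i in range(10000)]

# Count how many keys land in each bucket prefix.
spread = Counter(salted_key(k).split("-", 1)[0] for k in keys)
```

The trade-off is on the read side: a range scan over the natural key order now has to fan out over every bucket prefix.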

Another important thing is tooling around installation and configuration. You'll want to be able to do config changes and (rolling) restarts by running a single command / script. Same with adding machines to the cluster. We use cobbler + puppet for that.
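
The shape of such a single-command rolling restart can be sketched in plain Python as below. This is an illustration under assumptions, not the cobbler + puppet setup itself: it assumes passwordless ssh to each host and that HBase's bin/hbase-daemon.sh script is on the remote PATH, and the host names and settle time are hypothetical.

```python
import subprocess
import time


def ssh_run(host, command):
    """Run a command on a host over ssh; raises CalledProcessError on failure."""
    subprocess.run(["ssh", host, command], check=True)


def rolling_restart(
    hosts,
    run,
    stop_cmd="hbase-daemon.sh stop regionserver",
    start_cmd="hbase-daemon.sh start regionserver",
    settle_seconds=60.0,
):
    """Restart region servers one host at a time.

    `run` is a callable (host, command). Any exception it raises aborts
    the roll, so the hosts not yet touched keep serving.
    """
    done = []
    for host in hosts:
        run(host, stop_cmd)
        run(host, start_cmd)
        time.sleep(settle_seconds)  # give regions time to be reassigned
        done.append(host)
    return done


# Real use (hypothetical host names):
# rolling_restart(["rs1", "rs2", "rs3"], ssh_run)
```

Passing the runner in as an argument makes a dry run trivial: hand it a function that only prints the host and command, and the same loop shows what would happen without touching the cluster.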

We managed a 10 node cluster with a 5 person dev/ops/everything team alongside a number of MySQL boxes and some 20-ish other boxes with several purposes. It's pretty doable while still having time to do development.

Perhaps Cloudera's SCM express tool can help, but I have never looked at that.


Friso





Re: operational overhead for HBase

Posted by Doug Meil <do...@explorysmedical.com>.
One of the things on my to-do list is to reorganize the Tools appendix
into a full-fledged Operations chapter.  Even though that's a few weeks
out, there are still a lot of points in there worth noting, especially on
the subject of backup (Master and RegionServers, etc.).  Backup is arguably
the most important operational task.




