Posted to user@hbase.apache.org by S Ahmed <sa...@gmail.com> on 2010/11/23 22:14:14 UTC

managing 5-10 servers

Hi,

How much of a guru do you have to be to keep say 5-10 servers humming?

I'm a 1-man shop, and I dream of developing a web application, and scaling
will be a core part of the application.

Is it feasible for a 1-man operation to manage a 5-10 server HBase cluster?
Is it something that requires hand-holding and constant monitoring, or does it
tend to be hands-off?

Re: managing 5-10 servers

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Not just su.pr, but also stumbleupon.com, which has the "social" layer.
We do have memcached in front of HBase. Regarding blog posts about our
setup, just search for "stumbleupon hbase" and you'll find tons. The
most recent presentation that's available online is my talk at Hadoop
World.

Vid: http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop
Slides: http://www.cloudera.com/resource/hw10_stumbleupon_advertising_platform_using_hbase

J-D

On Wed, Nov 24, 2010 at 6:22 AM, S Ahmed <sa...@gmail.com> wrote:
> So you have 20 nodes for the StumbleUpon link redirection service?
>
> Are there any blog posts that go over the setup and what sort of read/write
> traffic it gets? Is there a memcached layer that sits in front?
>
>> On Tue, Nov 23, 2010 at 4:44 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
>> I wish I could do a dump of my memory into an ops guide to HBase, but
>> currently I don't think there's such a writeup.
>>
>> What can go wrong... again it depends on your type of usage. With a
>> MR-heavy cluster, it's usually very easy to drive the IO wait through
>> the roof and then you'll end up with GC pauses >60 secs caused by CPU
>> starvation. Here's a recent example we got when a big Mahout job was
>> running:
>>
>> 2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K),
>> 103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs]
>> [Times: user=4.45 sys=2.02, real=104.72 secs]
>>
>> The trained eye will quickly see that something very bad happened on
>> that cluster. Indeed, during the post-mortem we saw that somehow that
>> machine had started swapping, which is the Worst Thing Ever (tm) that can
>> happen to a machine that runs Java processes. Make sure that your
>> memory usage always stays under your total memory, even when all the
>> mappers and reducers are using their heap to the fullest. And then
>> double-check that (which it seems we didn't do).
>>
>> On a cluster that serves web traffic, and thus must not be MRed
>> against, you get the "usual" stuff like bad disks and operator errors.
>>
>> J-D
>>
>> On Tue, Nov 23, 2010 at 1:31 PM, S Ahmed <sa...@gmail.com> wrote:
>> > Are there any writeups on what things to look for?
>> >
>> > What are some of the things that usually go wrong? Or is that an unfair
>> > question :)
>> >
>> > On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >
>> >> Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia
>> >> and preferably Nagios. Then it depends on what you're planning to do with
>> >> your cluster... here we have 2x 20 machines in production: the one
>> >> that serves live traffic is pretty much doing its own thing by itself
>> >> (although I keep a Ganglia tab open on a second monitor), and the
>> >> other one is used strictly for MapReduce, on which our internal users
>> >> have developed a habit of running very destructive jobs. But to be
>> >> fair, it's probably the users that need support the most ;)
>> >>
>> >> J-D
>> >>
>> >> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sa...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > How much of a guru do you have to be to keep say 5-10 servers humming?
>> >> >
>> >> > I'm a 1-man shop, and I dream of developing a web application, and
>> >> scaling
>> >> > will be a core part of the application.
>> >> >
>> >> > Is it feasible for a 1-man operation to manage a 5-10 server HBase
>> >> > cluster?
>> >> > Is it something that requires hand-holding and constant monitoring, or
>> >> > does it tend to be hands-off?
>> >> >
>> >>
>> >
>>
>

Re: managing 5-10 servers

Posted by S Ahmed <sa...@gmail.com>.
So you have 20 nodes for the StumbleUpon link redirection service?

Are there any blog posts that go over the setup and what sort of read/write
traffic it gets? Is there a memcached layer that sits in front?

On Tue, Nov 23, 2010 at 4:44 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> I wish I could do a dump of my memory into an ops guide to HBase, but
> currently I don't think there's such a writeup.
>
> What can go wrong... again it depends on your type of usage. With a
> MR-heavy cluster, it's usually very easy to drive the IO wait through
> the roof and then you'll end up with GC pauses >60 secs caused by CPU
> starvation. Here's a recent example we got when a big Mahout job was
> running:
>
> 2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K),
> 103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs]
> [Times: user=4.45 sys=2.02, real=104.72 secs]
>
> The trained eye will quickly see that something very bad happened on
> that cluster. Indeed, during the post-mortem we saw that somehow that
> machine had started swapping, which is the Worst Thing Ever (tm) that can
> happen to a machine that runs Java processes. Make sure that your
> memory usage always stays under your total memory, even when all the
> mappers and reducers are using their heap to the fullest. And then
> double-check that (which it seems we didn't do).
>
> On a cluster that serves web traffic, and thus must not be MRed
> against, you get the "usual" stuff like bad disks and operator errors.
>
> J-D
>
> On Tue, Nov 23, 2010 at 1:31 PM, S Ahmed <sa...@gmail.com> wrote:
> > Are there any writeups on what things to look for?
> >
> > What are some of the things that usually go wrong? Or is that an unfair
> > question :)
> >
> > On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia
> >> and preferably Nagios. Then it depends on what you're planning to do with
> >> your cluster... here we have 2x 20 machines in production: the one
> >> that serves live traffic is pretty much doing its own thing by itself
> >> (although I keep a Ganglia tab open on a second monitor), and the
> >> other one is used strictly for MapReduce, on which our internal users
> >> have developed a habit of running very destructive jobs. But to be
> >> fair, it's probably the users that need support the most ;)
> >>
> >> J-D
> >>
> >> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sa...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > How much of a guru do you have to be to keep say 5-10 servers humming?
> >> >
> >> > I'm a 1-man shop, and I dream of developing a web application, and
> >> scaling
> >> > will be a core part of the application.
> >> >
> >> > Is it feasible for a 1-man operation to manage a 5-10 server HBase
> >> > cluster?
> >> > Is it something that requires hand-holding and constant monitoring, or
> >> > does it tend to be hands-off?
> >> >
> >>
> >
>

Re: managing 5-10 servers

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I wish I could do a dump of my memory into an ops guide to HBase, but
currently I don't think there's such a writeup.

What can go wrong... again it depends on your type of usage. With a
MR-heavy cluster, it's usually very easy to drive the IO wait through
the roof and then you'll end up with GC pauses >60 secs caused by CPU
starvation. Here's a recent example we got when a big Mahout job was
running:

2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K),
103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs]
[Times: user=4.45 sys=2.02, real=104.72 secs]

The trained eye will quickly see that something very bad happened on
that cluster. Indeed, during the post-mortem we saw that somehow that
machine had started swapping, which is the Worst Thing Ever (tm) that can
happen to a machine that runs Java processes. Make sure that your
memory usage always stays under your total memory, even when all the
mappers and reducers are using their heap to the fullest. And then
double-check that (which it seems we didn't do).
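
As a rough back-of-the-envelope check, something like the sketch below
helps (every heap size and slot count in it is a made-up placeholder, not
a recommendation; plug in your own numbers):

#!/usr/bin/env python
# Worst-case memory budget for a slave node that runs a DataNode, a
# RegionServer and a TaskTracker plus the MapReduce child JVMs.
# Every figure here is a hypothetical example.

PHYSICAL_RAM_MB      = 16 * 1024  # what the box actually has
DATANODE_HEAP_MB     = 1024
REGIONSERVER_HEAP_MB = 4096
TASKTRACKER_HEAP_MB  = 1024
MAP_SLOTS            = 4
REDUCE_SLOTS         = 2
CHILD_HEAP_MB        = 1024       # the -Xmx given to each child JVM
OS_AND_OVERHEAD_MB   = 2048       # OS, page cache headroom, off-heap JVM costs

worst_case = (DATANODE_HEAP_MB + REGIONSERVER_HEAP_MB + TASKTRACKER_HEAP_MB
              + (MAP_SLOTS + REDUCE_SLOTS) * CHILD_HEAP_MB
              + OS_AND_OVERHEAD_MB)

print("worst case: %d MB out of %d MB" % (worst_case, PHYSICAL_RAM_MB))
if worst_case > PHYSICAL_RAM_MB:
    print("OVERCOMMITTED - this box can swap under a full MapReduce load")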

On a cluster that serves web traffic, and thus must not be MRed
against, you get the "usual" stuff like bad disks and operator errors.
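
And if you want to catch pauses like the one above before your users do,
even a crude scan of the GC logs goes a long way. A small sketch (the log
path and the threshold are arbitrary examples):

#!/usr/bin/env python
# Print GC events whose wall-clock ("real") time exceeds a threshold,
# from a log produced with -XX:+PrintGCDetails. Path and threshold are
# illustrative only.
import re
import sys

THRESHOLD_SECS = 10.0
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "gc.log"  # hypothetical path

pause = re.compile(r"real=([0-9.]+) secs")
for line in open(LOG_PATH):
    match = pause.search(line)
    if match and float(match.group(1)) > THRESHOLD_SECS:
        print(line.rstrip())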

J-D

On Tue, Nov 23, 2010 at 1:31 PM, S Ahmed <sa...@gmail.com> wrote:
> Are there any writeups on what things to look for?
>
> What are some of the things that usually go wrong? Or is that an unfair
> question :)
>
> On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
>> Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia
>> and preferably Nagios. Then it depends on what you're planning to do with
>> your cluster... here we have 2x 20 machines in production: the one
>> that serves live traffic is pretty much doing its own thing by itself
>> (although I keep a Ganglia tab open on a second monitor), and the
>> other one is used strictly for MapReduce, on which our internal users
>> have developed a habit of running very destructive jobs. But to be
>> fair, it's probably the users that need support the most ;)
>>
>> J-D
>>
>> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sa...@gmail.com> wrote:
>> > Hi,
>> >
>> > How much of a guru do you have to be to keep say 5-10 servers humming?
>> >
>> > I'm a 1-man shop, and I dream of developing a web application, and
>> scaling
>> > will be a core part of the application.
>> >
>> > Is it feasible for a 1-man operation to manage a 5-10 server HBase
>> > cluster?
>> > Is it something that requires hand-holding and constant monitoring, or does it
>> > tend to be hands-off?
>> >
>>
>

Re: managing 5-10 servers

Posted by S Ahmed <sa...@gmail.com>.
Are there any writeups on what things to look for?

What are some of the things that usually go wrong? Or is that an unfair
question :)

On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia
> and preferably Nagios. Then it depends on what you're planning to do with
> your cluster... here we have 2x 20 machines in production: the one
> that serves live traffic is pretty much doing its own thing by itself
> (although I keep a Ganglia tab open on a second monitor), and the
> other one is used strictly for MapReduce, on which our internal users
> have developed a habit of running very destructive jobs. But to be
> fair, it's probably the users that need support the most ;)
>
> J-D
>
> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sa...@gmail.com> wrote:
> > Hi,
> >
> > How much of a guru do you have to be to keep say 5-10 servers humming?
> >
> > I'm a 1-man shop, and I dream of developing a web application, and
> scaling
> > will be a core part of the application.
> >
> > Is it feasible for a 1-man operation to manage a 5-10 server HBase
> > cluster?
> > Is it something that requires hand-holding and constant monitoring, or does it
> > tend to be hands-off?
> >
>

Re: managing 5-10 servers

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Constant hand-holding, no; constant monitoring, yes. Do set up Ganglia
and preferably Nagios. Then it depends on what you're planning to do with
your cluster... here we have 2x 20 machines in production: the one
that serves live traffic is pretty much doing its own thing by itself
(although I keep a Ganglia tab open on a second monitor), and the
other one is used strictly for MapReduce, on which our internal users
have developed a habit of running very destructive jobs. But to be
fair, it's probably the users that need support the most ;)
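
If you end up writing your own Nagios checks, even something as simple as
poking the master's web UI will catch a dead master. A minimal sketch (the
hostname is a placeholder and 60010 is assumed to be your master info
port; adjust both for your setup):

#!/usr/bin/env python
# Nagios-style check: exit 0 (OK) if the HBase master web UI answers,
# exit 2 (CRITICAL) otherwise. Host and port below are assumptions.
import sys
import urllib.request

MASTER_UI = "http://hbase-master.example.com:60010/"  # hypothetical host

try:
    urllib.request.urlopen(MASTER_UI, timeout=10)
    print("OK - HBase master web UI is responding")
    sys.exit(0)
except Exception as exc:
    print("CRITICAL - HBase master web UI unreachable: %s" % exc)
    sys.exit(2)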

J-D

On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <sa...@gmail.com> wrote:
> Hi,
>
> How much of a guru do you have to be to keep say 5-10 servers humming?
>
> I'm a 1-man shop, and I dream of developing a web application, and scaling
> will be a core part of the application.
>
> Is it feasible for a 1-man operation to manage a 5-10 server HBase cluster?
> Is it something that requires hand-holding and constant monitoring, or does it
> tend to be hands-off?
>

Re: managing 5-10 servers

Posted by Lars George <la...@gmail.com>.
I have set up and maintained clusters of between 6 and 40 machines while
being a full-time developer, so all as part of the development
process. I used simple scripts like the ones I documented here
(http://www.larsgeorge.com/2009/02/hadoop-scripts-part-1.html).
Cluster SSH, as mentioned, is also used quite often, and if you want to
do it right, then use Puppet. But as JD says, setting up monitoring is
the very first step, or else you are flying blind.
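
A common pattern for this kind of admin scripting is simply "loop over the
slaves file and run something on each host". A minimal sketch of that idea
(not the actual scripts from the post above; the slaves-file path is only
an example):

#!/usr/bin/env python
# Run the same shell command, one host at a time, on every node listed in
# the Hadoop slaves file. Purely illustrative; the path is an assumption.
import subprocess
import sys

SLAVES_FILE = "/usr/local/hadoop/conf/slaves"  # hypothetical install path
command = " ".join(sys.argv[1:]) or "uptime"

for line in open(SLAVES_FILE):
    host = line.strip()
    if not host or host.startswith("#"):
        continue
    print("=== %s ===" % host)
    subprocess.call(["ssh", host, command])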

On Wed, Nov 24, 2010 at 7:26 PM, Wojciech Langiewicz
<wl...@gmail.com> wrote:
> On 23.11.2010 22:14, S Ahmed wrote:
>>
>> Hi,
>>
>> How much of a guru do you have to be to keep say 5-10 servers humming?
>>
>> I'm a 1-man shop, and I dream of developing a web application, and scaling
>> will be a core part of the application.
>>
>> Is it feasible for a 1-man operation to manage a 5-10 server HBase
>> cluster?
>> Is it something that requires hand-holding and constant monitoring, or does it
>> tend to be hands-off?
>>
> I'm not sure what kind of managing you mean, but for doing admin work on
> Hadoop/HBase machines I use Cluster SSH (cssh), which allows you to log on
> to multiple machines at once and execute commands. For clusters of up to 20
> machines I think it's quite OK.
>
> --
> Wojciech Langiewicz
>

Re: managing 5-10 servers

Posted by Wojciech Langiewicz <wl...@gmail.com>.
On 23.11.2010 22:14, S Ahmed wrote:
> Hi,
>
> How much of a guru do you have to be to keep say 5-10 servers humming?
>
> I'm a 1-man shop, and I dream of developing a web application, and scaling
> will be a core part of the application.
>
> Is it feasible for a 1-man operation to manage a 5-10 server HBase cluster?
> Is it something that requires hand-holding and constant monitoring, or does it
> tend to be hands-off?
>
I'm not sure what kind of managing you mean, but for doing admin work on
Hadoop/HBase machines I use Cluster SSH (cssh), which allows you to log
on to multiple machines at once and execute commands. For clusters of up
to 20 machines I think it's quite OK.
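
If you prefer to script it rather than type into many terminals, the same
idea also works non-interactively. A rough sketch that pushes one command
to every host in parallel (the hosts file name is a made-up example):

#!/usr/bin/env python
# Run one command on many hosts at once over ssh, roughly what cssh does
# minus the interactive part. Output is captured per host so the lines
# don't interleave. The hosts file name is illustrative.
import subprocess
import sys
import threading

HOSTS_FILE = "cluster_hosts.txt"  # one hostname per line, hypothetical
command = " ".join(sys.argv[1:]) or "uptime"

def run(host):
    result = subprocess.run(["ssh", host, command],
                            capture_output=True, text=True)
    print("=== %s ===\n%s" % (host, result.stdout), end="")

threads = []
for line in open(HOSTS_FILE):
    host = line.strip()
    if host and not host.startswith("#"):
        thread = threading.Thread(target=run, args=(host,))
        thread.start()
        threads.append(thread)
for thread in threads:
    thread.join()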

--
Wojciech Langiewicz