Posted to user@hbase.apache.org by Wayne <wa...@gmail.com> on 2011/01/10 20:59:55 UTC

CPU Wait Problems

Last night one of our nodes went AWOL and got stuck at a permanent 50% CPU wait
time. Its load climbed steadily to 400 before we noticed and had to hard
reboot. Apart from that, all other Ganglia metrics flat-lined. Is this some
sort of bizarre kernel problem? We are using XFS with standard settings. I
have seen a few postings about bizarre problems like this. Can XFS be blamed,
or is it more kernel related? Is there a posting somewhere suggesting the best
file system settings? Are there recommended settings for CentOS 5.5? We have a
10-node cluster that we have been pounding for weeks, and we can't seem to keep
all ten nodes up for a 24-hour period. I am hoping there is a lower-level
problem causing much of it.

Thanks.
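
For anyone trying to catch this kind of stall earlier, here is a minimal
sketch of how the CPU wait figure can be derived directly on a node. It is not
what the poster used (the numbers above came from Ganglia); it assumes a Linux
/proc/stat whose aggregate "cpu" line lists user, nice, system, idle and
iowait in that order, and a Python 3 interpreter. The function names are made
up for illustration.

```python
import time


def parse_cpu_line(line):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal [guest ...]
    iowait = values[4]
    # Guest time is already included in user/nice, so only sum the first eight.
    total = sum(values[:8])
    return iowait, total


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    io0, total0 = parse_cpu_line(before)
    io1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (io1 - io0) / elapsed


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, 'interval' seconds apart, and return iowait %."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling sample_iowait() from a cron job or a small watchdog loop and alerting
when the value stays high for several samples would flag a node in this state
before the load reaches the hundreds. It only reports the symptom; it says
nothing about whether the disk, controller, file system or JVM is at fault.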

Re: CPU Wait Problems

Posted by Wayne <wa...@gmail.com>.
That was our first thought, as we had motherboard-controller-based RAID 0 in
place. We have since rebuilt all nodes as JBOD using the recommended ext4
partition creation and mount parameters. So far so good.
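
The thread does not spell out which mount parameters were used. As a rough
sketch only, the snippet below checks the one option that is commonly
suggested for Hadoop/HBase data disks, noatime; treating it as part of "the
recommended" set is an assumption, as are the function names. It parses text
in the /proc/mounts format and lists ext4 mount points that lack the option.

```python
def mounts_missing_option(mounts_text, fstype="ext4", option="noatime"):
    """Return mount points of the given filesystem type that lack a mount option."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        mount_point, fs, options = fields[1], fields[2], fields[3]
        if fs == fstype and option not in options.split(","):
            missing.append(mount_point)
    return missing


def check_local_mounts(path="/proc/mounts"):
    """Run the check against the mounts of the machine this is executed on."""
    with open(path) as f:
        return mounts_missing_option(f.read())
```

Running check_local_mounts() on each node after a rebuild gives a quick way to
confirm that every data disk came up with the intended options.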

On Tue, Jan 11, 2011 at 4:24 PM, Ted Dunning <td...@maprtech.com> wrote:

> I have seen this also with evil disk controllers on the edge of dying.

Re: CPU Wait Problems

Posted by Ted Dunning <td...@maprtech.com>.
I have seen this also with evil disk controllers on the edge of dying.

On Tue, Jan 11, 2011 at 12:10 PM, Wayne <wa...@gmail.com> wrote:

> Thanks a lot for the heads-up on this. We have only seen this once, but if
> we start seeing it more we will definitely try going back to a previous
> version. We are using 1.6u23. Are you using the Sun JVM? We were previously
> working with Cassandra and found OpenJDK 1.6u17 to be a lot better for
> other reasons (CMF).
>
> Thanks.

Re: CPU Wait Problems

Posted by Wayne <wa...@gmail.com>.
Thanks a lot for the heads-up on this. We have only seen this once, but if
we start seeing it more we will definitely try going back to a previous
version. We are using 1.6u23. Are you using the Sun JVM? We were previously
working with Cassandra and found OpenJDK 1.6u17 to be a lot better for
other reasons (CMF).

Thanks.


On Tue, Jan 11, 2011 at 12:22 PM, Brent Halsey <mr...@gmail.com> wrote:

> Which JDK are you using? We've had similar problems with jdk1.6u22 on
> Ubuntu 10.04 in Amazon EC2. Nodes would lock up for 20-40+ minutes.
>
> We haven't done any conclusive tests yet, but we haven't seen the same
> problems since downgrading to jdk1.6u16.
>
>  -brent

Re: CPU Wait Problems

Posted by Brent Halsey <mr...@gmail.com>.
Which JDK are you using? We've had similar problems with jdk1.6u22 on
Ubuntu 10.04 in Amazon EC2. Nodes would lock up for 20-40+ minutes.

We haven't done any conclusive tests yet, but we haven't seen the same
problems since downgrading to jdk1.6u16.

 -brent

On Mon, Jan 10, 2011 at 12:59 PM, Wayne <wa...@gmail.com> wrote:
> Last night one of our nodes went AWOL and got stuck at a permanent 50% CPU wait
> time. Its load climbed steadily to 400 before we noticed and had to hard
> reboot. Apart from that, all other Ganglia metrics flat-lined. Is this some
> sort of bizarre kernel problem? We are using XFS with standard settings. I
> have seen a few postings about bizarre problems like this. Can XFS be blamed,
> or is it more kernel related? Is there a posting somewhere suggesting the best
> file system settings? Are there recommended settings for CentOS 5.5? We have a
> 10-node cluster that we have been pounding for weeks, and we can't seem to keep
> all ten nodes up for a 24-hour period. I am hoping there is a lower-level
> problem causing much of it.
>
> Thanks.
>