Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2011/05/02 02:47:18 UTC

one of our datanodes stops working after few hours

I took a jstack (http://pastebin.com/5v6mHg3t). After a few hours, it
literally staggers to a halt and gets very, very slow... Any ideas
what it's blocking on?
(The main issue is that fsreads for the RS get really slow when that happens.)

-Jack
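The question of what the stuck threads are blocked on can be answered mechanically from a jstack dump. HotSpot prints `java.lang.Thread.State: BLOCKED` for a thread stuck on a monitor, followed by a `- waiting to lock <address> (a ClassName)` line, and the thread that owns that monitor usually shows a matching `- locked <address>` line. The Python sketch below assumes that standard HotSpot text format and groups BLOCKED threads by the monitor they want, naming the likely holder. The function names and the trimmed `SAMPLE` dump are hypothetical and hand-written for illustration; they are not taken from the pastebin in this thread.

```python
import re

# Patterns for the parts of a HotSpot jstack dump this sketch relies on.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')
WAITING = re.compile(r'- waiting to lock <(?P<addr>0x[0-9a-f]+)> \(a (?P<cls>[^)]+)\)')
LOCKED = re.compile(r'- locked <(?P<addr>0x[0-9a-f]+)> \(a (?P<cls>[^)]+)\)')

# Hypothetical, hand-written dump trimmed to the lines the parser reads.
SAMPLE = """\
"DataXceiver-1" daemon prio=10 tid=0x01 nid=0x11 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        - waiting to lock <0x00000000a1> (a org.apache.log4j.spi.RootLogger)

"DataXceiver-2" daemon prio=10 tid=0x02 nid=0x12 runnable
   java.lang.Thread.State: RUNNABLE
        - locked <0x00000000a1> (a org.apache.log4j.spi.RootLogger)
"""


def parse_threads(dump):
    """Split a jstack dump into one dict per thread."""
    threads, current = [], None
    for line in dump.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"name": header.group("name"), "state": None,
                       "waiting": None, "locked": []}
            threads.append(current)
        elif current is not None:
            if m := STATE.search(line):
                current["state"] = m.group("state")
            elif m := WAITING.search(line):
                current["waiting"] = (m.group("addr"), m.group("cls"))
            elif m := LOCKED.search(line):
                current["locked"].append(m.group("addr"))
    return threads


def blocked_by_lock(dump):
    """Map each contended monitor address to its class, holder and waiters."""
    threads = parse_threads(dump)
    # Heuristic: the last thread listing "- locked <addr>" is treated as holder.
    holders = {addr: t["name"] for t in threads for addr in t["locked"]}
    report = {}
    for t in threads:
        if t["state"] == "BLOCKED" and t["waiting"]:
            addr, cls = t["waiting"]
            entry = report.setdefault(
                addr, {"class": cls, "holder": holders.get(addr), "waiters": []})
            entry["waiters"].append(t["name"])
    return report


if __name__ == "__main__":
    for addr, info in blocked_by_lock(SAMPLE).items():
        print(f"{len(info['waiters'])} thread(s) blocked on {info['class']} "
              f"{addr}, held by {info['holder']}")
```

The holder is only a hint: a thread sitting in `Object.wait()` can still list `- locked` for a monitor it has temporarily released, so confirm against the holder's own stack before drawing conclusions.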

Re: one of our datanodes stops working after few hours

Posted by Jack Levin <ma...@gmail.com>.
I will try, thanks. I haven't run NFS since 1998 :).

-Jack

On Mon, May 2, 2011 at 10:10 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Hi Jack,
>
> Try turning off your clienttrace logs in the DN log4j.properties, perhaps?
>
> By any chance do you log to NFS?
>
> Your blocked threads all seem to be waiting on appends to log4j.
>
> -Todd

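Todd's NFS question is worth checking rather than assuming: if log appends go to a slow filesystem, every thread that logs can end up waiting behind the one doing the write. On Linux, one way to see where the DataNode's logs actually land is to find the longest mount point in `/proc/mounts` that contains the log directory and read its filesystem type. The Python sketch below does that; `fs_type_for` and `log_dir_fs_type` are hypothetical helper names, the path in the usage is only an example, and it does not decode the octal escapes (such as `\040` for a space) that `/proc/mounts` uses for unusual paths.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the fstype of the longest mount point in mounts_text containing path."""
    best_mount, best_type = None, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later entry for the same mount point win, since later
            # mounts are stacked on top of earlier ones.
            if best_mount is None or len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type


def log_dir_fs_type(log_dir):
    """Linux-only: look up the filesystem type that log_dir lives on."""
    with open("/proc/mounts") as f:
        return fs_type_for(os.path.realpath(log_dir), f.read())


if __name__ == "__main__":
    # Example path only; point this at the DataNode's actual log directory.
    print(log_dir_fs_type("/var/log/hadoop"))
```

A result of `nfs` (or `nfs4`) would support Todd's theory; a local type such as `ext3` would point elsewhere.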
Re: one of our datanodes stops working after few hours

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jack,

Try turning off your clienttrace logs in the DN log4j.properties, perhaps?

By any chance do you log to NFS?

Your blocked threads all seem to be waiting on appends to log4j.

-Todd

On Mon, May 2, 2011 at 7:29 PM, Jack Levin <ma...@gmail.com> wrote:

> As requested:
>
> http://pastebin.com/aySaTADp
>
> Note, blocked threads.
>
> -Jack



-- 
Todd Lipcon
Software Engineer, Cloudera

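Todd's suggestion amounts to raising the level of the DataNode's client-trace logger so its per-transfer INFO lines stop being appended. The sketch below assumes that logger is named `org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace` (the DataNode class name plus `.clienttrace`) and that `WARN` is quiet enough; check the name against your Hadoop version. It is a small Python helper, `quiet_clienttrace` (a hypothetical name), that rewrites the text of a `log4j.properties` file, replacing an existing uncommented setting for that logger or appending one.

```python
import re
import sys

# Assumed logger name: DataNode's class name plus ".clienttrace".
CLIENTTRACE_KEY = ("log4j.logger."
                   "org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace")


def quiet_clienttrace(props_text, level="WARN"):
    """Return props_text with the clienttrace logger set to `level`."""
    setting = f"{CLIENTTRACE_KEY}={level}"
    # Match an existing, uncommented assignment ("key = value" or "key: value").
    pattern = re.compile(
        rf"^[ \t]*{re.escape(CLIENTTRACE_KEY)}[ \t]*[=:].*$", re.MULTILINE)
    if pattern.search(props_text):
        # A function replacement avoids backslash-escape processing in re.sub.
        return pattern.sub(lambda _m: setting, props_text)
    sep = "" if not props_text or props_text.endswith("\n") else "\n"
    return f"{props_text}{sep}{setting}\n"


if __name__ == "__main__":
    # Usage: python quiet_clienttrace.py /path/to/log4j.properties
    path = sys.argv[1]
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(quiet_clienttrace(text))
```

The DataNode presumably has to be restarted to pick up the edited file, since log4j 1.2 does not re-read `log4j.properties` on its own unless configured to watch it.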
Re: one of our datanodes stops working after few hours

Posted by Jack Levin <ma...@gmail.com>.
As requested:

http://pastebin.com/aySaTADp

Note the blocked threads.

-Jack

On Mon, May 2, 2011 at 2:39 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
> I think Todd was asking to have a jstack without yourkit, so it
> shouldn't be an issue for you :)
>
> J-D

Re: one of our datanodes stops working after few hours

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I think Todd was asking for a jstack without YourKit, so it
shouldn't be an issue for you :)

J-D

On Mon, May 2, 2011 at 1:56 PM, Jack Levin <ma...@gmail.com> wrote:
> my yourkit version expired :)... but here is the jstack when it
> happens: http://pastebin.com/5v6mHg3t

Re: one of our datanodes stops working after few hours

Posted by Jack Levin <ma...@gmail.com>.
My YourKit version expired :)... but here is the jstack from when it
happens: http://pastebin.com/5v6mHg3t

On Mon, May 2, 2011 at 1:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
> On Mon, May 2, 2011 at 12:56 PM, Jack Levin <ma...@gmail.com> wrote:
>
>> Tried removing yourkit and run on javasun, same thing.  We have some
>> threads blocked, does anyone know what they block on?
>>
>
> Which threads are blocked? Can you get some jstacks without yourkit?
>
> -Todd

Re: one of our datanodes stops working after few hours

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, May 2, 2011 at 12:56 PM, Jack Levin <ma...@gmail.com> wrote:

> Tried removing yourkit and run on javasun, same thing.  We have some
> threads blocked, does anyone know what they block on?
>

Which threads are blocked? Can you get some jstacks without yourkit?

-Todd


Re: one of our datanodes stops working after few hours

Posted by Jack Levin <ma...@gmail.com>.
Tried removing YourKit and running on the Sun JDK: same thing. We have some
threads blocked; does anyone know what they are blocked on?

-Jack

On Mon, May 2, 2011 at 7:53 AM, Todd Lipcon <to...@cloudera.com> wrote:
> Hi Jack,
>
> Does this happen even if you aren't running Yourkit on the DN?
>
> Can you try using a Sun JDK instead of OpenJDK?
>
> -Todd

Re: one of our datanodes stops working after few hours

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jack,

Does this happen even if you aren't running Yourkit on the DN?

Can you try using a Sun JDK instead of OpenJDK?

-Todd

On Sun, May 1, 2011 at 7:34 PM, Jack Levin <ma...@gmail.com> wrote:

> Version:         0.20.2+320 hdfs
> .89 HBASE
>
> ulimit is 32k
> xcievers is 5k
>
> Note from the jstack, I am not exceeding xcievers.
>
> -Jack


Re: one of our datanodes stops working after few hours

Posted by Jack Levin <ma...@gmail.com>.
Versions: HDFS 0.20.2+320
HBase 0.89

ulimit -n is 32k
xcievers is 5k

Note that, judging from the jstack, I am not exceeding xcievers.

-Jack

On Sun, May 1, 2011 at 6:19 PM, Michael Segel <mi...@hotmail.com> wrote:
>
>
> What's your xceivers set to?
> What's the ulimit -n  set for hdfs/hadoop user... (You didn't say which release/version you were using.)

RE: one of our datanodes stops working after few hours

Posted by Michael Segel <mi...@hotmail.com>.

What's your xceivers set to?
What's ulimit -n set to for the hdfs/hadoop user? (You didn't say which release/version you were using.)

> Date: Sun, 1 May 2011 17:47:18 -0700
> Subject: one of our datanodes stops working after few hours
> From: magnito@gmail.com
> To: user@hbase.apache.org
> 
> I took a jstack (http://pastebin.com/5v6mHg3t).   After few hours, its
> literally staggers to a halt and gets very very slow... Any ideas
> whats its blocking on?
> (main issue is that fsreads for RS get really slow when that happens).
> 
> -Jack
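Michael's two questions can be answered from the DataNode host itself. The Python sketch below assumes the setting lives in `hdfs-site.xml` under Hadoop's historically misspelled property `dfs.datanode.max.xcievers`, reads it from the usual `<configuration>`/`<property>`/`<name>`/`<value>` layout, and reports the soft and hard open-file limits via `resource.getrlimit(resource.RLIMIT_NOFILE)`. Note that `getrlimit` reports the limits of the process running the script, so it reflects the hdfs/hadoop user's `ulimit -n` only when run as that user in an equivalent login environment; on Linux, `/proc/<datanode-pid>/limits` shows what the running DataNode actually got. The helper names are hypothetical.

```python
import resource
import sys
import xml.etree.ElementTree as ET

XCIEVERS_KEY = "dfs.datanode.max.xcievers"  # Hadoop's own (misspelled) name


def read_hadoop_property(xml_text, name, default=None):
    """Return the <value> of property `name` from a Hadoop *-site.xml string."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else default
    return default


def open_file_limits():
    """(soft, hard) RLIMIT_NOFILE for this process; `ulimit -n` prints the soft one."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Usage: python check_dn_limits.py /path/to/hdfs-site.xml
    with open(sys.argv[1]) as f:
        xcievers = read_hadoop_property(f.read(), XCIEVERS_KEY)
    soft, hard = open_file_limits()
    print(f"{XCIEVERS_KEY} = {xcievers or 'unset (Hadoop default)'}")
    print(f"open files: soft={soft} hard={hard}")
```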