Posted to common-user@hadoop.apache.org by Garhan Attebury <at...@cse.unl.edu> on 2009/03/09 16:17:02 UTC

DataNode gets 'stuck', ends up with two DataNode processes

On a ~100 node cluster running HDFS (we just use HDFS + fuse, no
job/task trackers) I've noticed many datanodes get 'stuck'. The nodes
themselves seem fine with no network/memory problems, but in every  
instance I see two DataNode processes running, and the NameNode logs  
indicate the datanode in question simply stopped responding. This  
state persists until I come along and kill the DataNode processes and  
restart the DataNode on that particular machine.

I'm at a loss as to why this is happening, so here's all the relevant  
information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom patches  
running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory  
respectively

I managed to grab a stack dump via "kill -3" from two of these problem  
instances and threw up the logs at http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and having
very little Java development experience, the .out files mean nothing to
me. It's also worth mentioning that the NameNode logs at the time when  
these DataNodes got stuck show nothing out of the ordinary either --  
just the expected "lost heartbeat from node <x>" message. The DataNode  
daemon (the original process, not the second mysterious one) continues  
to respond to web requests like browsing the log directory during this  
time.

Whenever this happens I've just manually done a "kill -9" to remove  
the two stuck DataNode processes (I'm not even sure why there's two of  
them, as under normal operation there's only one). After killing the  
stuck ones, I simply do a "hadoop-daemon.sh start datanode" and all is  
normal again. I've not seen any data loss or corruption as a result of
this problem.

Has anyone seen anything like this happen before? Out of our ~100 node  
cluster I see this problem around once a day, and it seems to just  
strike random nodes at random times. It happens often enough that I  
would be happy to do additional debugging if anyone can tell me how.  
I'm not a developer at all, so I'm at the end of my knowledge on how  
to solve this problem. Thanks for any help!


===============================
Garhan Attebury
Systems Administrator
UNL Research Computing Facility
402-472-7761
===============================


Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by jason hadoop <ja...@gmail.com>.
I think the bug probably still exists and the fix only covers one trigger.

On Mon, Mar 9, 2009 at 12:45 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> Hey Jason,
>
> Looks like you're on the right track.  We are getting stuck after the JVM
> forks but before it exec's.  It appears to be this bug:
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051
>
> However, this is using Java 1.6.0_11, and that bug was marked as fixed in
> 1.6.0_6 :(
>
> Any other ideas?
>
> Brian


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Jason,

Looks like you're on the right track.  We are getting stuck after the  
JVM forks but before it exec's.  It appears to be this bug:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051

However, this is using Java 1.6.0_11, and that bug was marked as fixed  
in 1.6.0_6 :(

Any other ideas?

Brian

On Mar 9, 2009, at 2:21 PM, jason hadoop wrote:

> There were a couple of fork timing errors in JDK 1.5 that occasionally
> caused a subprocess fork to go bad; this could be the du/df being
> forked off by the datanode and dying.
>
> I can't find the references I had saved away at one point, from the
> Java forums, but perhaps this will get you started.
>
> http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0


Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by jason hadoop <ja...@gmail.com>.
There were a couple of fork timing errors in JDK 1.5 that occasionally
caused a subprocess fork to go bad; this could be the du/df being forked
off by the datanode and dying.

I can't find the references I had saved away at one point, from the Java
forums, but perhaps this will get you started.

http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0

On Mon, Mar 9, 2009 at 11:23 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:

> It's very strange.  It appears that the second process is the result of a
> fork call, yet has only one thread running whose gdb backtrace looks like
> this:
>
> (gdb) bt
> #0  0x0000003e10c0af8b in __lll_mutex_lock_wait () from
> /lib64/tls/libpthread.so.0
> #1  0x0000000000000000 in ?? ()
>
> Not very helpful!  I'd normally suspect some strange memory issue, but I've
> checked - there was plenty of memory available on the host when the second
> process was spawned and we weren't close to the file descriptor limit.
>
>
> Looking at this issue,
> https://issues.apache.org/jira/browse/HADOOP-2231
>
> it seems that the "df" call is avoidable now that we're in Java 1.6.
>  However, the issue was closed and marked as a duplicate, but without noting
> what it was a duplicate of (grrr).  Is there an updated version of that
> patch?
>
> Brian


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by Brian Bockelman <bb...@cse.unl.edu>.
It's very strange.  It appears that the second process is the result  
of a fork call, yet has only one thread running whose gdb backtrace  
looks like this:

(gdb) bt
#0  0x0000003e10c0af8b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
#1  0x0000000000000000 in ?? ()

Not very helpful!  I'd normally suspect some strange memory issue, but  
I've checked - there was plenty of memory available on the host when  
the second process was spawned and we weren't close to the file  
descriptor limit.


Looking at this issue,
https://issues.apache.org/jira/browse/HADOOP-2231

it seems that the "df" call is avoidable now that we're in Java 1.6.   
However, the issue was closed and marked as a duplicate, but without  
noting what it was a duplicate of (grrr).  Is there an updated version  
of that patch?
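
For what it's worth, the direction I'd expect such a patch to take is
just the plain java.io.File API from Java 6. A minimal sketch
(illustrative only, not the actual HADOOP-2231 change):

import java.io.File;

public class VolumeCapacity {
    public static void main(String[] args) {
        // One of our DataNode volumes; no fork/exec involved at all.
        File vol = new File("/hadoop-data/dfs/data");
        System.out.println("capacity = " + vol.getTotalSpace());   // bytes
        System.out.println("usable   = " + vol.getUsableSpace());  // bytes
    }
}

If DF.getCapacity() were backed by something like this, the datanode
would never need to fork a "df" at all.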

Brian

On Mar 9, 2009, at 12:48 PM, Steve Loughran wrote:

> Philip Zeyliger wrote:
>> Very naively looking at the stack traces, a common theme is that
>> there's a call out to "df" to find the system capacity.  If you see
>> two data node processes, perhaps the fork/exec to call out to "df"
>> is failing in some strange way.
>
> that's deep into Java code. OpenJDK gives you more of that source.
> One option here is to consider some kind of timeouts in the exec,
> but it's pretty tricky to tack that on round the Java runtime APIs,
> because the process APIs weren't really designed to be interrupted
> by other threads.
>
> -steve


Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by Steve Loughran <st...@apache.org>.
Philip Zeyliger wrote:
> Very naively looking at the stack traces, a common theme is that there's a
> call out to "df" to find the system capacity.  If you see two data node
> processes, perhaps the fork/exec to call out to "df" is failing in some
> strange way.

that's deep into Java code. OpenJDK gives you more of that source. One 
option here is to consider some kind of timeouts in the exec, but it's 
pretty tricky to tack that on round the Java runtime APIs, because the 
process APIs weren't really designed to be interrupted by other threads.
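
To illustrate how awkward it gets: about the best you can do from pure
Java is push the start()/waitFor() onto a worker thread and put a
timeout on the Future. A sketch only, not production code:

import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedShell {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Run a command, giving up after timeoutMs.
    public static int run(final String[] cmd, long timeoutMs)
            throws IOException, InterruptedException, TimeoutException {
        Future<Integer> exitCode = POOL.submit(new Callable<Integer>() {
            public Integer call() throws IOException, InterruptedException {
                Process p = new ProcessBuilder(cmd).start();
                return p.waitFor();
            }
        });
        try {
            return exitCode.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new IOException("exec failed: " + e.getCause());
        } catch (TimeoutException e) {
            // cancel(true) interrupts a thread blocked in waitFor(), but a
            // thread wedged inside the native fork ignores the interrupt,
            // and there is no Process handle to destroy().
            exitCode.cancel(true);
            throw e;
        }
    }
}

The catch: if the JVM wedges inside the fork itself, before start()
returns, the worker thread is stuck for good, so the timeout only lets
the caller walk away; it doesn't clean anything up.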

-steve



-- 
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/

Re: DataNode gets 'stuck', ends up with two DataNode processes

Posted by Philip Zeyliger <ph...@cloudera.com>.
Very naively looking at the stack traces, a common theme is that there's a
call out to "df" to find the system capacity.  If you see two data node
processes, perhaps the fork/exec to call out to "df" is failing in some
strange way.

"DataNode: [/hadoop-data/dfs/data]" daemon prio=10
tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
[0x0000000042c54000..0x0000000042c54b30]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:485)
	at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
	- locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
	at org.apache.hadoop.util.Shell.run(Shell.java:134)
	at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
	- locked <0x0000002a9ed97078> (a
org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
	at org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
	at java.lang.Thread.run(Thread.java:619)
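
For anyone who wants to poke at that call path outside the datanode, the
bottom of the trace boils down to roughly this (a stripped-down sketch,
not Hadoop's actual DF class):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DfProbe {
    public static void main(String[] args)
            throws IOException, InterruptedException {
        // Same shape as org.apache.hadoop.fs.DF: fork/exec a "df" against
        // the volume and read its output. Per the trace above, the hang is
        // inside start(), i.e. UNIXProcess.<init>, after the fork but
        // before the exec.
        ProcessBuilder pb = new ProcessBuilder("df", "-k", "/hadoop-data/dfs/data");
        Process p = pb.start();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(p.getInputStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        System.exit(p.waitFor());
    }
}

Running something like this in a loop on an affected node might be a
cheaper way to trigger the failure than waiting a day for a datanode to
hit it.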


