Posted to hdfs-user@hadoop.apache.org by Terry Healy <th...@bnl.gov> on 2012/05/18 15:51:49 UTC

Unable to start NN after rack assignment attempt

Running Apache 1.0.2, ~12 datanodes.

Ran fsck / -> OK beforehand; everything was running as expected.

Started trying to use a script to assign nodes to racks, which required
several stop-dfs.sh / start-dfs.sh cycles (with some stop-all.sh /
start-all.sh too, if that matters).

Got past the errors in the script and data file, but dfsadmin -report still
showed all nodes assigned to the default rack. I tried replacing one system
name in the rack mapping file with its IP address. At this point the NN
failed to start up.

So I commented out the topology.script.file.name property statements in
hdfs-site.xml.
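For context, a rack-mapping setup of the sort described here pairs the
topology.script.file.name property with a small script that prints one rack
path per host argument. A minimal sketch (the function name, mapping-file
path, and rack names below are hypothetical; Hadoop may pass either hostnames
or IP addresses, so the data file should list both forms for each node):

```shell
#!/bin/sh
# Sketch of a Hadoop rack-awareness topology script. Hadoop invokes the
# script named by topology.script.file.name with one or more hostnames
# or IP addresses and expects one rack path per argument on stdout.

resolve_rack() {
    # Hypothetical mapping-file location, overridable via TOPOLOGY_DATA.
    mapfile=${TOPOLOGY_DATA:-/etc/hadoop/topology.data}
    for host in "$@"; do
        # topology.data format: "<hostname-or-ip> <rack>" per line.
        rack=$(awk -v h="$host" '$1 == h { print $2 }' "$mapfile" 2>/dev/null)
        # Fall back to the default rack for unknown hosts -- printing
        # nothing here is a common cause of NN-side mapping trouble.
        [ -n "$rack" ] || rack=/default-rack
        echo "$rack"
    done
}

resolve_rack "$@"
```

Because the NameNode may hand the script IPs rather than the names in the
mapping file, listing each node under both its hostname and its IP avoids the
everything-in-/default-rack symptom described above.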

The NN still fails to start; the trace below indicates an EOFException, but I
don't know which file it can't read.

As always, your patience with a noob is appreciated; any suggestions to get
started again? (I can forget about the rack assignment for now.)

Thanks.



Re: Unable to start NN after rack assignment attempt

Posted by Terry Healy <th...@bnl.gov>.
Todd-

Thanks for your reply. I went out on a limb, started digging into the
source code, and figured the problem was the FSImage. So I saved it, copied
over the copy from my checkpoint directory, and got running again.
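For anyone hitting the same EOFException, the recovery described above
amounts to replacing the truncated fsimage with the secondary NameNode's
checkpoint copy. A rough sketch with hypothetical directory paths
(dfs.name.dir and fs.checkpoint.dir vary per install, and the whole
dfs.name.dir should be backed up before touching anything):

```shell
# Sketch of the fsimage recovery described above. recover_fsimage takes
# the dfs.name.dir and fs.checkpoint.dir paths (defaults are hypothetical),
# saves the corrupt image aside, and restores the checkpoint copy.
recover_fsimage() {
    name_dir=${1:-/data/hadoop/dfs/name}                 # dfs.name.dir
    checkpoint_dir=${2:-/data/hadoop/dfs/namesecondary}  # fs.checkpoint.dir

    # Keep the truncated image around in case it is needed for diagnosis.
    cp "$name_dir/current/fsimage" "$name_dir/current/fsimage.corrupt" || return 1

    # Restore the last good image written by the secondary NameNode.
    cp "$checkpoint_dir/current/fsimage" "$name_dir/current/fsimage"
}
```

With the checkpoint image in place, any namespace edits made since the last
checkpoint are lost, so running fsck afterwards is prudent.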

I ran a few jobs to test and then ran into a problem getting a new node
running. Once again it looks like I will have to manually force an exit
from safe mode to run fsck -move.
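For reference, the manual safe-mode exit and block cleanup mentioned above
are standard Hadoop 1.x commands (note that -move relocates the salvageable
blocks of corrupt files to /lost+found, so use it with care):

```shell
hadoop dfsadmin -safemode get    # check whether the NN is in safe mode
hadoop dfsadmin -safemode leave  # force the NN out of safe mode
hadoop fsck / -move              # move corrupt files to /lost+found
```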

I sent mail to Harsh earlier - I think I must migrate to CDH, as I fear
my manual hacking with configs and such has caused the fragile state
that the cluster is in now.

Thanks,

Terry

On 05/18/2012 12:34 PM, Todd Lipcon wrote:
> Hi Terry,
> 
> It seems like something got truncated in your FSImage... though it's
> unclear how that might have happened.
> 
> If you're able to share your logs and your dfs.name.dir contents, feel
> free to contact me off-list and I can try to take a look to diagnose
> the issue and try to recover the system. Of course whenever any
> corruption issue occurs we take it seriously and want to get at a root
> cause to prevent future occurrences!
> 
> Thanks
> -Todd
> 
> On Fri, May 18, 2012 at 6:57 AM, Terry Healy <th...@bnl.gov> wrote:
>> Sorry, forgot to attach the trace:
>> [stack trace and quoted original message snipped]

-- 
Terry Healy / thealy@bnl.gov
Cyber Security Operations
Brookhaven National Laboratory
Building 515, Upton N.Y. 11973

Re: Unable to start NN after rack assignment attempt

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Terry,

It seems like something got truncated in your FSImage... though it's
unclear how that might have happened.

If you're able to share your logs and your dfs.name.dir contents, feel
free to contact me off-list and I can try to take a look to diagnose
the issue and try to recover the system. Of course whenever any
corruption issue occurs we take it seriously and want to get at a root
cause to prevent future occurrences!

Thanks
-Todd

On Fri, May 18, 2012 at 6:57 AM, Terry Healy <th...@bnl.gov> wrote:
> Sorry, forgot to attach the trace:
> [stack trace and quoted original message snipped]



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Unable to start NN after rack assignment attempt

Posted by Terry Healy <th...@bnl.gov>.
Sorry, forgot to attach the trace:
<code>
2012-05-18 09:54:45,355 INFO
org.apache.hadoop.hdfs.server.common.Storage: Number of files = 128
2012-05-18 09:54:45,379 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
	at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
	at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
	at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
2012-05-18 09:54:45,380 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
	at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
	at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
	at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
	at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
	at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)

2012-05-18 09:54:45,380 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at abcd/1xx.1xx.2xx.3xx
************************************************************/

</code>



On 05/18/2012 09:51 AM, Terry Healy wrote:
> [quoted original message snipped]