Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2012/09/04 18:41:26 UTC

could only be replicated to 0 nodes, instead of 1

I've been running up against the good old-fashioned "replicated to 0 nodes" gremlin quite a bit recently.  My system (a set of processes interacting with hadoop, and of course hadoop itself) runs for a while (a day or so) and then I get plagued with these errors.  This is a very simple system: a single node running pseudo-distributed.  Obviously, the replication factor is implicitly 1 and the datanode is the same machine as the namenode.  None of the typical culprits seem to explain the situation and I'm not sure what to do.  I'm also not sure how I've been getting around it so far: I fiddle desperately for a few hours and things start running again, but that's not really a solution.  I've tried stopping and restarting hdfs, but that doesn't seem to improve things.

So, to go through the common suspects one by one, as quoted on http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:

• No DataNode instances being up and running. Action: look at the servers, see if the processes are running.

I can interact with hdfs through the command line (doing directory listings for example).  Furthermore, I can see that the relevant java processes are all running (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker).
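
(For the record, I'm checking with the JDK's jps tool; a healthy run on
this box looks roughly like the following, with the PIDs obviously
varying:

  $ jps
  4825 NameNode
  4931 DataNode
  5040 SecondaryNameNode
  5128 JobTracker
  5247 TaskTracker
  5392 Jps

All five daemons plus jps itself.)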

• The DataNode instances cannot talk to the server, through networking or Hadoop configuration problems. Action: look at the logs of one of the DataNodes.

Obviously irrelevant in a single-node scenario.  Anyway, like I said, I can perform basic hdfs listings; I just can't upload new data.
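
(Concretely, the failure mode looks roughly like this, with the paths
illustrative and the error trace abbreviated:

  $ hadoop fs -ls /user/keith                # listings work fine
  $ hadoop fs -put small.txt /user/keith/small.txt
  ...could only be replicated to 0 nodes, instead of 1...

The put is what dies.)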

• Your DataNode instances have no hard disk space in their configured data directories. Action: look at the dfs.data.dir list in the node configurations, verify that at least one of the directories exists, and is writeable by the user running the Hadoop processes. Then look at the logs.

There's plenty of space, at least 50GB.

• Your DataNode instances have run out of space. Look at the disk capacity via the Namenode web pages. Delete old files. Compress under-used files. Buy more disks for existing servers (if there is room), upgrade the existing servers to bigger drives, or add some more servers.

Nope, 50 GB free; I'm only uploading a few KB at a time, maybe a few MB.

• The reserved space for a DN (as set in dfs.datanode.du.reserved) is greater than the remaining free space, so the DN thinks it has no free space.

I grepped all the files in the conf directory and couldn't find this parameter, so I don't really know anything about it.  At any rate, it seems rather esoteric; I doubt it is related to my problem.  Any thoughts on this?
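
(For what it's worth, if one did want to set it, I gather it would be a
byte count in hdfs-site.xml, something like the following, with the 1 GB
value purely illustrative:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <!-- bytes per volume to keep free for non-HDFS use -->
    <value>1073741824</value>
  </property>

Since I've never set it, it should be at whatever the default is.)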

• You may also get this message due to permissions, eg if JT can not create jobtracker.info on startup.

Meh, like I said, the system basically works...and then stops working.  The only explanation that would really make sense in that context is running out of space...which isn't happening.  If this were a permission error, or a configuration error, or anything weird like that, then the whole system would never have gotten up and running in the first place.

Why would a properly running hadoop system start exhibiting this error without running out of disk space?  THAT's the real question on the table here.

Any ideas?

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
________________________________________________________________________________


Re: could only be replicated to 0 nodes, instead of 1

Posted by Keith Wiley <kw...@keithwiley.com>.
Good to know.  The bottom line is I was really short-roping everything on resources.  I just need to jack the machine up some.

Thanks.

On Sep 4, 2012, at 19:41, Harsh J wrote:

> Keith,
> 
> The NameNode has a resource-checker thread in it by design, to help
> prevent on-disk metadata corruption in the event of filled-up
> dfs.namenode.name.dir disks, etc. By default, an NN will lock itself
> up if the free disk space (among its configured metadata mounts)
> reaches a value < 100 MB, controlled by
> dfs.namenode.resource.du.reserved. You can probably set that to 0 if
> you do not want such an automatic preventive measure. It's not exactly
> a need, just a check to help avoid accidental data loss due to
> unmonitored disk space.
> 
> On Tue, Sep 4, 2012 at 11:33 PM, Keith Wiley <kw...@keithwiley.com> wrote:
>> I had moved the data directory to the larger disk but left the namenode directory on the smaller disk figuring it didn't need much room.  Moving that to the larger disk seems to have improved the situation...although I'm still surprised the NN needed so much room.
>> 
>> Problem is solved for now.
>> 
>> 
>> Thanks.
>> ________________________________________________________________________________
>> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
>> 
>> "I used to be with it, but then they changed what it was.  Now, what I'm with
>> isn't it, and what's it seems weird and scary to me."
>>                                           --  Abe (Grandpa) Simpson
>> ________________________________________________________________________________
>> 
> 
> 
> 
> -- 
> Harsh J


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________


Re: could only be replicated to 0 nodes, instead of 1

Posted by Harsh J <ha...@cloudera.com>.
Keith,

The NameNode has a resource-checker thread in it by design, to help
prevent on-disk metadata corruption in the event of filled-up
dfs.namenode.name.dir disks, etc. By default, an NN will lock itself
up if the free disk space (among its configured metadata mounts)
reaches a value < 100 MB, controlled by
dfs.namenode.resource.du.reserved. You can probably set that to 0 if
you do not want such an automatic preventive measure. It's not exactly
a need, just a check to help avoid accidental data loss due to
unmonitored disk space.
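
To spell it out, disabling it would be an hdfs-site.xml entry on the
NameNode along these lines (a sketch, untested against your particular
CDH3 build; the value is in bytes):

  <property>
    <name>dfs.namenode.resource.du.reserved</name>
    <!-- free space the NN requires on each metadata mount;
         0 disables the automatic safe-mode lock described above -->
    <value>0</value>
  </property>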

On Tue, Sep 4, 2012 at 11:33 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> I had moved the data directory to the larger disk but left the namenode directory on the smaller disk figuring it didn't need much room.  Moving that to the larger disk seems to have improved the situation...although I'm still surprised the NN needed so much room.
>
> Problem is solved for now.
>
>
> Thanks.
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
>
> "I used to be with it, but then they changed what it was.  Now, what I'm with
> isn't it, and what's it seems weird and scary to me."
>                                            --  Abe (Grandpa) Simpson
> ________________________________________________________________________________
>



-- 
Harsh J


Re: could only be replicated to 0 nodes, instead of 1

Posted by Keith Wiley <kw...@keithwiley.com>.
I had moved the data directory to the larger disk but left the namenode directory on the smaller disk figuring it didn't need much room.  Moving that to the larger disk seems to have improved the situation...although I'm still surprised the NN needed so much room.
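
(For anyone who hits this later, the move itself was the obvious dance,
sketched here with made-up paths and with the caveat that HDFS must be
stopped while you do it:

  $ bin/stop-dfs.sh
  $ mv /small-disk/dfs/name /big-disk/dfs/name
  $ # now point dfs.name.dir in hdfs-site.xml at the new location
  $ bin/start-dfs.sh

On a packaged install the start/stop commands will differ, but the idea
is the same.)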

Problem is solved for now.


Thanks.
________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson
________________________________________________________________________________


Re: could only be replicated to 0 nodes, instead of 1

Posted by Suresh Srinivas <su...@hortonworks.com>.
Keith,

Assuming that you were seeing the problem at the time you captured the
namenode web UI info, it is not related to what I suspected. This might
be a good question for the CDH forums, given that this is not an Apache
release.

Regards,
Suresh

On Tue, Sep 4, 2012 at 10:20 AM, Keith Wiley <kw...@keithwiley.com> wrote:

> On Sep 4, 2012, at 10:05 , Suresh Srinivas wrote:
>
> > When these errors are thrown, please send the namenode web UI
> information. It has storage related information in the cluster summary.
> That will help debug.
>
> Sure thing.  Thanks.  Here's what I currently see.  It looks like the
> problem isn't the datanode, but rather the namenode.  Would you agree with
> that assessment?
>
> NameNode 'localhost:9000'
>
> Started:         Tue Sep 04 10:06:52 PDT 2012
> Version:         0.20.2-cdh3u3, 03b655719d13929bd68bb2c2f9cee615b389cea9
> Compiled:        Thu Jan 26 11:55:16 PST 2012 by root from Unknown
> Upgrades:        There are no upgrades in progress.
>
> Browse the filesystem
> Namenode Logs
> Cluster Summary
>
> Safe mode is ON. Resources are low on NN. Safe mode must be turned off
> manually.
> 1639 files and directories, 585 blocks = 2224 total. Heap Size is 39.55 MB
> / 888.94 MB (4%)
> Configured Capacity      :       49.21 GB
> DFS Used         :       9.9 MB
> Non DFS Used     :       2.68 GB
> DFS Remaining    :       46.53 GB
> DFS Used%        :       0.02 %
> DFS Remaining%   :       94.54 %
> Live Nodes       :       1
> Dead Nodes       :       0
> Decommissioning Nodes    :       0
> Number of Under-Replicated Blocks        :       5
>
> NameNode Storage:
>
> Storage Directory       Type    State
> /var/lib/hadoop-0.20/cache/hadoop/dfs/name      IMAGE_AND_EDITS Active
>
> Cloudera's Distribution including Apache Hadoop, 2012.
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "And what if we picked the wrong religion?  Every week, we're just making
> God
> madder and madder!"
>                                            --  Homer Simpson
>
> ________________________________________________________________________________
>
>


-- 
http://hortonworks.com/download/

Re: could only be replicated to 0 nodes, instead of 1

Posted by Keith Wiley <kw...@keithwiley.com>.
On Sep 4, 2012, at 10:05, Suresh Srinivas wrote:

> When these errors are thrown, please send the namenode web UI information. It has storage related information in the cluster summary. That will help debug.

Sure thing.  Thanks.  Here's what I currently see.  It looks like the problem isn't the datanode, but rather the namenode.  Would you agree with that assessment?

NameNode 'localhost:9000'

Started:	 Tue Sep 04 10:06:52 PDT 2012
Version:	 0.20.2-cdh3u3, 03b655719d13929bd68bb2c2f9cee615b389cea9 
Compiled:	 Thu Jan 26 11:55:16 PST 2012 by root from Unknown
Upgrades:	 There are no upgrades in progress.

Browse the filesystem
Namenode Logs
Cluster Summary

Safe mode is ON. Resources are low on NN. Safe mode must be turned off manually.
1639 files and directories, 585 blocks = 2224 total. Heap Size is 39.55 MB / 888.94 MB (4%) 
Configured Capacity	 :	 49.21 GB
DFS Used	 :	 9.9 MB
Non DFS Used	 :	 2.68 GB
DFS Remaining	 :	 46.53 GB
DFS Used%	 :	 0.02 %
DFS Remaining%	 :	 94.54 %
Live Nodes	 :	 1
Dead Nodes	 :	 0
Decommissioning Nodes	 :	 0
Number of Under-Replicated Blocks	 :	 5

NameNode Storage:

Storage Directory	Type	State
/var/lib/hadoop-0.20/cache/hadoop/dfs/name	IMAGE_AND_EDITS	Active

Cloudera's Distribution including Apache Hadoop, 2012.
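
(If I'm reading that banner right, then once the underlying disk
pressure on the NN is dealt with, safe mode has to be exited by hand,
presumably via the 0.20-era command:

  $ hadoop dfsadmin -safemode leave

though I haven't run that yet.)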

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________


Re: could only be replicated to 0 nodes, instead of 1

Posted by Suresh Srinivas <su...@hortonworks.com>.
- A datanode typically needs to keep up to 5 blocks' worth of space
(5 x the HDFS block size) free.
- Disk space is also used by mapreduce jobs to store temporary shuffle
spills. This is the kind of non-HDFS usage that
"dfs.datanode.du.reserved" exists to set space aside for. The
configuration goes in hdfs-site.xml; if you have not configured it,
the reserved space is 0. Beyond mapreduce, other files might take up
disk space as well.

When these errors are thrown, please send the namenode web UI
information. It has storage-related information in the cluster summary.
That will help debug.
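
(Alternatively, the same storage summary can usually be pulled from the
shell, if that is easier to capture:

  hadoop dfsadmin -report

It prints the configured capacity, DFS used/remaining, and per-datanode
details.)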


On Tue, Sep 4, 2012 at 9:41 AM, Keith Wiley <kw...@keithwiley.com> wrote:

> I've been running up against the good old-fashioned "replicated to 0
> nodes" gremlin quite a bit recently.  My system (a set of processes
> interacting with hadoop, and of course hadoop itself) runs for a while (a
> day or so) and then I get plagued with these errors.  This is a very simple
> system, a single node running pseudo-distributed.  Obviously, the
> replication factor is implicitly 1 and the datanode is the same machine as
> the namenode.  None of the typical culprits seem to explain the situation
> and I'm not sure what to do.  I'm also not sure how I'm getting around it
> so far.  I fiddle desperately for a few hours and things start running
> again, but that's not really a solution...I've tried stopping and
> restarting hdfs, but that doesn't seem to improve things.
>
> So, to go through the common suspects one by one, as quoted on
> http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:
>
> • No DataNode instances being up and running. Action: look at the servers,
> see if the processes are running.
>
> I can interact with hdfs through the command line (doing directory
> listings for example).  Furthermore, I can see that the relevant java
> processes are all running (NameNode, SecondaryNameNode, DataNode,
> JobTracker, TaskTracker).
>
> • The DataNode instances cannot talk to the server, through networking or
> Hadoop configuration problems. Action: look at the logs of one of the
> DataNodes.
>
> Obviously irrelevant in a single-node scenario.  Anyway, like I said, I
> can perform basic hdfs listings, I just can't upload new data.
>
> • Your DataNode instances have no hard disk space in their configured data
> directories. Action: look at the dfs.data.dir list in the node
> configurations, verify that at least one of the directories exists, and is
> writeable by the user running the Hadoop processes. Then look at the logs.
>
> There's plenty of space, at least 50GB.
>
> • Your DataNode instances have run out of space. Look at the disk capacity
> via the Namenode web pages. Delete old files. Compress under-used files.
> Buy more disks for existing servers (if there is room), upgrade the
> existing servers to bigger drives, or add some more servers.
>
> Nope, 50 GB free; I'm only uploading a few KB at a time, maybe a few MB.
>
> • The reserved space for a DN (as set in dfs.datanode.du.reserved) is
> greater than the remaining free space, so the DN thinks it has no free
> space.
>
> I grepped all the files in the conf directory and couldn't find this
> parameter so I don't really know anything about it.  At any rate, it seems
> rather esoteric, I doubt it is related to my problem.  Any thoughts on this?
>
> • You may also get this message due to permissions, eg if JT can not
> create jobtracker.info on startup.
>
> Meh, like I said, the system basically works...and then stops working.
>  The only explanation that would really make sense in that context is
> running out of space...which isn't happening. If this were a permission
> error, or a configuration error, or anything weird like that, then the
> whole system would never get up and running in the first place.
>
> Why would a properly running hadoop system start exhibiting this error
> without running out of disk space?  THAT's the real question on the table
> here.
>
> Any ideas?
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "Yet mark his perfect self-contentment, and hence learn his lesson, that
> to be
> self-contented is to be vile and ignorant, and that to aspire is better
> than to
> be blindly and impotently happy."
>                                            --  Edwin A. Abbott, Flatland
>
> ________________________________________________________________________________
>
>


-- 
http://hortonworks.com/download/

Re: could only be replicated to 0: TL;DR

Posted by Harsh J <ha...@cloudera.com>.
Hi Keith,

See http://search-hadoop.com/m/z9oYUIhhUg and the method isGoodTarget
under http://search-hadoop.com/c/Hadoop:/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java||isGoodTarget

On Tue, Sep 4, 2012 at 10:24 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> If the datanode is definitely not running out of space, and the overall system has basically been working leading up to the "replicated to 0 nodes" error (which proves the configuration and permissions are all basically correct), then what other explanations are there for why hdfs would suddenly start exhibiting this error out of the blue?
>
> Thanks.
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                            --  Yoda
> ________________________________________________________________________________
>



-- 
Harsh J

Re: could only be replicated to 0: TL;DR

Posted by Keith Wiley <kw...@keithwiley.com>.
If the datanode is definitely not running out of space, and the overall system has been working fine right up until the "replicated to 0 nodes" error appears (which strongly suggests the configuration and permissions are basically correct), then what other explanations are there for why HDFS would suddenly start exhibiting this error out of the blue?

Thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: could only be replicated to 0 nodes, instead of 1

Posted by Suresh Srinivas <su...@hortonworks.com>.
- A datanode is typically required to keep some space free, up to 5 free
blocks' worth (at the HDFS block size), before it is considered a valid
write target.
- Disk space on the same volumes is also consumed outside HDFS, for
example by MapReduce jobs storing temporary shuffle spills. This is what
"dfs.datanode.du.reserved" is meant to account for. The configuration
lives in hdfs-site.xml; if you have not configured it, the reserved
space defaults to 0. And it is not only MapReduce: any other files on
those volumes can also take up the disk space. A minimal example of
setting the property follows below.
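
For example, an hdfs-site.xml entry reserving 10 GB per volume for
non-HDFS usage would look like this (the value is in bytes; 10 GB is
purely an illustrative figure, size it to your actual non-HDFS load):

    <!-- hdfs-site.xml: space per volume set aside for non-HDFS files -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
    </property>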

When these errors are thrown, please send the namenode web UI
information. Its cluster summary includes storage-related details that
will help in debugging.
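
If capturing the web UI is inconvenient, the same storage summary can
also be dumped from the command line, e.g.:

    # Prints configured capacity, DFS used/remaining, and per-datanode
    # status; exact field names vary a little between Hadoop versions.
    hadoop dfsadmin -report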


On Tue, Sep 4, 2012 at 9:41 AM, Keith Wiley <kw...@keithwiley.com> wrote:

> I've been running up against the good old fashioned "replicated to 0
> nodes" gremlin quite a bit recently.  My system (a set of processes
> interacting with hadoop, and of course hadoop itself) runs for a while (a
> day or so) and then I get plagued with these errors.  This is a very simple
> system, a single node running pseudo-distributed.  Obviously, the
> replication factor is implicitly 1 and the datanode is the same machine as
> the namenode.  None of the typical culprits seem to explain the situation
> and I'm not sure what to do.  I'm also not sure how I'm getting around it
> so far.  I fiddle desperately for a few hours and things start running
> again, but that's not really a solution...I've tried stopping and
> restarting hdfs, but that doesn't seem to improve things.
>
> So, to go through the common suspects one by one, as quoted on
> http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:
>
> • No DataNode instances being up and running. Action: look at the servers,
> see if the processes are running.
>
> I can interact with hdfs through the command line (doing directory
> listings for example).  Furthermore, I can see that the relevant java
> processes are all running (NameNode, SecondaryNameNode, DataNode,
> JobTracker, TaskTracker).
>
> • The DataNode instances cannot talk to the server, through networking or
> Hadoop configuration problems. Action: look at the logs of one of the
> DataNodes.
>
> Obviously irrelevant in a single-node scenario.  Anyway, like I said, I
> can perform basic hdfs listings, I just can't upload new data.
>
> • Your DataNode instances have no hard disk space in their configured data
> directories. Action: look at the dfs.data.dir list in the node
> configurations, verify that at least one of the directories exists, and is
> writeable by the user running the Hadoop processes. Then look at the logs.
>
> There's plenty of space, at least 50GB.
>
> • Your DataNode instances have run out of space. Look at the disk capacity
> via the Namenode web pages. Delete old files. Compress under-used files.
> Buy more disks for existing servers (if there is room), upgrade the
> existing servers to bigger drives, or add some more servers.
>
> Nope, 50 GB free, I'm only uploading a few KB at a time, maybe a few MB.
>
> • The reserved space for a DN (as set in dfs.datanode.du.reserved is
> greater than the remaining free space, so the DN thinks it has no free space
>
> I grepped all the files in the conf directory and couldn't find this
> parameter so I don't really know anything about it.  At any rate, it seems
> rather esoteric, I doubt it is related to my problem.  Any thoughts on this?
>
> • You may also get this message due to permissions, eg if JT can not
> create jobtracker.info on startup.
>
> Meh, like I said, the system basically works...and then stops working.
>  The only explanation that would really make sense in that context is
> running out of space...which isn't happening. If this were a permission
> error, or a configuration error, or anything weird like that, then the
> whole system would never get up and running in the first place.
>
> Why would a properly running hadoop system start exhibiting this error
> without running out of disk space?  THAT's the real question on the table
> here.
>
> Any ideas?
>
>
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "Yet mark his perfect self-contentment, and hence learn his lesson, that
> to be
> self-contented is to be vile and ignorant, and that to aspire is better
> than to
> be blindly and impotently happy."
>                                            --  Edwin A. Abbott, Flatland
>
> ________________________________________________________________________________
>
>


-- 
http://hortonworks.com/download/
