Posted to common-user@hadoop.apache.org by Mark Kerzner <ma...@gmail.com> on 2011/03/04 00:44:44 UTC

Is that a good practice?

Hi,

In my small development cluster I have a master/slave node and a slave node,
and I shut down the slave node at night. I often see that my HDFS is
corrupted, and I have to reformat the name node and delete the data
directory.

It finally dawns on me that with such a small cluster I'd better shut the
daemons down, for otherwise they are trying too hard to compensate for the
missing node and eventually it goes bad. Is my understanding correct?

Thank you,
Mark

Re: Is that a good practice?

Posted by Mark Kerzner <ma...@gmail.com>.
Eric,

I shut it down at night because the slave server is in my bedroom, and I
use a replication factor of 1 because that is what my CDH install set up, so
I accepted it. I will bump it up to 3.
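
For what it's worth, a minimal sketch of that bump, assuming the 0.20-era
dfs.replication property and that existing files should be raised as well:

    <!-- hdfs-site.xml: default replication factor for files written from now on -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # Existing files keep the factor they were written with, so raise it explicitly.
    # On a two-node cluster the third replica can never be placed, so fsck will
    # report the files as under-replicated, but each node still holds a full copy.
    hadoop fs -setrep -R 3 /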

But the most important advice that you give is "put it into safe mode" - and
that is what I am going to do whenever I am not working on it,
because it is purely my development cluster. I might even shut the daemons
down completely.
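
A sketch of that safe-mode routine, assuming the 0.20-era dfsadmin command
(newer releases spell it hdfs dfsadmin):

    # Before taking the slave down for the night: stop the NN from scheduling
    # re-replication or block deletions while the DataNode is away.
    hadoop dfsadmin -safemode enter
    hadoop dfsadmin -safemode get      # should print "Safe mode is ON"

    # The next morning, once the slave and its DataNode are back up:
    hadoop dfsadmin -safemode leave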

Thank you,
Mark

On Thu, Mar 3, 2011 at 5:55 PM, Eric Sammer <es...@cloudera.com> wrote:

> On Thu, Mar 3, 2011 at 6:44 PM, Mark Kerzner <ma...@gmail.com> wrote:
>
>> Hi,
>>
>> In my small development cluster I have a master/slave node and a slave node,
>> and I shut down the slave node at night. I often see that my HDFS is
>> corrupted, and I have to reformat the name node and delete the data
>> directory.
>>
>
> Why do you shut down the slave at night? HDFS should only be corrupted if
> you're missing all copies of a block. With a replication factor of 3
> (default) you should have 100% of the data on both nodes (if you only have 2
> nodes). If you've dialed it down to 1, simply starting the slave back up
> should "un-corrupt" HDFS. You definitely don't want to be doing this to HDFS
> regularly (dropping nodes from the cluster and re-adding them) unless you're
> trying to test HDFS' failure semantics.
>
>> It finally dawns on me that with such a small cluster I'd better shut the
>> daemons down, for otherwise they are trying too hard to compensate for the
>> missing node and eventually it goes bad. Is my understanding correct?
>>
>
> It doesn't "eventually go bad." If the NN sees a DN disappear it may start
> re-replicating data to another node. In such a small cluster, maybe there's
> nowhere else to get the blocks from, but I bet you dialed the replication
> factor down to 1 (or have code that writes files with a rep factor of 1 like
> teragen / terasort).
>
> In short, if you're going to shut down nodes like this, put the NN into safe
> mode so it doesn't freak out (which will also make the cluster unusable
> during that time), but there's definitely no need to be reformatting HDFS.
> Just re-introduce the DN you shut down to the cluster.
>
>
>>
>> Thank you,
>> Mark
>>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>

Re: Is that a good practice?

Posted by Eric Sammer <es...@cloudera.com>.
On Thu, Mar 3, 2011 at 6:44 PM, Mark Kerzner <ma...@gmail.com> wrote:

> Hi,
>
> In my small development cluster I have a master/slave node and a slave node,
> and I shut down the slave node at night. I often see that my HDFS is
> corrupted, and I have to reformat the name node and delete the data
> directory.
>

Why do you shut down the slave at night? HDFS should only be corrupted if
you're missing all copies of a block. With a replication factor of 3
(default) you should have 100% of the data on both nodes (if you only have 2
nodes). If you've dialed it down to 1, simply starting the slave back up
should "un-corrupt" HDFS. You definitely don't want to be doing this to HDFS
regularly (dropping nodes from the cluster and re-adding them) unless you're
trying to test HDFS' failure semantics.
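
One way to tell "corrupt" apart from "blocks missing because a node is down" is
fsck; a sketch, assuming it is run against the cluster's default filesystem:

    # Overall health: HEALTHY vs. CORRUPT, plus counts of missing, corrupt and
    # under-replicated blocks.
    hadoop fsck /

    # Per-file detail: which blocks each file has and which DataNodes hold them.
    hadoop fsck / -files -blocks -locations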

> It finally dawns on me that with such a small cluster I'd better shut the
> daemons down, for otherwise they are trying too hard to compensate for the
> missing node and eventually it goes bad. Is my understanding correct?
>

It doesn't "eventually go bad." If the NN sees a DN disappear it may start
re-replicating data to another node. In such a small cluster, maybe there's
nowhere else to get the blocks from, but I bet you dialed the replication
factor down to 1 (or have code that writes files with a rep factor of 1 like
teragen / terasort).

In short, if you're going to shut down nodes like this, put the NN into safe
mode so it doesn't freak out (which will also make the cluster unusable
during that time), but there's definitely no need to be reformatting HDFS.
Just re-introduce the DN you shut down to the cluster.
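
A sketch of that re-introduction, assuming the stock hadoop-daemon.sh scripts
from an 0.20-era tarball or CDH install:

    # On the slave, once it is powered back on:
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker  # only if it also runs MapReduce

    # On the master, once the DataNode has re-registered with the NN:
    hadoop dfsadmin -safemode leave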


>
> Thank you,
> Mark
>

-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: Is that a good practice?

Posted by Mark Kerzner <ma...@gmail.com>.
Harsh,

Indeed, this bit me a while back, but now the default Cloudera
distribution configures them outside of /tmp.

Really, as Eric pointed out, I was making failure a regular occurrence by
bringing one computer down.

Thank you,
Mark

On Thu, Mar 3, 2011 at 10:05 PM, Harsh J <qw...@gmail.com> wrote:

> This appears to me like the simple case of your OS clearing out your
> /tmp at every boot. You will lose all data + fsimage if you haven't
> configured your dfs.data.dir and dfs.name.dir to not be located on
> /tmp.
>
> On Fri, Mar 4, 2011 at 5:14 AM, Mark Kerzner <ma...@gmail.com>
> wrote:
> > It finally dawns on me that with such a small cluster I'd better shut the
> > daemons down, for otherwise they are trying too hard to compensate for the
> > missing node and eventually it goes bad. Is my understanding correct?
>
> Are your dfs.data.dir and/or dfs.name.dir properties pointing to
> locations on /tmp, by the way? This appears to me like the simple case
> of your OS clearing out your /tmp on boot. You will lose all data +
> fsimage this way if you haven't configured your dfs.data.dir and
> dfs.name.dir to not be located on /tmp.
>
> --
> Harsh J
> www.harshj.com
>

Re: Is that a good practice?

Posted by Harsh J <qw...@gmail.com>.
This appears to me like the simple case of your OS clearing out your
/tmp at every boot. You will lose all data + fsimage if you haven't
configured your dfs.data.dir and dfs.name.dir to not be located on
/tmp.

On Fri, Mar 4, 2011 at 5:14 AM, Mark Kerzner <ma...@gmail.com> wrote:
> It finally dawns on me that with such a small cluster I'd better shut the
> daemons down, for otherwise they are trying too hard to compensate for the
> missing node and eventually it goes bad. Is my understanding correct?

Are your dfs.data.dir and/or dfs.name.dir properties pointing to
locations on /tmp, by the way? This appears to me like the simple case
of your OS clearing out your /tmp on boot. You will lose all data +
fsimage this way if you haven't configured your dfs.data.dir and
dfs.name.dir to not be located on /tmp.
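
A minimal hdfs-site.xml sketch that keeps both directories out of /tmp; the
paths below are only placeholders for whatever persistent disk you have:

    <!-- hdfs-site.xml, 0.20-era property names -->
    <property>
      <name>dfs.name.dir</name>   <!-- NameNode fsimage and edit log -->
      <value>/var/lib/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>   <!-- DataNode block storage -->
      <value>/var/lib/hadoop/dfs/data</value>
    </property>

Without these, both default to subdirectories of hadoop.tmp.dir, which itself
defaults to /tmp/hadoop-${user.name}, hence the wipe on reboot.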

-- 
Harsh J
www.harshj.com