Posted to common-user@hadoop.apache.org by Ossi <lo...@gmail.com> on 2011/10/21 11:26:04 UTC

lost data with 1 failed datanode and replication factor 3 in 6 node cluster

hi,

We managed to lose data when 1 datanode broke down in a cluster of 6
datanodes with replication factor 3.

As far as I know, that shouldn't happen, since each block should have a
copy on 3 different hosts. So losing even 2 nodes should be fine.

Earlier we did some tests with replication factor 2, but reverted from that:
   88  2011-10-12 06:46:49 hadoop dfs -setrep -w 2 -R /
  148  2011-10-12 10:22:09 hadoop dfs -setrep -w 3 -R /
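
For reference, a quick way to double-check what factor a given path
actually ended up with (assuming the usual 0.20-era shell; /some/path is
just a placeholder):

    hadoop dfs -ls /some/path   # for files, the 2nd column is the replication factor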

The lost data was generated after the replication factor was set back to 3.
And even if the replication factor had been 2, data shouldn't have been
lost, right?

We wonder how this is possible and in what situations it could happen.


br, Ossi

Re: lost data with 1 failed datanode and replication factor 3 in 6 node cluster

Posted by Uma Maheswara Rao G 72686 <ma...@huawei.com>.
----- Original Message -----
From: Ossi <lo...@gmail.com>
Date: Friday, October 21, 2011 2:57 pm
Subject: lost data with 1 failed datanode and replication factor 3 in 6 node cluster
To: common-user@hadoop.apache.org

> hi,
> 
> We managed to lose data when 1 datanode broke down in a cluster of 6
> datanodes with replication factor 3.
> 
> As far as I know, that shouldn't happen, since each block should have
> a copy on 3 different hosts. So losing even 2 nodes should be fine.
> 
> Earlier we did some tests with replication factor 2, but reverted
> from that:
>   88  2011-10-12 06:46:49 hadoop dfs -setrep -w 2 -R /
>  148  2011-10-12 10:22:09 hadoop dfs -setrep -w 3 -R /
> 
> The lost data was generated after the replication factor was set back
> to 3.
First of all, the question is: how are you measuring the data loss?
Are you seeing read failures with missing-block exceptions?

My guess is that you are measuring the data loss by the DFS Used space.
If I am correct, DFS Used is calculated only from the DataNodes that are
currently available. So, when one datanode goes down, DFS Used and DFS
Remaining will drop accordingly. That cannot be taken as data loss.
Please correct me if my understanding of the question is wrong.
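
For what it's worth, two quick checks should separate "less DFS Used" from
real block loss (standard commands in the 0.20-era shell):

    hadoop dfsadmin -report   # per-datanode capacity/usage; a dead node drops out of the totals
    hadoop fsck /             # the summary reports missing, corrupt and under-replicated blocks

If fsck says the filesystem is healthy and reports no missing blocks, the
drop in DFS Used is just the dead node's share being excluded, not lost data.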
> And even if the replication factor had been 2, data shouldn't have
> been lost, right?
> 
> We wonder how this is possible and in what situations it could
> happen.
> 
> br, Ossi
> 
Regards,
Uma

Re: lost data with 1 failed datanode and replication factor 3 in 6 node cluster

Posted by Ossi <lo...@gmail.com>.
On Fri, Oct 21, 2011 at 3:04 PM, modemide <mo...@gmail.com> wrote:

> Hi Ossi,
>
> I'm not sure about how experienced you are with hadoop.  I'm still
> learning myself.  But here's my guess as to what happened.  I
> apologize in advance if this is below your current knowledge of
> Hadoop.
>

Hi, and thanks for the reply (the first one for my 3 posts)! :)
I'm not sure about my experience either, and I'm still learning too (like
most of us, I'd assume).


>
> There are a couple of pieces which I know of that determine file
> replication in your situation.  One is manually setting the
> replication factor; the other is the config on the client from which
> you uploaded the data.
>

So far we have only used data either generated by benchmark suites or our
own data copied manually to HDFS (with hadoop -put... etc).
The lost data was generated by the TeraSort benchmark suite, and probably
all of the data has been generated or copied from the namenode server
(which runs no datanode).

Anyway, since all the lost data was TeraSort output, I did some searching
and found out that this is intentional:
"The output of the reduce has replication set to 1, instead of the default
3, because the contest does not require the output data be replicated on to
multiple nodes."
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/examples/terasort/package-summary.html
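
So if we want the TeraSort output to survive a node failure, presumably we
just have to raise its replication afterwards; the output directory name
below is only an example:

    hadoop dfs -setrep -w 3 -R /terasort-output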


> Assuming your namenode didn't complain about missing blocks on its web
> control panel, you may have had your client set to a replication
> factor of 1.  If this was the case, the file that was uploaded (from
> the client) to HDFS will have a replication factor of one.
>
> A couple of ways to confirm/disprove this theory:
> 1) Go to the namenode control panel (http://<NAMENODE>:50070 by default).
>     Browse the file system.
>     Navigate to a file that was created after setting the replication
>     factor on the cluster.
>     The 4th column from the left is a field called "replication"; it
>     should tell you what the replication factor is for any particular
>     file.
>

So far we have mostly used command line tools to work with Hadoop. I assume
that hadoop fsck / -files -locations -blocks gives pretty much the same
information?

Anyway, this case is now solved. Great, and thanks for helping!
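
In case it helps someone else hitting the same thing: a rough way to spot
single-replica blocks from the command line (this assumes fsck's per-block
output contains "repl=N", which it does in the versions I've seen; the path
is only an example):

    hadoop fsck /terasort-output -files -blocks -locations | grep "repl=1"

Any matching lines are blocks with a single replica, i.e. exactly the ones
a single dead datanode can take with it.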


>
> 2) On the client that you use to upload files to HDFS, check your
> hdfs-site.xml
>    Should be located in $HADOOP_HOME/conf/hdfs-site.xml
>
>
> Hope that helps!
>
>
> Tim
>

Re: lost data with 1 failed datanode and replication factor 3 in 6 node cluster

Posted by modemide <mo...@gmail.com>.
Hi Ossi,

I'm not sure about how experienced you are with hadoop.  I'm still
learning myself.  But here's my guess as to what happened.  I
apologize in advance if this is below your current knowledge of
Hadoop.

There are a couple of pieces which I know of that determine file
replication in your situation.  One is manually setting the replication
factor; the other is the config on the client from which you uploaded the
data.

Assuming your namenode didn't complain about missing blocks on its web
control panel, you may have had your client set to a replication
factor of 1.  If this was the case, the file that was uploaded (from
the client) to HDFS will have a replication factor of one.
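
If that turns out to be the cause, I believe the shell also accepts a
generic -D override per upload, without touching any config files; the
file name and destination path here are just examples, and I haven't
verified this on your exact version:

    hadoop dfs -D dfs.replication=3 -put somefile.txt /data/somefile.txt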

A couple of ways to confirm/disprove this theory:
1) Go to the namenode control panel (http://<NAMENODE>:50070 by default).
    Browse the file system.
    Navigate to a file that was created after setting the replication
    factor on the cluster.
    The 4th column from the left is a field called "replication"; it
    should tell you what the replication factor is for any particular
    file.

2) On the client that you use to upload files to HDFS, check your hdfs-site.xml
    Should be located in $HADOOP_HOME/conf/hdfs-site.xml
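
The property to look for there is dfs.replication; a minimal client-side
hdfs-site.xml setting it to 3 would look roughly like this (just a sketch):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

If the client that did the upload had this set to 1, every file written
from that client got a single replica, regardless of the cluster-side
default.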


Hope that helps!


Tim






On 10/21/11, Ossi <lo...@gmail.com> wrote:
> hi,
>
> We managed to lost data when 1 datanode broke down in a cluster of 6
> datanodes with
> replication factor 3.
>
> As far as I know, that shouldn't happen, since each blocks should have 1
> copy in
> 3 different hosts. So, loosing even 2 nodes should be fine.
>
> Earlier we did some tests with replication factor 2, but reverted from that:
>    88  2011-10-12 06:46:49 hadoop dfs -setrep -w 2 -R /
>   148  2011-10-12 10:22:09 hadoop dfs -setrep -w 3 -R /
>
> The lost data was generated after replication factor was set back to 3.
> And even if replication factor would have been 2, data shouldn't have been
> lost, right?
>
> We wonder how that is possible and in what situations that could happen?
>
>
> br, Ossi
>