Posted to common-user@hadoop.apache.org by Ted Dunning <td...@veoh.com> on 2008/02/08 02:05:07 UTC

Namenode fails to replicate file


Chris Kline reported a problem in early January where a file which had too
few replicated blocks did not get replicated until a DFS restart.

I just saw a similar issue.  I had a file that had a block with 1 replica (2
required) that did not get replicated.  I changed the number of required
replicas, but that did not trigger any replication.  Changing the number of
required replicas on other files did get them replicated.

I eventually copied the file to temp, deleted the original and moved the
copy back to the original place.  I was also able to read the entire file,
which shows that the problem was not due to slow reporting from a down
datanode.
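
A minimal sketch of that copy, delete and move workaround, assuming the
`hadoop dfs` shell of this era (newer releases spell it `hdfs dfs`).  The
names HADOOP, DRY_RUN, run and rewrite_file are hypothetical helpers made up
for illustration, and the temp path is whatever scratch location you choose.
Rewriting the file creates fresh blocks, which is presumably why it sidesteps
the stuck block; it does not explain the underlying problem.

```shell
#!/bin/sh
# Sketch only: HADOOP, DRY_RUN, run and rewrite_file are illustrative names,
# not part of Hadoop itself.
HADOOP="${HADOOP:-bin/hadoop}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rewrite_file <src> <tmp>
# Copy the file aside, delete the original, then move the copy back.
rewrite_file() {
    src="$1"
    tmp="$2"
    # Stop at the first failure so the original is never deleted
    # unless the copy succeeded.
    run "$HADOOP" dfs -cp "$src" "$tmp" || return 1
    run "$HADOOP" dfs -rm "$src" || return 1
    run "$HADOOP" dfs -mv "$tmp" "$src"
}
```

Setting DRY_RUN=1 before calling rewrite_file prints the three commands
without touching the filesystem, which is worth doing first on a live
cluster.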

This happened just after I had a node failure, which was why I was messing
with replication at all.  Since I was in the process of increasing the
replication on nearly 10,000 large files, my log files are full of other
stuff, but I am pretty sure that there is a bug here.

This was on a relatively small cluster with 13 data nodes.

It also brings up a related issue that has come up before in that there are
times when you may want to increase the number of replicas of a file right
NOW.  I don't know of any way to force this replication.  Is there such a
way?

Re: Namenode fails to replicate file

Posted by Ted Dunning <td...@veoh.com>.


I will see if I can replicate the problem and do as you suggest.


On 2/8/08 4:29 PM, "Raghu Angadi" <ra...@yahoo-inc.com> wrote:

> Ted Dunning wrote:
>> That makes it wait, but I don't think it increases the urgency on the part
>> of the namenode.
>> 
>> As an interesting experiment, I had a cluster with lots of pending
>> replication to do that was happening slowly.  Restarting the name node
>> caused the rate of replication to increase massively.  The difference was
>> highly visible on the ganglia graph because the amount of I/O wait time on
>> the cluster increased to >15% from near zero.
> 
> I think we should file a jira on this. There is no reason for the Namenode
> to replicate faster after restart than before, unless it is remembering
> something that it should not. The issue Chris Kline reported earlier in
> Jan was unresolved as well, even after looking through multiple logs.
> 
> Ted, could you run 'dfsadmin -metasave' and attach relevant info to a
> new jira? Any or all the logs help.
> 
> Raghu.
> 
>

Re: Namenode fails to replicate file

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

Ted Dunning wrote:
> That makes it wait, but I don't think it increases the urgency on the part
> of the namenode.
> 
> As an interesting experiment, I had a cluster with lots of pending
> replication to do that was happening slowly.  Restarting the name node
> caused the rate of replication to increase massively.  The difference was
> highly visible on the ganglia graph because the amount of I/O wait time on
> the cluster increased to >15% from near zero.

I think we should file a jira on this. There is no reason for the Namenode
to replicate faster after restart than before, unless it is remembering
something that it should not. The issue Chris Kline reported earlier in
Jan was unresolved as well, even after looking through multiple logs.

Ted, could you run 'dfsadmin -metasave' and attach relevant info to a 
new jira? Any or all the logs help.

Raghu.
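
A minimal sketch of collecting that information, assuming the `hadoop`
launcher of this era.  The helper names (HADOOP, DRY_RUN, run,
capture_namenode_state) and the output filename are hypothetical.  The
metasave file is written by the namenode into its own log directory
(hadoop.log.dir), not onto the machine where the command is typed.

```shell
#!/bin/sh
# Sketch only: HADOOP, DRY_RUN, run and capture_namenode_state are
# illustrative names, not part of Hadoop itself.
HADOOP="${HADOOP:-bin/hadoop}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# capture_namenode_state <metasave-filename>
capture_namenode_state() {
    # Ask the namenode to dump its replication bookkeeping to the named
    # file in its log directory.
    run "$HADOOP" dfsadmin -metasave "$1" || return 1
    # Print the per-datanode summary as extra context for the report.
    run "$HADOOP" dfsadmin -report
}
```

Running this before and after a namenode restart would give two snapshots to
compare when attaching information to the jira.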

Re: Namenode fails to replicate file

Posted by Ted Dunning <td...@veoh.com>.

That makes it wait, but I don't think it increases the urgency on the part
of the namenode.

As an interesting experiment, I had a cluster with lots of pending
replication to do that was happening slowly.  Restarting the name node
caused the rate of replication to increase massively.  The difference was
highly visible on the ganglia graph because the amount of I/O wait time on
the cluster increased to >15% from near zero.


On 2/7/08 11:39 PM, "dhruba Borthakur" <dh...@yahoo-inc.com> wrote:

> You have to use the -w parameter to the setrep command to make it wait
> till the replication is complete. The following command
> 
> bin/hadoop dfs -setrep 10 -w filename
> 
> will block till all blocks of the file achieve a replication factor of
> 10.
> 
> Thanks,
> dhruba
> 
> -----Original Message-----
> From: Tim Wintle [mailto:tim.wintle@teamrubber.com]
> Sent: Thursday, February 07, 2008 11:05 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Namenode fails to replicate file
> 
> Doesn't the -setrep command force the replication to be increased
> immediately?
> 
> ./hadoop dfs -setrep [replication] path
> 
> (I may have misunderstood)

RE: Namenode fails to replicate file

Posted by dhruba Borthakur <dh...@yahoo-inc.com>.

You have to use the -w parameter to the setrep command to make it wait
till the replication is complete. The following command

bin/hadoop dfs -setrep 10 -w filename

will block till all blocks of the file achieve a replication factor of
10.

Thanks,
dhruba
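
A minimal sketch of wrapping that command, with a follow-up fsck to list
where each block's replicas ended up.  The helper names (HADOOP, DRY_RUN,
run, raise_replication) are hypothetical, and the flag order follows the
documented usage `-setrep [-R] [-w] <rep> <path>`.  Note that -w only makes
the client wait; nothing here is claimed to make the namenode schedule the
copies any sooner.

```shell
#!/bin/sh
# Sketch only: HADOOP, DRY_RUN, run and raise_replication are illustrative
# names, not part of Hadoop itself.
HADOOP="${HADOOP:-bin/hadoop}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# raise_replication <factor> <path>
raise_replication() {
    rep="$1"
    path="$2"
    # -w blocks until the file reaches the requested replication factor.
    run "$HADOOP" dfs -setrep -w "$rep" "$path" || return 1
    # fsck then lists every block of the file with its replica locations.
    run "$HADOOP" fsck "$path" -files -blocks -locations
}
```

If the setrep call never returns for one particular file while others
complete, that matches the stuck-block symptom described at the start of
this thread.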

-----Original Message-----
From: Tim Wintle [mailto:tim.wintle@teamrubber.com] 
Sent: Thursday, February 07, 2008 11:05 PM
To: core-user@hadoop.apache.org
Subject: Re: Namenode fails to replicate file

Doesn't the -setrep command force the replication to be increased
immediately?

./hadoop dfs -setrep [replication] path

(I may have misunderstood)



Re: Namenode fails to replicate file

Posted by Ted Dunning <td...@veoh.com>.

It doesn't happen immediately.  It happens SLOWLY.


On 2/7/08 11:05 PM, "Tim Wintle" <ti...@teamrubber.com> wrote:

> Doesn't the -setrep command force the replication to be increased
> immediately?
> 
> ./hadoop dfs -setrep [replication] path
> 
> (I may have misunderstood)

Re: Namenode fails to replicate file

Posted by Tim Wintle <ti...@teamrubber.com>.

Doesn't the -setrep command force the replication to be increased
immediately?

./hadoop dfs -setrep [replication] path

(I may have misunderstood)

