Posted to common-user@hadoop.apache.org by "Jones, Nick" <ni...@amd.com> on 2010/10/27 15:40:39 UTC

Large amount of corruption after balancer

Hello everyone,

I just recently started running the balancer to fix job errors where a particular task runs out of local disk; however, I've noticed that I usually end up with a significant amount of corruption after it completes.  Has anyone else observed this behavior?

I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer completes successfully.

Thanks.

Nick Jones

Re: Large amount of corruption after balancer

Posted by Brian Bockelman <bb...@cse.unl.edu>.

Hi Nick,

Sorry, I can only come up with a longshot theory.  If the files are corrupted, then that would have happened when block A got copied from node X to Y.  The copying logic is independent of the balancer - the balancer just requests that the copy be made.  After the transfer, the destination block gets checksummed and the checksum is reported to the NN.  If the file is then over-replicated, one of the other copies gets deleted.

If there's a bug in the deletion logic and the new copy is corrupt, you could end up deleting the wrong copy.  A bug like this was fixed in 0.18/0.19.  If you additionally only have 2 replicas, you would end up with a corrupt block.
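
To make that failure mode concrete, here is a minimal sketch in Python. It is an illustration only and not HDFS code: the Replica type, both picker functions, and the node names X, Y, Z are hypothetical. It assumes a replication factor of 2 and a balancer-requested copy from X to Y that arrived corrupt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    node: str          # hypothetical datanode name
    checksum_ok: bool  # True if this replica passes its checksum


def pick_excess_naive(replicas: List[Replica]) -> Replica:
    """Buggy policy: drop the first replica without looking at checksums."""
    return replicas[0]


def pick_excess_safe(replicas: List[Replica]) -> Replica:
    """Safer policy: drop a corrupt replica first, if there is one."""
    for replica in replicas:
        if not replica.checksum_ok:
            return replica
    return replicas[0]


def good_copies_after_delete(
    replicas: List[Replica], picker: Callable[[List[Replica]], Replica]
) -> int:
    """Count the good replicas left once the picked excess replica is deleted."""
    victim = picker(replicas)
    remaining = [r for r in replicas if r is not victim]
    return sum(1 for r in remaining if r.checksum_ok)


# X and Z hold good copies; the new copy on Y is corrupt, so the block
# is over-replicated (3 copies for a target of 2) and one must go.
replicas = [Replica("X", True), Replica("Z", True), Replica("Y", False)]

naive_good = good_copies_after_delete(replicas, pick_excess_naive)
safe_good = good_copies_after_delete(replicas, pick_excess_safe)
print(naive_good, safe_good)
```

With the naive policy a good copy is deleted and the corrupt one survives; the safer policy removes the corrupt copy and keeps both good ones.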

You should be able to see this sequence in your NN logs.  Look to see when the NN realized a given block was first corrupted.  Pick your favorite corrupt block name, and grep out its history from the logs.
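
If grep is not handy, the same filtering can be sketched in a few lines of Python. This is only a rough equivalent: the function name block_history, the sample log lines, and the block names are made up for illustration, and real NN log lines will look different.

```python
import re
from typing import Iterable, List


def block_history(log_lines: Iterable[str], block_name: str) -> List[str]:
    """Return the lines that mention block_name, in the order they appear.

    The negative lookahead keeps blk_123 from also matching blk_1234.
    """
    pattern = re.compile(re.escape(block_name) + r"(?!\d)")
    return [line.rstrip("\n") for line in log_lines if pattern.search(line)]


# Made-up lines standing in for a NameNode log.
sample_log = [
    "2010-10-27 09:00:01 INFO example: added blk_123_1001 on node X\n",
    "2010-10-27 09:00:02 INFO example: added blk_1234_1002 on node Z\n",
    "2010-10-27 09:05:10 INFO example: blk_123 reported corrupt on node Y\n",
]

history = block_history(sample_log, "blk_123")
for line in history:
    print(line)
```

Against a real log you would pass an open file object instead of sample_log; the first line in the result that mentions corruption is the one to look at.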

However, let's say your cluster is corrupting data at a network level at a large scale.  Then, why would you see it only with the balancer running?

It's hard to see this as a plausible scenario, but, on the other hand, something happened.  It's possible it's just an outright coincidence.

Brian

On Oct 27, 2010, at 11:31 AM, Jones, Nick wrote:

> Hi Brian,
> 
> I'm seeing thousands of corrupt blocks reported (not under replication errors) from fsck.  We haven't been seeing corruption at all until I started running the balancer.
> 
> I did try Michael's comment about bouncing the cloud.  I originally saw ~25% under replication, but I still have about the same number of blocks showing up as corrupted after the replication leveled out.
> 
> Nick Jones
> 
> -----Original Message-----
> From: Brian Bockelman [mailto:bbockelm@cse.unl.edu] 
> Sent: Wednesday, October 27, 2010 9:48 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Large amount of corruption after balancer
> 
> Hi Nick,
> 
> What do you mean by "corruption" and how do you determine this?  The way the balancer is implemented, I would be surprised if it could cause corruption without you also seeing corruption day-to-day.
> 
> Brian
> 
> On Oct 27, 2010, at 9:01 AM, Jones, Nick wrote:
> 
>> Hi Patrick,
>> 
>> I first started by running fsck / which reported healthy.  I also know from jobtracker that nothing was running during the balancing time.  I filed a ticket with Cloudera, but I still appreciate any insight you or others may have.
>> 
>> Thanks again,
>> 
>> Nick Jones
>> 
>> -----Original Message-----
>> From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
>> Sent: Wednesday, October 27, 2010 8:54 AM
>> To: common-user@hadoop.apache.org
>> Subject: Re: Large amount of corruption after balancer
>> 
>> Nick,
>> 
>> The corruption may have been caused by running out of disk space. At that
>> point, even after rebalancing, you will still have corruption. Under normal
>> circumstances, balancing by itself should not result in corruption.
>> 
>> Regards,
>> 
>> - Patrick
>> 
>> On Wed, Oct 27, 2010 at 9:40 AM, Jones, Nick <ni...@amd.com> wrote:
>> 
>>> Hello everyone,
>>> 
>>> I just recently started running the balancer to fix job errors where a
>>> particular task runs out of local disk; however, I've noticed that I usually
>>> end up with a significant amount of corruption after it completes.  Has
>>> anyone else observed this behavior?
>>> 
>>> I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer
>>> completes successfully.
>>> 
>>> Thanks.
>>> 
>>> Nick Jones
>>> 
>

RE: Large amount of corruption after balancer

Posted by "Jones, Nick" <ni...@amd.com>.

Hi Brian,

I'm seeing thousands of corrupt blocks reported by fsck (these are not under-replication errors).  We hadn't seen any corruption at all until I started running the balancer.

I did try Michael's suggestion of bouncing the cloud.  I originally saw ~25% under-replication, but I still have about the same number of blocks showing up as corrupted after the replication leveled out.
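
For anyone wanting to track these two numbers separately across fsck runs, here is a small Python sketch that pulls them out of saved fsck output. The summary labels and layout in the sample are an assumption about what fsck prints and may differ between versions; the counts are invented for illustration and are not from this cluster.

```python
import re
from typing import Dict


def parse_fsck_summary(text: str) -> Dict[str, int]:
    """Extract the corrupt and under-replicated block counts, if present."""
    counts: Dict[str, int] = {}
    for label in ("Corrupt blocks", "Under-replicated blocks"):
        match = re.search(re.escape(label) + r":\s*(\d+)", text)
        if match:
            counts[label] = int(match.group(1))
    return counts


# Assumed summary layout with invented numbers.
sample_summary = (
    " Under-replicated blocks:\t250 (25.0 %)\n"
    " Corrupt blocks:\t\t3000\n"
)

counts = parse_fsck_summary(sample_summary)
print(counts)
```

Comparing the parsed counts before and after a restart shows whether only the under-replication figure recovers while the corrupt count stays put.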

Nick Jones

Re: Large amount of corruption after balancer

Posted by Brian Bockelman <bb...@cse.unl.edu>.

Hi Nick,

What do you mean by "corruption" and how do you determine this?  The way the balancer is implemented, I would be surprised if it could cause corruption without you also seeing corruption day-to-day.

Brian

RE: Large amount of corruption after balancer

Posted by Michael Segel <mi...@hotmail.com>.

Uhm...

I see that you're still running CDH2.
You may want to go to CDH3b3.

We tended to see corruption too, albeit in our HBase files. 
What happens if you bounce your cloud, wait 5-10 mins for things to sort themselves out and then try running an FSCK?


> From: nick.jones@amd.com
> To: common-user@hadoop.apache.org
> Date: Wed, 27 Oct 2010 09:01:15 -0500
> Subject: RE: Large amount of corruption after balancer
> 
> Hi Patrick,
> 
> I first started by running fsck / which reported healthy.  I also know from jobtracker that nothing was running during the balancing time.  I filed a ticket with Cloudera, but I still appreciate any insight you or others may have.
> 
> Thanks again,
> 
> Nick Jones
> 
> -----Original Message-----
> From: patrickangeles@gmail.com [mailto:patrickangeles@gmail.com] On Behalf Of Patrick Angeles
> Sent: Wednesday, October 27, 2010 8:54 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Large amount of corruption after balancer
> 
> Nick,
> 
> The corruption may have been caused by running out of disk space. At that
> point, even after rebalancing, you will still have corruption. Under normal
> circumstances, balancing by itself should not result in corruption.
> 
> Regards,
> 
> - Patrick
> 
> On Wed, Oct 27, 2010 at 9:40 AM, Jones, Nick <ni...@amd.com> wrote:
> 
> > Hello everyone,
> >
> > I just recently started running the balancer to fix job errors where a
> > particular task runs out of local disk; however, I've noticed that I usually
> > end up with a significant amount of corruption after it completes.  Has
> > anyone else observed this behavior?
> >
> > I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer
> > completes successfully.
> >
> > Thanks.
> >
> > Nick Jones
> >
>

RE: Large amount of corruption after balancer

Posted by "Jones, Nick" <ni...@amd.com>.

Hi Patrick,

I first started by running fsck / which reported healthy.  I also know from jobtracker that nothing was running during the balancing time.  I filed a ticket with Cloudera, but I still appreciate any insight you or others may have.

Thanks again,

Nick Jones

Re: Large amount of corruption after balancer

Posted by Patrick Angeles <pa...@cloudera.com>.

Nick,

The corruption may have been caused by running out of disk space. At that
point, even after rebalancing, you will still have corruption. Under normal
circumstances, balancing by itself should not result in corruption.

Regards,

- Patrick

Re: Large amount of corruption after balancer

Posted by Allen Wittenauer <aw...@linkedin.com>.

On Oct 27, 2010, at 6:40 AM, Jones, Nick wrote:

> Hello everyone,
> 
> I just recently started running the balancer to fix job errors where a particular task runs out of local disk; however, I've noticed that I usually end up with a significant amount of corruption after it completes.  Has anyone else observed this behavior?

	With apache, no.

> I'm using Cloudera's CDH2 distribution (0.20.1+169.89) and the balancer completes successfully.

	Sounds like a bug in their distro.

> 
> Thanks.
> 
> Nick Jones