Posted to common-user@hadoop.apache.org by Stas Oskin <st...@gmail.com> on 2009/04/10 18:11:45 UTC

Two degrees of replication reliability

Hi.

I know that there were some hard-to-find bugs with replication set to 2,
which caused data loss for HDFS users.

Was there any progress on these issues, and were any fixes introduced?

Regards.

Re: Two degrees of replication reliability

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 1:54 PM, Stas Oskin wrote:

> Actually, now I remember that you posted some time ago about your
> University losing about 300 files.
> So the situation has improved since then, I presume?

Yup!  The only files we lose now are due to multiple simultaneous
hardware losses.  Since January: 11 files to accidentally reformatting
2 nodes at once, and 35 to a night with 2 dead nodes.  Make no mistake -
HDFS with 2 replicas is *not* an archive-quality file system.  HDFS
does not replace tape storage for long-term storage.
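[Editorial note: a back-of-the-envelope sketch, not from the thread, of why
simultaneous failures are fatal at 2 replicas.  It assumes replicas are
placed on distinct nodes uniformly at random and a hypothetical 100-node
cluster; real HDFS rack-aware placement shifts the numbers, not the
conclusion.]

```python
from math import comb

def p_block_loss(n_nodes: int, n_failed: int, replicas: int) -> float:
    """Probability that a block with `replicas` copies, placed on distinct
    nodes uniformly at random, has *every* copy on one of the `n_failed`
    simultaneously lost nodes."""
    if n_failed < replicas:
        return 0.0  # not enough dead nodes to hold every replica
    return comb(n_failed, replicas) / comb(n_nodes, replicas)

# A hypothetical 100-node cluster losing 2 nodes overnight:
print(p_block_loss(100, 2, 2))   # ~2.0e-4 per block at 2 replicas
print(p_block_loss(100, 2, 3))   # 0.0 -- 3 replicas survive any 2 failures
```

With many blocks per file and many files, a per-block loss probability of
~2e-4 per double-failure event adds up quickly, which matches the kind of
losses described above.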

Brian

>
>
> 2009/4/10 Stas Oskin <st...@gmail.com>
>
>> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>>
>>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is  
>>> going to
>>> be even better.
>>>
>>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>>> Hadoop's fault since about January.
>>>
>>> Brian
>>>
>>>
>> And are you running 0.19.1?
>>
>> Regards.
>>


Re: Two degrees of replication reliability

Posted by Stas Oskin <st...@gmail.com>.
Actually, now I remember that you posted some time ago about your University
losing about 300 files.
So the situation has improved since then, I presume?

2009/4/10 Stas Oskin <st...@gmail.com>

> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>
>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going to
>> be even better.
>>
>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>> Hadoop's fault since about January.
>>
>> Brian
>>
>>
> And are you running 0.19.1?
>
> Regards.
>

Re: Two degrees of replication reliability

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 2:06 PM, Todd Lipcon wrote:

> On Fri, Apr 10, 2009 at 12:03 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:
>
>>
>>
>> 0.19.1 with a few convenience patches (mostly, they improve logging  
>> so the
>> local file system researchers can play around with our data  
>> patterns).
>>
>
> Hey Brian,
>
> I'm curious about this. Could you elaborate a bit on what kind of  
> stuff
> you're logging? I'm interested in what FS metrics you're looking at  
> and how
> you instrumented the code.
>
> -Todd

No clue what they're doing *with* the data, but I know what we've  
applied to HDFS to get the data.  We apply both of these patches:
http://issues.apache.org/jira/browse/HADOOP-5222
https://issues.apache.org/jira/browse/HADOOP-5625

This adds the duration and offset to each read.  Each read is then  
logged through the HDFS audit mechanisms.  We've been pulling the logs  
through the web interface and putting them back into HDFS, then  
processing them (actually, today we've been playing with log  
collection via Chukwa).
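[Editorial note: as a sketch of what that downstream processing might look
like, here is a minimal parser for key=value style audit-log lines.  The
sample line and the `offset`/`duration` field names are illustrative
assumptions; the actual format comes from the patches linked above.]

```python
import re

# Matches key=value fields such as "cmd=open" or "offset=1048576".
KV = re.compile(r"(\w+)=(\S+)")

def parse_audit_line(line: str) -> dict:
    """Extract all key=value fields from one audit-log line."""
    return dict(KV.findall(line))

# Hypothetical audit line with the extra offset/duration fields:
sample = ("2009-04-10 12:00:01,234 INFO FSNamesystem.audit: "
          "ugi=brian,users ip=/10.1.1.5 cmd=open src=/data/run42 "
          "offset=1048576 duration=87")
fields = parse_audit_line(sample)
print(fields["cmd"], fields["offset"], fields["duration"])
```

From records like these one can aggregate reads per file, offset
distributions, and per-user transfer volumes with a simple MapReduce or
script pass over the collected logs.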

There is a student who is looking at our cluster's I/O access
patterns, and there are a few folks who design metadata-caching
algorithms and love to see application traces.  Personally, I'm
interested in hooking the logfiles up to our I/O accounting system
so I can keep historical records of transfers and compare them to
our other file systems.

Brian



Re: Two degrees of replication reliability

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Apr 10, 2009 at 12:03 PM, Brian Bockelman <bb...@cse.unl.edu> wrote:

>
>
> 0.19.1 with a few convenience patches (mostly, they improve logging so the
> local file system researchers can play around with our data patterns).
>

Hey Brian,

I'm curious about this. Could you elaborate a bit on what kind of stuff
you're logging? I'm interested in what FS metrics you're looking at and how
you instrumented the code.

-Todd

Re: Two degrees of replication reliability

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 1:53 PM, Stas Oskin wrote:

> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>
>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is  
>> going to be
>> even better.
>>
>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>> Hadoop's fault since about January.
>>
>> Brian
>>
>>
> And are you running 0.19.1?

0.19.1 with a few convenience patches (mostly, they improve logging so  
the local file system researchers can play around with our data  
patterns).

Brian


Re: Two degrees of replication reliability

Posted by Stas Oskin <st...@gmail.com>.
2009/4/10 Brian Bockelman <bb...@cse.unl.edu>

> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going to be
> even better.
>
> We run about 300TB @ 2 replicas, and haven't had file loss that was
> Hadoop's fault since about January.
>
> Brian
>
>
And are you running 0.19.1?

Regards.

Re: Two degrees of replication reliability

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going  
to be even better.

We run about 300TB @ 2 replicas, and haven't had file loss that was  
Hadoop's fault since about January.

Brian

On Apr 10, 2009, at 11:11 AM, Stas Oskin wrote:

> Hi.
>
> I know that there were some hard-to-find bugs with replication set to 2,
> which caused data loss for HDFS users.
>
> Was there any progress on these issues, and were any fixes introduced?
>
> Regards.