Posted to common-user@hadoop.apache.org by Stas Oskin <st...@gmail.com> on 2009/04/10 18:11:45 UTC
Two degrees of replications reliability
Hi.
I know that there were some hard-to-find bugs with replication set to 2,
which caused data loss for HDFS users.
Has there been any progress on these issues, and have any fixes been
introduced?
Regards.
Re: Two degrees of replications reliability
Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 1:54 PM, Stas Oskin wrote:
> Actually, now I remember that you posted some time ago about your
> University
> losing about 300 files.
> So since then the situation has improved I presume?
Yup! The only files we lose now are due to multiple simultaneous
hardware losses. Since January: 11 files lost to accidentally reformatting 2
nodes at once, and 35 to a night with 2 dead nodes. Make no mistake -
HDFS with 2 replicas is *not* an archive-quality file system. HDFS
does not replace tape for long-term storage.
Brian
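[Editor's note: a back-of-the-envelope sketch of why two simultaneous dead nodes cost files at replication 2. The uniform random-placement assumption is a simplification not taken from the thread; real HDFS placement is rack-aware.]

```python
from math import comb

def prob_block_lost(total_nodes, failed_nodes, replicas=2):
    """Chance that one block loses all of its replicas when
    `failed_nodes` of `total_nodes` datanodes die at the same time,
    assuming replicas sit on distinct, uniformly random nodes."""
    if failed_nodes < replicas:
        return 0.0
    # Both replicas must land inside the failed set.
    return comb(failed_nodes, replicas) / comb(total_nodes, replicas)

# With 100 nodes and 2 simultaneous failures, each 2-replica block has
# a 1 / C(100, 2) = 1/4950 chance of losing both copies -- tiny per
# block, but across millions of blocks some loss is expected.
p = prob_block_lost(100, 2)
```

A single dead node never loses a 2-replica block; it is the second simultaneous failure, before re-replication catches up, that does the damage.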
>
>
> 2009/4/10 Stas Oskin <st...@gmail.com>
>
>> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>>
>>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is
>>> going to
>>> be even better.
>>>
>>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>>> Hadoop's fault since about January.
>>>
>>> Brian
>>>
>>>
>> And you're running 0.19.1?
>>
>> Regards.
>>
Re: Two degrees of replications reliability
Posted by Stas Oskin <st...@gmail.com>.
Actually, now I remember that you posted some time ago about your University
losing about 300 files.
So since then the situation has improved I presume?
2009/4/10 Stas Oskin <st...@gmail.com>
> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>
>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going to
>> be even better.
>>
>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>> Hadoop's fault since about January.
>>
>> Brian
>>
>>
> And you're running 0.19.1?
>
> Regards.
>
Re: Two degrees of replications reliability
Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 2:06 PM, Todd Lipcon wrote:
> On Fri, Apr 10, 2009 at 12:03 PM, Brian Bockelman <bbockelm@cse.unl.edu
> >wrote:
>
>>
>>
>> 0.19.1 with a few convenience patches (mostly, they improve logging
>> so the
>> local file system researchers can play around with our data
>> patterns).
>>
>
> Hey Brian,
>
> I'm curious about this. Could you elaborate a bit on what kind of
> stuff
> you're logging? I'm interested in what FS metrics you're looking at
> and how
> you instrumented the code.
>
> -Todd
No clue what they're doing *with* the data, but I know what we've
applied to HDFS to get the data. We apply both of these patches:
http://issues.apache.org/jira/browse/HADOOP-5222
https://issues.apache.org/jira/browse/HADOOP-5625
This adds the duration and offset to each read. Each read is then
logged through the HDFS audit mechanisms. We've been pulling the logs
through the web interface and putting them back into HDFS, then
processing them (actually, today we've been playing with log
collection via Chukwa).
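[Editor's note: a minimal sketch of the kind of post-processing described above, aggregating per-file read traces from audit-style log lines. The key=value field names (cmd, src, offset, duration) are assumptions for illustration, not the actual format produced by the HADOOP-5222/HADOOP-5625 patches.]

```python
import re
from collections import defaultdict

# Hypothetical audit-log lines; real 0.19-era audit logs differ.
SAMPLE = [
    "cmd=open src=/data/a offset=0 duration=12",
    "cmd=open src=/data/a offset=65536 duration=9",
    "cmd=open src=/data/b offset=0 duration=30",
]

FIELD = re.compile(r"(\w+)=(\S+)")

def summarize(lines):
    """Aggregate per-file read counts and total read time (ms)."""
    stats = defaultdict(lambda: {"reads": 0, "total_ms": 0})
    for line in lines:
        kv = dict(FIELD.findall(line))
        if kv.get("cmd") != "open":
            continue  # only count read (open) operations
        s = stats[kv["src"]]
        s["reads"] += 1
        s["total_ms"] += int(kv["duration"])
    return dict(stats)

summary = summarize(SAMPLE)
```

Traces like these are what make the access-pattern and metadata-caching studies mentioned below possible: offset plus duration per read is enough to reconstruct sequential-versus-random behavior per file.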
There is a student who is looking at our cluster's I/O access
patterns, and there are a few folks who work on metadata-caching
algorithms and love to see application traces. Personally,
I'm interested in hooking the logfiles up to our I/O accounting system
so I can keep historical records of transfers and compare it to our
other file systems.
Brian
Re: Two degrees of replications reliability
Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Apr 10, 2009 at 12:03 PM, Brian Bockelman <bb...@cse.unl.edu>wrote:
>
>
> 0.19.1 with a few convenience patches (mostly, they improve logging so the
> local file system researchers can play around with our data patterns).
>
Hey Brian,
I'm curious about this. Could you elaborate a bit on what kind of stuff
you're logging? I'm interested in what FS metrics you're looking at and how
you instrumented the code.
-Todd
Re: Two degrees of replications reliability
Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 10, 2009, at 1:53 PM, Stas Oskin wrote:
> 2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
>
>> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is
>> going to be
>> even better.
>>
>> We run about 300TB @ 2 replicas, and haven't had file loss that was
>> Hadoop's fault since about January.
>>
>> Brian
>>
>>
> And you're running 0.19.1?
0.19.1 with a few convenience patches (mostly, they improve logging so
the local file system researchers can play around with our data
patterns).
Brian
Re: Two degrees of replications reliability
Posted by Stas Oskin <st...@gmail.com>.
2009/4/10 Brian Bockelman <bb...@cse.unl.edu>
> Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going to be
> even better.
>
> We run about 300TB @ 2 replicas, and haven't had file loss that was
> Hadoop's fault since about January.
>
> Brian
>
>
And you're running 0.19.1?
Regards.
Re: Two degrees of replications reliability
Posted by Brian Bockelman <bb...@cse.unl.edu>.
Most of the issues were resolved in 0.19.1 -- I think 0.20.0 is going
to be even better.
We run about 300TB @ 2 replicas, and haven't had file loss that was
Hadoop's fault since about January.
Brian
On Apr 10, 2009, at 11:11 AM, Stas Oskin wrote:
> Hi.
>
> I know that there were some hard-to-find bugs with replication set
> to 2,
> which caused data loss for HDFS users.
>
> Has there been any progress on these issues, and have any fixes
> been
> introduced?
>
> Regards.