Posted to hdfs-dev@hadoop.apache.org by Eli Collins <el...@cloudera.com> on 2012/03/21 01:37:31 UTC

[DISCUSS] Remove append?

Hey gang,

I'd like to get people's thoughts on the following proposal. I think
we should consider removing append from HDFS.

Where we are today.. append was added in the 0.17-19 releases
(HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
issues. It and sync were re-designed, re-implemented, and shipped in
0.21.0 (HDFS-265). To my knowledge, there has been no real production
use. Anecdotally people who worked on branch-20-append have told me
they think the new trunk code is substantially less well-tested than
the branch-20-append code (at least for sync, append was never well
tested). It has certainly gotten way less pounding from HBase users.
The design however, is much improved, and people think we can get
hsync (and append) stabilized in trunk (mostly testing and bug
fixing).

Rationale follows..

Append does not seem to be an important requirement; hflush was. There
has not been much demand for append, from users or downstream
projects. Hadoop 1.x does not have a working append implementation
(see HDFS-3120; the branch-20-append work focused on sync, not on
getting append working), it is not enabled by default, and downstream
projects will want to support Hadoop 1.x releases for years, so most
will not introduce dependencies on append anyway. This is not to say
demand does not exist, just that if it does, it's been much smaller
than the demand for security, sync, HA, backwards compatible RPC, etc.
This probably explains why, over 5 years after the original
implementation started, we don't have a stable release with append.

Append introduces non-trivial design and code complexity, which is not
worth the cost if we don't have real users. Removing append means we
have the property that HDFS blocks, when finalized, are immutable.
This significantly simplifies the design and code, which significantly
simplifies the implementation of other features like snapshots,
HDFS-level caching, dedupe, etc.

The vast majority of the HDFS-265 effort is still leveraged w/o
append. The new data durability and read consistency behavior was the
key part.

GFS, which HDFS' design is based on, has append (and atomic record
append), so obviously a workable design does not preclude append.
However, we also should not ape the GFS feature set simply because it
exists. I've had conversations with people who worked on GFS who
regret adding record append (see also
http://queue.acm.org/detail.cfm?id=1594206). In short, unless append
is a real priority for our users, I think we should focus our energy
elsewhere.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 1:57 PM, Sanjay Radia <sa...@hortonworks.com> wrote:
> On Tue, Mar 20, 2012 at 5:37 PM, Eli Collins <el...@cloudera.com> wrote:
>
>>
>>
>> Append introduces non-trivial design and code complexity, which is not
>> worth the cost if we don't have real users.
>
> The bulk of the complexity of HDFS-265 ("the new Append") was around
> hflush, concurrent readers, the pipeline, etc. The code and complexity for
> appending to a previously closed file were not that large.
>

And we'd still leverage that work.  Which is not to say that append
isn't complicated. There were a fair number of append bugs that were
found in branch-20-append that we think are present in the new append
implementation (not sure if there are jiras for all of them).

Also, append + truncate removes the current invariant that we
maintain, e.g. around visible length. So append opens the door to lots
of additional complexity. We could decide to keep append but not add
truncate, but I suspect that will be hard because once you open the
door to a lot of new use cases it's hard to close it. The larger
issue is how simple we'd like to keep HDFS, and how many use cases
we'd like to grow it to cover.

>
>
>> Removing append means we
>> have the property that HDFS blocks, when finalized, are immutable.
>> This significantly simplifies the design and code, which significantly
>> simplifies the implementation of other features like snapshots,
>> HDFS-level caching, dedupe, etc.
>>
>
> While snapshots are challenging with append, the problem is solvable - the
> snapshot needs to remember the length of the file. (We have a working
> prototype - we will be posting the design and the code soon.)
>

Will check it out. When I read "Snapshots in Hadoop Distributed File
System" it looked like the bulk of the complexity was due to the
protocol for append:
http://www.cs.berkeley.edu/~sameerag/hdfs-snapshots.pdf

>
> I agree that the notion of an immutable file is useful since it lets the
> system and tools optimize certain things.  A Xerox PARC file system in the
> '80s had this feature, which the system exploited. I would support adding the
> notion of an immutable file to Hadoop.
>

Good point, we could leverage this property on a per-file, rather than
per-filesystem basis.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Sanjay Radia <sa...@hortonworks.com>.
On Tue, Mar 20, 2012 at 5:37 PM, Eli Collins <el...@cloudera.com> wrote:

>
>
> Append introduces non-trivial design and code complexity, which is not
> worth the cost if we don't have real users.

The bulk of the complexity of HDFS-265 ("the new Append") was around
hflush, concurrent readers, the pipeline, etc. The code and complexity for
appending to a previously closed file were not that large.



> Removing append means we
> have the property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly
> simplifies the implementation of other features like snapshots,
> HDFS-level caching, dedupe, etc.
>

While snapshots are challenging with append, the problem is solvable - the
snapshot needs to remember the length of the file. (We have a working
prototype - we will be posting the design and the code soon.)


I agree that the notion of an immutable file is useful since it lets the
system and tools optimize certain things.  A Xerox PARC file system in the
'80s had this feature, which the system exploited. I would support adding the
notion of an immutable file to Hadoop.


sanjay

Re: [DISCUSS] Remove append?

Posted by Daryn Sharp <da...@yahoo-inc.com>.
I think Yarn/MR might be able to benefit from the ability to append to logs in HDFS.  It might reduce some of the after-the-fact copying of logs into HDFS.

Daryn


On Mar 22, 2012, at 8:18 PM, Dhruba Borthakur wrote:

> I think "append" would be useful. But not precisely sure which applications
> would use it. I would vote to keep the code though and not remove it.
> 
> -dhruba
> 
> On Thu, Mar 22, 2012 at 5:49 PM, Eli Collins <el...@cloudera.com> wrote:
> 
>> On Thu, Mar 22, 2012 at 5:03 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>>> @Eli, Removing a feature would simplify the design and code.  I think
>> this is a generally true statement but not specific to Append.  The
>> question is whether Append is useless and it should be removed?  I think it
>> is clear from this email thread that the answer is no.
>> 
>> @Nicholas, no one is saying append is "useless and should be removed."
>> The discussion is perhaps a little more subtle than you've understood
>> it to be.
>> 
>> If there are a lot of good use cases I'm all for it, I just don't see
>> downstream projects using it any time soon (which is not to say they
>> don't want it, just that they can't depend on something not in 1.x),
>> and I haven't seen much demand.  I wanted to hear from others if they
>> had.  When I brought it up with a room of hdfs developers from 3
>> different companies no one felt strongly. And so far only a handful of
>> people have chimed in, I actually thought more would.
>> 
>> Thanks,
>> Eli
>> 
> 
> 
> 
> -- 
> Subscribe to my posts at http://www.facebook.com/dhruba


Re: [DISCUSS] Remove append?

Posted by CHANG Lei <ch...@gmail.com>.
Append is already useful for our current project: it saves us from
having to implement extra tricky logic to regularly compact a large
number of small files.
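For context, the pattern Lei describes might look roughly like the
following sketch. This is only an illustration: plain local files stand
in for HDFS, the function names are hypothetical, and on HDFS the
single-file variant would reopen the file for writing via
FileSystem.append().

```python
import os

def write_batch_no_append(dirpath, batch_id, records):
    # Without append: each incoming batch lands in its own small file,
    # and a separate compaction job must merge them later.
    path = os.path.join(dirpath, "batch-%d.log" % batch_id)
    with open(path, "w") as f:
        f.writelines(r + "\n" for r in records)

def write_batch_with_append(path, records):
    # With append: each batch is appended to one growing file,
    # so no compaction pass is needed.
    with open(path, "a") as f:
        f.writelines(r + "\n" for r in records)
```

The trade-off is exactly the one under discussion: the appending writer
avoids the small-files problem, but it requires the filesystem to allow
reopening a closed file for write.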

Thanks
Lei

On Thu, Mar 22, 2012 at 6:18 PM, Dhruba Borthakur <dh...@gmail.com> wrote:
> I think "append" would be useful. But not precisely sure which applications
> would use it. I would vote to keep the code though and not remove it.
>
> -dhruba
>
> On Thu, Mar 22, 2012 at 5:49 PM, Eli Collins <el...@cloudera.com> wrote:
>
>> On Thu, Mar 22, 2012 at 5:03 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>> > @Eli, Removing a feature would simplify the design and code.  I think
>> this is a generally true statement but not specific to Append.  The
>> question is whether Append is useless and it should be removed?  I think it
>> is clear from this email thread that the answer is no.
>>
>> @Nicholas, no one is saying append is "useless and should be removed."
>>  The discussion is perhaps a little more subtle than you've understood
>> it to be.
>>
>> If there are a lot of good use cases I'm all for it, I just don't see
>> downstream projects using it any time soon (which is not to say they
>> don't want it, just that they can't depend on something not in 1.x),
>> and I haven't seen much demand.  I wanted to hear from others if they
>> had.  When I brought it up with a room of hdfs developers from 3
>> different companies no one felt strongly. And so far only a handful of
>> people have chimed in, I actually thought more would.
>>
>> Thanks,
>> Eli
>>
>
>
>
> --
> Subscribe to my posts at http://www.facebook.com/dhruba

Re: [DISCUSS] Remove append?

Posted by Dhruba Borthakur <dh...@gmail.com>.
I think "append" would be useful. But not precisely sure which applications
would use it. I would vote to keep the code though and not remove it.

-dhruba

On Thu, Mar 22, 2012 at 5:49 PM, Eli Collins <el...@cloudera.com> wrote:

> On Thu, Mar 22, 2012 at 5:03 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
> > @Eli, Removing a feature would simplify the design and code.  I think
> this is a generally true statement but not specific to Append.  The
> question is whether Append is useless and it should be removed?  I think it
> is clear from this email thread that the answer is no.
>
> @Nicholas, no one is saying append is "useless and should be removed."
>  The discussion is perhaps a little more subtle than you've understood
> it to be.
>
> If there are a lot of good use cases I'm all for it, I just don't see
> downstream projects using it any time soon (which is not to say they
> don't want it, just that they can't depend on something not in 1.x),
> and I haven't seen much demand.  I wanted to hear from others if they
> had.  When I brought it up with a room of hdfs developers from 3
> different companies no one felt strongly. And so far only a handful of
> people have chimed in, I actually thought more would.
>
> Thanks,
> Eli
>



-- 
Subscribe to my posts at http://www.facebook.com/dhruba

Re: [DISCUSS] Remove append?

Posted by Tsz Wo Sze <sz...@yahoo.com>.
Hi Colin,

Please feel free to file JIRAs if you see unit test failures.

Let's continue the immutable file discussion on HDFS-3154.

Nicholas




________________________________
 From: Colin McCabe <cm...@alumni.cmu.edu>
To: hdfs-dev@hadoop.apache.org; Tsz Wo Sze <sz...@yahoo.com> 
Sent: Monday, March 26, 2012 2:31 PM
Subject: Re: [DISCUSS] Remove append?
 
On Mon, Mar 26, 2012 at 1:55 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>> Just one comment: If we do decide to keep append in, we should get it
>> to be actually stable and usable.  In my opinion, this should
>> definitely happen before adding any new operations.
>
> @Colin, append is currently stable and, of course, usable.  Many people in different organizations have tested it
> at small and large scale.  However, it is not yet in a stable release and so it is not yet heavily used.

The append unit test failed on me recently on Jenkins.  It's possible
that this was due to a Jenkins timeout, or something, but I assumed it
was due to instability at the time.  If it happens again, I'll be sure
to check the backtrace and file a JIRA if needed.

>> I agree that the notion of an immutable file is useful since it lets the
>> system and tools optimize certain things.  A xerox-parc file system in the
>> 80s had this feature that the system exploited. I would support adding the
>> notion of an immutable file to Hadoop.

I think Eli was hoping that making files immutable would make the
system simpler, and hopefully, less buggy.  You won't get that benefit
if only certain files are immutable.  In fact, quite the contrary--
you'll just be adding more complexity.

I'd also like to see what the "certain things" are that having certain
files, but not others, be immutable would allow you to optimize.  The
thread you linked to from the JIRA has no information on this.

I am aware of at least two "filesystems" (in the loose sense of the
word) that have immutable files.  One is Venti from Plan9, and the
other is git, by Linus Torvalds.  Both of them are significantly
simpler because of their invariant that files cannot change.  However,
both of them are append-only, meaning that files can never be deleted.
This seems unsuitable for the HDFS use case, and in fact, I see no
reason to believe that having some, but not all, files be immutable
would provide any benefit.

Feel free to prove me wrong if you think of something, though!

cheers,
Colin


>
> @Sanjay, I filed HDFS-3154.
>
> @Eli and others, it turns out that the discussion is very useful!  Thanks.
>
> Nicholas

Re: [DISCUSS] Remove append?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Mon, Mar 26, 2012 at 1:55 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>> Just one comment: If we do decide to keep append in, we should get it
>> to be actually stable and usable.  In my opinion, this should
>> definitely happen before adding any new operations.
>
> @Colin, append is currently stable and, of course, usable.  Many people in different organizations have tested it
> at small and large scale.  However, it is not yet in a stable release and so it is not yet heavily used.

The append unit test failed on me recently on Jenkins.  It's possible
that this was due to a Jenkins timeout, or something, but I assumed it
was due to instability at the time.  If it happens again, I'll be sure
to check the backtrace and file a JIRA if needed.

>> I agree that the notion of an immutable file is useful since it lets the
>> system and tools optimize certain things.  A xerox-parc file system in the
>> 80s had this feature that the system exploited. I would support adding the
>> notion of an immutable file to Hadoop.

I think Eli was hoping that making files immutable would make the
system simpler, and hopefully, less buggy.  You won't get that benefit
if only certain files are immutable.  In fact, quite the contrary--
you'll just be adding more complexity.

I'd also like to see what the "certain things" are that having certain
files, but not others, be immutable would allow you to optimize.  The
thread you linked to from the JIRA has no information on this.

I am aware of at least two "filesystems" (in the loose sense of the
word) that have immutable files.  One is Venti from Plan9, and the
other is git, by Linus Torvalds.  Both of them are significantly
simpler because of their invariant that files cannot change.  However,
both of them are append-only, meaning that files can never be deleted.
This seems unsuitable for the HDFS use case, and in fact, I see no
reason to believe that having some, but not all, files be immutable
would provide any benefit.

Feel free to prove me wrong if you think of something, though!

cheers,
Colin


>
> @Sanjay, I filed HDFS-3154.
>
> @Eli and others, it turns out that the discussion is very useful!  Thanks.
>
> Nicholas

Re: [DISCUSS] Remove append?

Posted by Tsz Wo Sze <sz...@yahoo.com>.
> Just one comment: If we do decide to keep append in, we should get it
> to be actually stable and usable.  In my opinion, this should
> definitely happen before adding any new operations.

@Colin, append is currently stable and, of course, usable.  Many people in different organizations have tested it at small and large scale.  However, it is not yet in a stable release and so it is not yet heavily used.

> I agree that the notion of an immutable file is useful since it lets the
> system and tools optimize certain things.  A xerox-parc file system in the
> 80s had this feature that the system exploited. I would support adding the
> notion of an immutable file to Hadoop.

@Sanjay, I filed HDFS-3154.

@Eli and others, it turns out that the discussion is very useful!  Thanks.

Nicholas

Re: [DISCUSS] Remove append?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Thu, Mar 22, 2012 at 5:49 PM, Eli Collins <el...@cloudera.com> wrote:
> On Thu, Mar 22, 2012 at 5:03 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>> @Eli, Removing a feature would simplify the design and code.  I think this is a generally true statement but not specific to Append.  The question is whether Append is useless and it should be removed?  I think it is clear from this email thread that the answer is no.
>
> @Nicholas, no one is saying append is "useless and should be removed."
>  The discussion is perhaps a little more subtle than you've understood
> it to be.
>
> If there are a lot of good use cases I'm all for it, I just don't see
> downstream projects using it any time soon (which is not to say they
> don't want it, just that they can't depend on something not in 1.x),
> and I haven't seen much demand.  I wanted to hear from others if they
> had.  When I brought it up with a room of hdfs developers from 3
> different companies no one felt strongly. And so far only a handful of
> people have chimed in, I actually thought more would.

Just one comment: If we do decide to keep append in, we should get it
to be actually stable and usable.  In my opinion, this should
definitely happen before adding any new operations.

Colin

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 5:03 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
> @Eli, Removing a feature would simplify the design and code.  I think this is a generally true statement but not specific to Append.  The question is whether Append is useless and it should be removed?  I think it is clear from this email thread that the answer is no.

@Nicholas, no one is saying append is "useless and should be removed."
 The discussion is perhaps a little more subtle than you've understood
it to be.

If there are a lot of good use cases I'm all for it, I just don't see
downstream projects using it any time soon (which is not to say they
don't want it, just that they can't depend on something not in 1.x),
and I haven't seen much demand.  I wanted to hear from others if they
had.  When I brought it up with a room of hdfs developers from 3
different companies no one felt strongly. And so far only a handful of
people have chimed in, I actually thought more would.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Tsz Wo Sze <sz...@yahoo.com>.
@Eli, removing a feature would simplify the design and code.  I think this is a generally true statement but not specific to append.  The question is whether append is useless and should be removed, and I think it is clear from this email thread that the answer is no.

@Milind, I agree with you.  BTW, we are proposing truncate on closed files, so it has nothing to do with visible length.


Regards,

Nicholas





----- Original Message -----
From: "Milind.Bhandarkar@emc.com" <Mi...@emc.com>
To: hdfs-dev@hadoop.apache.org; szetszwo@yahoo.com
Cc: 
Sent: Thursday, March 22, 2012 4:27 PM
Subject: Re: [DISCUSS] Remove append?

Eli,

I think by "current definition of visible length", you mean that once a
client opens a file and gets block list, it will always be able to read up
to the length at open.

However, correct me if I am wrong, but this definition is already
violated if the file is deleted after open.

So, truncate does add some complexity, but not a whole lot. If client gets
an EOF before length at open, it must retry to see if the new visible
length is different (rather than to see if the file does not exist
anymore).

Right ?

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



On 3/22/12 4:03 PM, "Eli Collins" <el...@cloudera.com> wrote:

>On Thu, Mar 22, 2012 at 3:57 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>>> Do you think having the invariant that blocks are not mutated would
>>> significantly simplify the design?
>>
>> No.  As mentioned in my previous email and others, the complexity is in
>>hflush.  Once we have hflush, append is straightforward.
>
>I understand that append is a small delta once you have hflush, what
>I'm saying is that the overall design of the file system is
>significantly simplified if you can assume blocks are not mutated. Eg
>see the way truncate is going to interact with the current definition
>of visible length (it violates it). Resolving issues like that are
>non-trivial.
>
>Thanks,
>Eli
>

Re: [DISCUSS] Remove append?

Posted by Scott Carey <sc...@richrelevance.com>.

On 3/22/12 5:41 PM, "Eli Collins" <el...@cloudera.com> wrote:

>On Thu, Mar 22, 2012 at 4:27 PM,  <Mi...@emc.com> wrote:
>> Eli,
>>
>> I think by "current definition of visible length", you mean that once a
>> client opens a file and gets block list, it will always be able to read
>>up
>> to the length at open.
>>
>
>I was thinking of the definition from the design doc. See my last
>comment on HDFS-2288, part of the confusion is that we're using the
>same name for two different things.
>
>> However, correct me if I am wrong, but this definition is already
>> violated, if file is deleted after open.
>
>I think you're right.

Another thing that could be fixed with COW blocks and MVCC principles.  If
a file were opened, then deleted, the blocks of the opened file would still
be visible to that client, but no new ones.

>
>> So, truncate does add some complexity, but not a whole lot. If client
>>gets
>> an EOF before length at open, it must retry to see if the new visible
>> length is different (rather than to see if the file does not exist
>> anymore).
>>
>> Right ?
>>
>
>Makes sense. I was thinking you were talking about truncate on open
>files, which would be harder.  You can already truncate a file on open, you
>just can't choose the offset you want to truncate at (the NN
>implements this by deleting the file).
>
>Thanks,
>Eli
>
>>
>> ---
>> Milind Bhandarkar
>> Greenplum Labs, EMC
>> (Disclaimer: Opinions expressed in this email are those of the author,
>>and
>> do not necessarily represent the views of any organization, past or
>> present, the author might be affiliated with.)
>>
>>
>>
>> On 3/22/12 4:03 PM, "Eli Collins" <el...@cloudera.com> wrote:
>>
>>>On Thu, Mar 22, 2012 at 3:57 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>>>>> Do you think having the invariant that blocks are not mutated would
>>>>> significantly simplify the design?
>>>>
>>>> No.  As mentioned in my previous email and others, the complexity is
>>>>in
>>>>hflush.  Once we have hflush, append is straightforward.
>>>
>>>I understand that append is a small delta once you have hflush, what
>>>I'm saying is that the overall design of the file system is
>>>significantly simplified if you can assume blocks are not mutated. Eg
>>>see the way truncate is going to interact with the current definition
>>>of visible length (it violates it). Resolving issues like that are
>>>non-trivial.
>>>
>>>Thanks,
>>>Eli
>>>
>>


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 4:27 PM,  <Mi...@emc.com> wrote:
> Eli,
>
> I think by "current definition of visible length", you mean that once a
> client opens a file and gets block list, it will always be able to read up
> to the length at open.
>

I was thinking of the definition from the design doc. See my last
comment on HDFS-2288, part of the confusion is that we're using the
same name for two different things.

> However, correct me if I am wrong, but this definition is already
> violated, if file is deleted after open.

I think you're right.

> So, truncate does add some complexity, but not a whole lot. If client gets
> an EOF before length at open, it must retry to see if the new visible
> length is different (rather than to see if the file does not exist
> anymore).
>
> Right ?
>

Makes sense. I was thinking you were talking about truncate on open
files, which would be harder.  You can already truncate a file on open, you
just can't choose the offset you want to truncate at (the NN
implements this by deleting the file).
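The distinction can be sketched with local files standing in for HDFS
(the helper names here are hypothetical, purely for illustration): what
exists today is an all-or-nothing truncate at open, while a true
truncate operation takes an offset.

```python
import os

def truncate_on_open(path):
    # What HDFS supports today: re-creating a file on open discards all
    # prior content (the NN implements this by deleting the old file).
    open(path, "w").close()

def truncate_to_offset(path, offset):
    # What a real truncate operation would add: keep the first `offset`
    # bytes and discard the rest.
    with open(path, "r+b") as f:
        f.truncate(offset)
```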

Thanks,
Eli

>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 3/22/12 4:03 PM, "Eli Collins" <el...@cloudera.com> wrote:
>
>>On Thu, Mar 22, 2012 at 3:57 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>>>> Do you think having the invariant that blocks are not mutated would
>>>> significantly simplify the design?
>>>
>>> No.  As mentioned in my previous email and others, the complexity is in
>>>hflush.  Once we have hflush, append is straightforward.
>>
>>I understand that append is a small delta once you have hflush, what
>>I'm saying is that the overall design of the file system is
>>significantly simplified if you can assume blocks are not mutated. Eg
>>see the way truncate is going to interact with the current definition
>>of visible length (it violates it). Resolving issues like that are
>>non-trivial.
>>
>>Thanks,
>>Eli
>>
>

Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
Eli,

I think by "current definition of visible length", you mean that once a
client opens a file and gets block list, it will always be able to read up
to the length at open.

However, correct me if I am wrong, but this definition is already
violated if the file is deleted after open.

So, truncate does add some complexity, but not a whole lot. If client gets
an EOF before length at open, it must retry to see if the new visible
length is different (rather than to see if the file does not exist
anymore).

Right ?
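A rough sketch of that client-side logic, using a plain local file in
place of an HDFS stream (the helper name is hypothetical, not an HDFS
API): on a short read, the client re-checks the file to distinguish
truncation from deletion.

```python
import os

def read_with_retry(path, length_at_open):
    """Read up to length_at_open bytes; on EOF before that length,
    re-check the file instead of assuming it was deleted."""
    with open(path, "rb") as f:
        data = f.read(length_at_open)
    if len(data) < length_at_open:
        # Hit EOF early: the file may have been truncated, not deleted.
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        new_visible_length = os.path.getsize(path)
        return data, new_visible_length  # caller may retry from here
    return data, length_at_open
```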

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



On 3/22/12 4:03 PM, "Eli Collins" <el...@cloudera.com> wrote:

>On Thu, Mar 22, 2012 at 3:57 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>>> Do you think having the invariant that blocks are not mutated would
>>> significantly simplify the design?
>>
>> No.  As mentioned in my previous email and others, the complexity is in
>>hflush.  Once we have hflush, append is straightforward.
>
>I understand that append is a small delta once you have hflush, what
>I'm saying is that the overall design of the file system is
>significantly simplified if you can assume blocks are not mutated. Eg
>see the way truncate is going to interact with the current definition
>of visible length (it violates it). Resolving issues like that are
>non-trivial.
>
>Thanks,
>Eli
>


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 3:57 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>> Do you think having the invariant that blocks are not mutated would
>> significantly simplify the design?
>
> No.  As mentioned in my previous email and others, the complexity is in hflush.  Once we have hflush, append is straightforward.

I understand that append is a small delta once you have hflush, what
I'm saying is that the overall design of the file system is
significantly simplified if you can assume blocks are not mutated. Eg
see the way truncate is going to interact with the current definition
of visible length (it violates it). Resolving issues like that are
non-trivial.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Tsz Wo Sze <sz...@yahoo.com>.
> Do you think having the invariant that blocks are not mutated would
> significantly simplify the design?

No.  As mentioned in my previous email and others, the complexity is in hflush.  Once we have hflush, append is straightforward.

Nicholas


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 1:31 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>
> Some of the information in the email is not correct.  Let me clarify them.
>
>> Where we are today.. append was added in the 0.17-19 releases
>> (HADOOP-1700) . . .
>
> We never had append/sync in 0.17.  Sync was added in 0.18 but not
> append; append was added in 0.19.  By append/sync above, I mean the
> implementation from HADOOP-1700.  We also have HDFS-265, the new
> append/hflush.  Below are the details.
>
> Versions         Features
> <= 0.17:         no sync/append
> 0.18:            1700 sync
> 0.19.0:          1700 append
> 0.19.1, 0.20:    1700 append disabled
> 0.20-append:     append branch used by Facebook
> 0.20.205.0:      merged 1700 append to 0.20
> >= 0.21:         265 append/hflush
>

Thanks for fleshing out the specifics, I put "17-19" to indicate that
parts went in over a series of releases.

>> . . . To my knowledge, there has
>> been no real production use. . .
>
> The reason there is no production use today is simply that append is
> not yet in a stable release.  Besides, that does not mean append is
> not useful.
>

Agree, not saying it isn't useful.  "Usefulness" is necessary but not
sufficient; there are plenty of useful things we may not want to put
in HDFS.

>> . . . The design however, is much improved, and people think we can
>> get hsync (and append) stabilized in trunk (mostly testing and bug
>> fixing).
>
> hsync is not yet implemented.  I think you may mean hflush.
>

Yup, good catch, I meant hflush. (For those following along, hsync is
implemented, just not according to the design, since today it just
calls hflush.)

>> . . . This probably explains why, over 5 years after the original
>> implementation started, we don't have a stable release with append.
>
> HADOOP-1700 was committed on July 25, 2008.  I don’t know how it could
> be “over 5 years”.  It is well known that append from the 0.20.x
> releases is not stable and hence probably not used.  It is not the
> case that we don’t have a stable release because append is not stable.
>
>> Append introduces non-trivial design and code complexity, which is
>> not worth the cost if we don't have real users. . . .
>
> I don’t agree.  The non-trivial design and code complexity come from
> hflush but not append.  Once we have hflush, append is straightforward.
> Roughly speaking, the append work is about 10% of the entire
> append/hflush work.

Do you think having the invariant that blocks are not mutated would
significantly simplify the design?

Thanks,
Eli

>
> Moreover, there are real users/use cases, as mentioned by Dave and
> Milind.
>
> The jira that you have created to split the flag into hflush supported
> and append supported is a good idea.  Folks who do not need append,
> but need hflush, can still disable append.
>
> Regards,
> Nicholas
>
>
>
> ________________________________
>  From: Eli Collins <el...@cloudera.com>
> To: hdfs-dev@hadoop.apache.org
> Sent: Tuesday, March 20, 2012 5:37 PM
> Subject: [DISCUSS] Remove append?
>
> Hey gang,
>
> I'd like to get people's thoughts on the following proposal. I think
> we should consider removing append from HDFS.
>
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
> issues. It and sync were re-designed, re-implemented, and shipped in
> 21.0 (HDFS-265). To my knowledge, there has been no real production
> use. Anecdotally people who worked on branch-20-append have told me
> they think the new trunk code is substantially less well-tested than
> the branch-20-append code (at least for sync, append was never well
> tested). It has certainly gotten way less pounding from HBase users.
> The design however, is much improved, and people think we can get
> hsync (and append) stabilized in trunk (mostly testing and bug
> fixing).
>
> Rationale follows..
>
> Append does not seem to be an important requirement, hflush was. There
> has not been much demand for append, from users or downstream
> projects. Because Hadoop 1.x does not have a working append
> implementation (see HDFS-3120; the branch-20-append work was focused
> on sync, not on getting append working), one which is not enabled by
> default, and because downstream projects will want to support Hadoop
> 1.x releases for years, most will not introduce dependencies on append
> anyway. This is not to say demand does not exist, just that if it
> does, it's been much smaller than security, sync, HA, backwards
> compatible RPC, etc. This probably explains why, over 5 years after
> the original implementation started, we don't have a stable release
> with append.
>
> Append introduces non-trivial design and code complexity, which is not
> worth the cost if we don't have real users. Removing append means we
> have the property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly
> simplifies the implementation of other features like snapshots,
> HDFS-level caching, dedupe, etc.
>
> The vast majority of the HDFS-265 effort is still leveraged w/o
> append. The new data durability and read consistency behavior was the
> key part.
>
> GFS, which HDFS' design is based on, has append (and atomic record
> append) so obviously a workable design does not preclude append.
> However we also should not ape the GFS feature set simply because it
> exists. I've had conversations with people who worked on GFS that
> regret adding record append (see also
> http://queue.acm.org/detail.cfm?id=1594206). In short, unless append
> is a real priority for our users I think we should focus our energy
> elsewhere.
>
> Thanks,
> Eli

Re: [DISCUSS] Remove append?

Posted by Tsz Wo Sze <sz...@yahoo.com>.
 
Some of the information in the email is not correct.  Let me clarify them.
 
> Where we are today.. append was added in the 0.17-19
releases
> (HADOOP-1700) . . .
 
We never have append/sync in 0.17.  Sync was added to 0.18 but not append.  Append was added to 0.19.  By append/sync above, I mean the
implementation by HADOOP-1700.  We also
have HDFS-265, the new append/hflush.  Below are the details.
 
Versions        Features
<= 0.17:        no sync/append
0.18:           1700 sync
0.19.0:         1700 append
0.19.1, 0.20:   1700 append disabled
0.20-append:    append branch used by Facebook
0.20.205.0:     merged 1700 append into 0.20
>= 0.21:        265 append/hflush
 
> . . . To my knowledge, there has been no real production use. . .
 
The reason for no production use today is simply that append is not yet
in a stable release.  Besides, it does not mean append is not useful.
 
> . . . The design however, is much improved, and people think we can
> get hsync (and append) stabilized in trunk (mostly testing and bug
> fixing).
 
hsync is not yet implemented.  I think you may mean hflush.
 
> . . . This probably explains why, over 5 years after the original
> implementation started, we don't have a stable release with append.
 
HADOOP-1700 was committed on July 25, 2008.  I don’t know how it could
be “over 5 years”.  It is well known that append from 0.20.x releases
is not stable and hence probably not used.  It is not the case that we
don’t have a stable release because append is not stable.
 
> Append introduces non-trivial design and code complexity, which is not
> worth the cost if we don't have real users. . . .
 
I don’t agree.  The non-trivial design and code complexity come from
hflush, not append.  Once we have hflush, append is straightforward.
Roughly speaking, the append work is about 10% of the entire
append/hflush work.
 
Moreover, there are real users/use cases as mentioned by Dave and
Milind.
 
The jira that you have created to split the flag into hflush-supported
and append-supported is a good idea.  Folks who do not need append, but
need hflush, can still disable append.
 
Regards,
Nicholas



________________________________
 From: Eli Collins <el...@cloudera.com>
To: hdfs-dev@hadoop.apache.org 
Sent: Tuesday, March 20, 2012 5:37 PM
Subject: [DISCUSS] Remove append?
 
Hey gang,

I'd like to get people's thoughts on the following proposal. I think
we should consider removing append from HDFS.

Where we are today.. append was added in the 0.17-19 releases
(HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
issues. It and sync were re-designed, re-implemented, and shipped in
21.0 (HDFS-265). To my knowledge, there has been no real production
use. Anecdotally people who worked on branch-20-append have told me
they think the new trunk code is substantially less well-tested than
the branch-20-append code (at least for sync, append was never well
tested). It has certainly gotten way less pounding from HBase users.
The design however, is much improved, and people think we can get
hsync (and append) stabilized in trunk (mostly testing and bug
fixing).

Rationale follows..

Append does not seem to be an important requirement, hflush was. There
has not been much demand for append, from users or downstream
projects. Because Hadoop 1.x does not have a working append
implementation (see HDFS-3120; the branch-20-append work was focused
on sync, not on getting append working), one which is not enabled by
default, and because downstream projects will want to support Hadoop
1.x releases for years, most will not introduce dependencies on append
anyway. This is not to say demand does not exist, just that if it
does, it's been much smaller than security, sync, HA, backwards
compatible RPC, etc. This probably explains why, over 5 years after
the original implementation started, we don't have a stable release
with append.

Append introduces non-trivial design and code complexity, which is not
worth the cost if we don't have real users. Removing append means we
have the property that HDFS blocks, when finalized, are immutable.
This significantly simplifies the design and code, which significantly
simplifies the implementation of other features like snapshots,
HDFS-level caching, dedupe, etc.

The vast majority of the HDFS-265 effort is still leveraged w/o
append. The new data durability and read consistency behavior was the
key part.

GFS, which HDFS' design is based on, has append (and atomic record
append) so obviously a workable design does not preclude append.
However we also should not ape the GFS feature set simply because it
exists. I've had conversations with people who worked on GFS that
regret adding record append (see also
http://queue.acm.org/detail.cfm?id=1594206). In short, unless append
is a real priority for our users I think we should focus our energy
elsewhere.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Scott Carey <sc...@richrelevance.com>.

On 3/26/12 12:53 PM, "Colin McCabe" <cm...@alumni.cmu.edu> wrote:

>On Fri, Mar 23, 2012 at 7:44 PM, Scott Carey <sc...@richrelevance.com>
>wrote:
>>
>>
>> On 3/22/12 10:25 AM, "Eli Collins" <el...@cloudera.com> wrote:
>>
>>>On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
>>><sh...@gmail.com> wrote:
>>>> Eli,
>>>>
>>>> I went over the entire discussion on the topic, and did not get it. Is
>>>> there a problem with append? We know it does not work in hadoop-1,
>>>> only flush() does. Is there anything wrong with the new append
>>>> (HDFS-265)? If so please file a bug.
>>>> I tested it in Hadoop-0.22 branch it works fine.
>>>>
>>>> I agree with people who were involved with the implementation of the
>>>> new append that the complexity is mainly in
>>>> 1. pipeline recovery
>>>> 2. consistent client reading while writing, and
>>>> 3. hflush()
>>>> Once it is done the append itself, which is reopening of previously
>>>> closed files for adding data, is not complex.
>>>>
>>>
>>>I agree that much of the complexity is in #1-3 above, which is why
>>>HDFS-265 is leveraged.
>>>The primary simplicity of not having append (and truncate) comes from
>>>not leveraging the invariant that finalized blocks are immutable, that
>>>blocks once written won't eg shrink in size (which we assume today).
>>
>> That invariant can co-exist with append via copy-on-write.  The new
>>state
>> and old state would co-exist until the old state was not needed, a
>>file's
>> block map would have to use a persistent data structure. Copy on write
>> semantics with blocks in file systems is all the rage these days.  Free
>> snapshots, atomic transactions for operations on multiple blocks, etc.
>
>Hi Scott,
>
>If a client accesses a file, and then the client becomes unresponsive,
>how long should you wait before declaring the blocks he was looking at
>unused?  
>No matter how long or how short a period you choose, someone
>will argue with it.

How long does the NN wait now?  What if a client is reading a file, then
becomes unresponsive, and then another client deletes the file today?  At
some point the NN has to unlock the file and allow the delete.
If you choose locking, you have the question of when to expire a lock.  With
MVCC, you have the question of when to retire a reference.  It is the exact
same problem.


>And having to track this kind of state in the
>NameNode introduces a huge amount of complexity, not to mention extra
>memory consumption.  Basically, we would have to track the ID of every
>block that any client looked at, at all times.

There are simple, almost trivial solutions.  java.lang.ref.WeakReference
makes it trivial to track when an object (block reference) is no longer
referenced by client objects so that it can be logged as dead.  Persistent
data structures make it truly trivial to reference only exactly what is
visible to open transactions.  I strongly feel that the result would be
many fewer lines of code and complexity.
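Scott's WeakReference idea can be sketched in a few lines of plain Java. This is a hypothetical illustration, not NameNode code: the Block class, the polling loop, and the "reaper" framing are all assumptions made for the example.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class BlockRefTracker {
    // Hypothetical stand-in for a block reference held by a client.
    static class Block { final long id; Block(long id) { this.id = id; } }

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Block> dead = new ReferenceQueue<>();
        Block block = new Block(42L);
        WeakReference<Block> ref = new WeakReference<>(block, dead);

        // While a client still holds the block reference, it stays live.
        System.out.println("live: " + (ref.get() != null));

        block = null; // last strong reference dropped, as when a client goes away

        // Poll until the collector notices; in a NameNode this could be a
        // background reaper thread retiring dead block references.
        Reference<? extends Block> r = null;
        for (int i = 0; i < 50 && r == null; i++) {
            System.gc();
            r = dead.remove(100); // wait up to 100 ms for the enqueue
        }
        System.out.println("retired: " + (r == ref));
    }
}
```

The attraction is that no timeout policy has to be chosen at all: the reference is retired exactly when nothing holds it anymore.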

Data structures of the sort required have been developed by others over
the last 35 years -- mostly for functional languages -- and there is
still plenty of innovation: the Immutable Bitmapped Vector Trie is a
powerful and fascinating example.  The following presentation is
excellent, and covers the sort of data structures that solve the
problems you list above without the complexity that would be required
if the NN block map were an ephemeral data structure:
http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey

In addition to allowing for atomic transaction batches and lockless file
access, file system snapshots become trivial as well -- they are
equivalent to a permanently open transaction. The space needed for such a
snapshot is proportional to the delta between the snapshot and the current
state.

>
>Colin
>


Re: [DISCUSS] Remove append?

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Fri, Mar 23, 2012 at 7:44 PM, Scott Carey <sc...@richrelevance.com> wrote:
>
>
> On 3/22/12 10:25 AM, "Eli Collins" <el...@cloudera.com> wrote:
>
>>On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
>><sh...@gmail.com> wrote:
>>> Eli,
>>>
>>> I went over the entire discussion on the topic, and did not get it. Is
>>> there a problem with append? We know it does not work in hadoop-1,
>>> only flush() does. Is there anything wrong with the new append
>>> (HDFS-265)? If so please file a bug.
>>> I tested it in Hadoop-0.22 branch it works fine.
>>>
>>> I agree with people who were involved with the implementation of the
>>> new append that the complexity is mainly in
>>> 1. pipeline recovery
>>> 2. consistent client reading while writing, and
>>> 3. hflush()
>>> Once it is done the append itself, which is reopening of previously
>>> closed files for adding data, is not complex.
>>>
>>
>>I agree that much of the complexity is in #1-3 above, which is why
>>HDFS-265 is leveraged.
>>The primary simplicity of not having append (and truncate) comes from
>>not leveraging the invariant that finalized blocks are immutable, that
>>blocks once written won't eg shrink in size (which we assume today).
>
> That invariant can co-exist with append via copy-on-write.  The new state
> and old state would co-exist until the old state was not needed, a file's
> block map would have to use a persistent data structure. Copy on write
> semantics with blocks in file systems is all the rage these days.  Free
> snapshots, atomic transactions for operations on multiple blocks, etc.

Hi Scott,

If a client accesses a file, and then the client becomes unresponsive,
how long should you wait before declaring the blocks he was looking at
unused?  No matter how long or how short a period you choose, someone
will argue with it.  And having to track this kind of state in the
NameNode introduces a huge amount of complexity, not to mention extra
memory consumption.  Basically, we would have to track the ID of every
block that any client looked at, at all times.

Colin


>
>>
>>> You mentioned it and I agree you indeed should be more involved with
>>> your customer base. As for eBay, append was of the motivations to work
>>> on stabilizing 0.22 branch. And there is a lot of use cases which
>>> require append for our customers.
>>> Some of them were mentioned in this discussion.
>>>
>>
>>From what I've seen 0.22 isn't ready for production use. Aside from
>>not supporting critical features like security, it doesn't have a
>>size-able user-base behind it testing and fixing bugs, etc. All things
>>I'd imagine an org like eBay would want.  I've never gotten a request
>>to support 0.22 from a customer.
>>
>>Thanks,
>>Eli
>

Re: [DISCUSS] Remove append?

Posted by Konstantin Boudnik <co...@apache.org>.
On Thu, Mar 22, 2012 at 03:22PM, Eli Collins wrote:
> On Thu, Mar 22, 2012 at 3:11 PM, Konstantin Boudnik <co...@apache.org> wrote:
> > On Thu, Mar 22, 2012 at 10:25AM, Eli Collins wrote:
> >> On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
> >> <sh...@gmail.com> wrote:
> >> > Eli,
> >> >
> >> > I went over the entire discussion on the topic, and did not get it. Is
> >> > there a problem with append? We know it does not work in hadoop-1,
> >> > only flush() does. Is there anything wrong with the new append
> >> > (HDFS-265)? If so please file a bug.
> >> > I tested it in Hadoop-0.22 branch it works fine.
> >> >
> >> > I agree with people who were involved with the implementation of the
> >> > new append that the complexity is mainly in
> >> > 1. pipeline recovery
> >> > 2. consistent client reading while writing, and
> >> > 3. hflush()
> >> > Once it is done the append itself, which is reopening of previously
> >> > closed files for adding data, is not complex.
> >> >
> >> > You mentioned it and I agree you indeed should be more involved with
> >> > your customer base. As for eBay, append was of the motivations to work
> >> > on stabilizing 0.22 branch. And there is a lot of use cases which
> >> > require append for our customers.
> >> > Some of them were mentioned in this discussion.
> >> >
> >>
> >> From what I've seen 0.22 isn't ready for production use. Aside from
> >> not supporting critical features like security, it doesn't have a
> >> size-able user-base behind it testing and fixing bugs, etc. All things
> >> I'd imagine an org like eBay would want.  I've never gotten a request
> >> to support 0.22 from a customer.
> >
> > This statement looks like FUD to me, because eBay (and a coupla other shops,
> > as has been stated elsewhere) are using 0.22 in production and are
> > seemingly happy with that.
> >
> 
> That's my experience, take it for what it's worth.
> 
> Not having important features like security, having very few commits,
> etc is not FUD, you can check that via svn.

Agree. svn statistics are not. However, stating that
">> From what I've seen 0.22 isn't ready for production use"
despite the evidence to the contrary is.

Cos

> 
> Thanks,
> Eli

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 3:11 PM, Konstantin Boudnik <co...@apache.org> wrote:
> On Thu, Mar 22, 2012 at 10:25AM, Eli Collins wrote:
>> On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
>> <sh...@gmail.com> wrote:
>> > Eli,
>> >
>> > I went over the entire discussion on the topic, and did not get it. Is
>> > there a problem with append? We know it does not work in hadoop-1,
>> > only flush() does. Is there anything wrong with the new append
>> > (HDFS-265)? If so please file a bug.
>> > I tested it in Hadoop-0.22 branch it works fine.
>> >
>> > I agree with people who were involved with the implementation of the
>> > new append that the complexity is mainly in
>> > 1. pipeline recovery
>> > 2. consistent client reading while writing, and
>> > 3. hflush()
>> > Once it is done the append itself, which is reopening of previously
>> > closed files for adding data, is not complex.
>> >
>> > You mentioned it and I agree you indeed should be more involved with
>> > your customer base. As for eBay, append was of the motivations to work
>> > on stabilizing 0.22 branch. And there is a lot of use cases which
>> > require append for our customers.
>> > Some of them were mentioned in this discussion.
>> >
>>
>> From what I've seen 0.22 isn't ready for production use. Aside from
>> not supporting critical features like security, it doesn't have a
>> size-able user-base behind it testing and fixing bugs, etc. All things
>> I'd imagine an org like eBay would want.  I've never gotten a request
>> to support 0.22 from a customer.
>
> This statement looks like FUD to me, because eBay (and a coupla other shops,
> as has been stated elsewhere) are using 0.22 in production and are
> seemingly happy with that.
>

That's my experience, take it for what it's worth.

Not having important features like security, having very few commits,
etc is not FUD, you can check that via svn.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Konstantin Boudnik <co...@apache.org>.
On Thu, Mar 22, 2012 at 10:25AM, Eli Collins wrote:
> On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
> <sh...@gmail.com> wrote:
> > Eli,
> >
> > I went over the entire discussion on the topic, and did not get it. Is
> > there a problem with append? We know it does not work in hadoop-1,
> > only flush() does. Is there anything wrong with the new append
> > (HDFS-265)? If so please file a bug.
> > I tested it in Hadoop-0.22 branch it works fine.
> >
> > I agree with people who were involved with the implementation of the
> > new append that the complexity is mainly in
> > 1. pipeline recovery
> > 2. consistent client reading while writing, and
> > 3. hflush()
> > Once it is done the append itself, which is reopening of previously
> > closed files for adding data, is not complex.
> >
> > You mentioned it and I agree you indeed should be more involved with
> > your customer base. As for eBay, append was of the motivations to work
> > on stabilizing 0.22 branch. And there is a lot of use cases which
> > require append for our customers.
> > Some of them were mentioned in this discussion.
> >
> 
> From what I've seen 0.22 isn't ready for production use. Aside from
> not supporting critical features like security, it doesn't have a
> size-able user-base behind it testing and fixing bugs, etc. All things
> I'd imagine an org like eBay would want.  I've never gotten a request
> to support 0.22 from a customer.

This statement looks like FUD to me, because eBay (and a coupla other shops,
as has been stated elsewhere) are using 0.22 in production and are
seemingly happy with that.

And employing FUD always means that there's a reason to bring it about.

Cos

> Thanks,
> Eli

Re: [DISCUSS] Remove append?

Posted by Scott Carey <sc...@richrelevance.com>.

On 3/22/12 10:25 AM, "Eli Collins" <el...@cloudera.com> wrote:

>On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
><sh...@gmail.com> wrote:
>> Eli,
>>
>> I went over the entire discussion on the topic, and did not get it. Is
>> there a problem with append? We know it does not work in hadoop-1,
>> only flush() does. Is there anything wrong with the new append
>> (HDFS-265)? If so please file a bug.
>> I tested it in Hadoop-0.22 branch it works fine.
>>
>> I agree with people who were involved with the implementation of the
>> new append that the complexity is mainly in
>> 1. pipeline recovery
>> 2. consistent client reading while writing, and
>> 3. hflush()
>> Once it is done the append itself, which is reopening of previously
>> closed files for adding data, is not complex.
>>
>
>I agree that much of the complexity is in #1-3 above, which is why
>HDFS-265 is leveraged.
>The primary simplicity of not having append (and truncate) comes from
>not leveraging the invariant that finalized blocks are immutable, that
>blocks once written won't eg shrink in size (which we assume today).

That invariant can co-exist with append via copy-on-write.  The new state
and old state would co-exist until the old state was not needed, a file's
block map would have to use a persistent data structure. Copy on write
semantics with blocks in file systems is all the rage these days.  Free
snapshots, atomic transactions for operations on multiple blocks, etc.
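The copy-on-write idea above can be sketched in a few lines of Java. This is an illustrative toy, not HDFS code: BlockMap and the block-ID strings are invented names, and a real persistent structure would share tails rather than copy the whole pointer list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// A copy-on-write block map: each file version is an immutable snapshot;
// appending produces a new version while readers holding the old version
// still see the old block list unchanged.
final class BlockMap {
    private final List<String> blocks; // IDs of finalized, immutable blocks

    BlockMap(List<String> blocks) {
        this.blocks = Collections.unmodifiableList(new ArrayList<>(blocks));
    }

    // Append never mutates an existing version: it copies the pointer
    // list (not the blocks themselves) into a new snapshot.
    BlockMap append(String newBlockId) {
        List<String> next = new ArrayList<>(blocks);
        next.add(newBlockId);
        return new BlockMap(next);
    }

    List<String> blocks() { return blocks; }
}

public class CowDemo {
    public static void main(String[] args) {
        BlockMap v1 = new BlockMap(List.of("blk_1", "blk_2"));
        BlockMap v2 = v1.append("blk_3"); // the writer's new view

        // A reader (or snapshot) holding v1 is unaffected by the append.
        System.out.println("v1: " + v1.blocks());
        System.out.println("v2: " + v2.blocks());
    }
}
```

A snapshot here is just a retained old version, which is why the space cost is proportional to the delta between versions.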

>
>> You mentioned it and I agree you indeed should be more involved with
>> your customer base. As for eBay, append was of the motivations to work
>> on stabilizing 0.22 branch. And there is a lot of use cases which
>> require append for our customers.
>> Some of them were mentioned in this discussion.
>>
>
>From what I've seen 0.22 isn't ready for production use. Aside from
>not supporting critical features like security, it doesn't have a
>size-able user-base behind it testing and fixing bugs, etc. All things
>I'd imagine an org like eBay would want.  I've never gotten a request
>to support 0.22 from a customer.
>
>Thanks,
>Eli


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
<sh...@gmail.com> wrote:
> Eli,
>
> I went over the entire discussion on the topic, and did not get it. Is
> there a problem with append? We know it does not work in hadoop-1,
> only flush() does. Is there anything wrong with the new append
> (HDFS-265)? If so please file a bug.
> I tested it in Hadoop-0.22 branch it works fine.
>
> I agree with people who were involved with the implementation of the
> new append that the complexity is mainly in
> 1. pipeline recovery
> 2. consistent client reading while writing, and
> 3. hflush()
> Once it is done the append itself, which is reopening of previously
> closed files for adding data, is not complex.
>

I agree that much of the complexity is in #1-3 above, which is why
HDFS-265 is leveraged.
The primary simplicity of not having append (and truncate) comes from
not leveraging the invariant that finalized blocks are immutable, that
blocks once written won't eg shrink in size (which we assume today).

> You mentioned it and I agree you indeed should be more involved with
> your customer base. As for eBay, append was of the motivations to work
> on stabilizing 0.22 branch. And there is a lot of use cases which
> require append for our customers.
> Some of them were mentioned in this discussion.
>

From what I've seen 0.22 isn't ready for production use. Aside from
not supporting critical features like security, it doesn't have a
size-able user-base behind it testing and fixing bugs, etc. All things
I'd imagine an org like eBay would want.  I've never gotten a request
to support 0.22 from a customer.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Konstantin Shvachko <sh...@gmail.com>.
Eli,

I went over the entire discussion on the topic, and did not get it. Is
there a problem with append? We know it does not work in hadoop-1,
only flush() does. Is there anything wrong with the new append
(HDFS-265)? If so please file a bug.
I tested it in Hadoop-0.22 branch it works fine.

I agree with people who were involved with the implementation of the
new append that the complexity is mainly in
1. pipeline recovery
2. consistent client reading while writing, and
3. hflush()
Once it is done the append itself, which is reopening of previously
closed files for adding data, is not complex.

You mentioned it and I agree you indeed should be more involved with
your customer base. As for eBay, append was one of the motivations to
work on stabilizing the 0.22 branch. And there are a lot of use cases
which require append for our customers.
Some of them were mentioned in this discussion.

Thanks,
--Konstantin


On Tue, Mar 20, 2012 at 5:37 PM, Eli Collins <el...@cloudera.com> wrote:
> Hey gang,
>
> I'd like to get people's thoughts on the following proposal. I think
> we should consider removing append from HDFS.
>
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
> issues. It and sync were re-designed, re-implemented, and shipped in
> 21.0 (HDFS-265). To my knowledge, there has been no real production
> use. Anecdotally people who worked on branch-20-append have told me
> they think the new trunk code is substantially less well-tested than
> the branch-20-append code (at least for sync, append was never well
> tested). It has certainly gotten way less pounding from HBase users.
> The design however, is much improved, and people think we can get
> hsync (and append) stabilized in trunk (mostly testing and bug
> fixing).
>
> Rationale follows..
>
> Append does not seem to be an important requirement, hflush was. There
> has not been much demand for append, from users or downstream
> projects. Because Hadoop 1.x does not have a working append
> implementation (see HDFS-3120; the branch-20-append work was focused
> on sync, not on getting append working), one which is not enabled by
> default, and because downstream projects will want to support Hadoop
> 1.x releases for years, most will not introduce dependencies on append
> anyway. This is not to say demand does not exist, just that if it
> does, it's been much smaller than security, sync, HA, backwards
> compatible RPC, etc. This probably explains why, over 5 years after
> the original implementation started, we don't have a stable release
> with append.
> Append introduces non-trivial design and code complexity, which is not
> worth the cost if we don't have real users. Removing append means we
> have the property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly
> simplifies the implementation of other features like snapshots,
> HDFS-level caching, dedupe, etc.
>
> The vast majority of the HDFS-265 effort is still leveraged w/o
> append. The new data durability and read consistency behavior was the
> key part.
>
> GFS, which HDFS' design is based on, has append (and atomic record
> append) so obviously a workable design does not preclude append.
> However we also should not ape the GFS feature set simply because it
> exists. I've had conversations with people who worked on GFS that
> regret adding record append (see also
> http://queue.acm.org/detail.cfm?id=1594206). In short, unless append
> is a real priority for our users I think we should focus our energy
> elsewhere.
>
> Thanks,
> Eli

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Thu, Mar 22, 2012 at 10:15 AM, Daryn Sharp <da...@yahoo-inc.com> wrote:
> On Mar 20, 2012, at 7:37 PM, Eli Collins wrote:
>> Hey gang,
>>
>> I'd like to get people's thoughts on the following proposal. I think
>> we should consider removing append from HDFS.
>>
>> Where we are today.. append was added in the 0.17-19 releases
>> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
>> issues. It and sync were re-designed, re-implemented, and shipped in
>> 21.0 (HDFS-265). To my knowledge, there has been no real production
>> use. Anecdotally people who worked on branch-20-append have told me
>> they think the new trunk code is substantially less well-tested than
>> the branch-20-append code (at least for sync, append was never well
>> tested). It has certainly gotten way less pounding from HBase users.
>> The design however, is much improved, and people think we can get
>> hsync (and append) stabilized in trunk (mostly testing and bug
>> fixing).
>
> Up front:  I think append is a needed feature.
>

Can you elaborate? E.g., are there particular use cases at Yahoo! that
have been running for years and are itching to start using append when
0.23 is deployed?  Are you guys testing the new append implementation
extensively because you have an app that's ready to use it when 0.23 is
deployed?

So far Milind has been the only one to chime in saying "we really need
append, here's why". Which is great.

> Politely speaking, I think the premise of the question is a bit dubious due to its circular nature.  I.e., it's not used in production, so is it worth it?

No, I'm saying we've absorbed a lot of complexity for it, but I don't
see downstream projects using it any time soon.  Similarly there
hasn't been a big push to get it working, eg there was a push for
security and hbase support on 20, but not append (the append rewrite
was invasive but so was security). It will have been several years
from the time the rewrite was started until it gets deployed in
production, which makes me think it's less of a priority. So much so
that I wondered whether it was a big priority at all.

> The stigma/perception that append has been unstable and is not well-tested is a compelling reason for it not to be in production at major installations.  The situation is going to be akin to "You go first. No, you go first!  No way, you go first!".
>
> Downstream projects also aren't going to use something until it's stable, so they either work around the limitation, or...  they choose something other than hdfs.  There's also the unanswerable question of how many potential users have been silently lost.  We are unlikely to have heard the user demand from those that chose another solution.  Generally for every complaint/request, a large N-many people didn't even bother.
>
> I envision a day where hdfs is a performant posix filesystem.

I think that's unlikely. Posix compliance is a non-goal for HDFS. We
are intentionally not-compliant in many cases to achieve scale and
performance. Check out the Ceph file system paper (they manage the
tradeoff between Posix and scale/performance explicitly). The primary
motivation for Posix compliance is compatibility with existing
Unix-like software. That's not HDFS' raison d'etre (which is the
ecosystem of projects that run atop HDFS: MR, HBase, Pig, Hive, Flume,
Sqoop, etc etc).  HDFS' focus on its core use case and simplicity is
one of the reasons it's been as successful as it has been.

That's not to say we don't need to do a lot more work to better
integrate HDFS with existing software and tools. Fuse-DFS is just a
start. We need to support a standard interface like NFS etc. These
efforts do not require HDFS become fully Posix compliant.

There's always a trade off between adding more features (increasing
the size of the addressable market) and focusing on your core uses
cases, quality, etc.  In my mind append is on the boundary. I'm happy
to be convinced that append is in HDFS' wheel house.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Daryn Sharp <da...@yahoo-inc.com>.
On Mar 20, 2012, at 7:37 PM, Eli Collins wrote:
> Hey gang,
> 
> I'd like to get people's thoughts on the following proposal. I think
> we should consider removing append from HDFS.
> 
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
> issues. It and sync were re-designed, re-implemented, and shipped in
> 21.0 (HDFS-265). To my knowledge, there has been no real production
> use. Anecdotally people who worked on branch-20-append have told me
> they think the new trunk code is substantially less well-tested than
> the branch-20-append code (at least for sync, append was never well
> tested). It has certainly gotten way less pounding from HBase users.
> The design however, is much improved, and people think we can get
> hsync (and append) stabilized in trunk (mostly testing and bug
> fixing).

Up front:  I think append is a needed feature.

Politely speaking, I think the premise of the question is a bit dubious due to its circular nature.  Ie. it's not used in production, so is it worth it?  The stigma/perception that append has been unstable and is not well-tested is a compelling reason for it not to be in production at major installations.  The situation is going to be akin to "You go first. No, you go first!  No way, you go first!".

Downstream projects also aren't going to use something until it's stable, so they either work around the limitation, or...  they chose something other than hdfs.  There's also the unanswerable question of how potential users have been silently lost.  We are unlikely to have heard the user demand from those that chose another solution.  Generally for every complaint/request, a large N-many people didn't even bother.

I envision a day where hdfs is a performant posix filesystem.  Dropping append sets us back from that goal.  Admittedly, I don't know all the intricacies of how append was implemented and why it is/was difficult.  Is the complexity maybe due to "bolting" append onto code that wasn't designed with mutability in mind?  (That's truly a question, not a statement) If so, perhaps a refactoring would simplify the code?

Dropping append also might be used as a cudgel against hdfs.  Cynically speaking, do we want to risk marketeers from certain competitors to say or imply:  Trust your data with us because we're so brilliant that we have a feature hdfs has repeatedly tried and failed to implement!

Daryn

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 12:30 PM,  <Mi...@emc.com> wrote:
>
>>1. If the daily files are smaller than 1 block (seems unlikely)
>
> Even at a large hdfs installation, the avg file size was < 1.5 blocks.
> Bucketing causes the file sizes to drop.
>
>>2. The small files problem (a typical NN can store 100-200M files, so
>>a problem for big users)
>
> Big users probably have enough people to write their own roll-up code to
> avoid the small-files problem. It's the rest that are used to storage systems
> handling billions of files.
>

HDFS does as well, you can federate NNs to support billions of files.
There's no fundamental max # files limitation in the design or latest
implementation.  I suspect we could support another 2x # files and #
blocks per NN if we wanted by being more clever in how we store MD.

One of the reasons HDFS scales better (and is less buggy) than these
other systems is that its design is simpler, eg maintaining all MD
in memory vs paging it. We don't want to lose these properties in the
bargain.

Thanks,
Eli

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 3:48 PM,  <Mi...@emc.com> wrote:
> Eli,
>
> If HDFS-3120 is committed to both 1.x and trunk/0.23.x, then one will be
> able to disable appends (while keeping hflush) using different config
> variables. By default (I.e. In hdfs-default.xml), we should set
> dfs.support.append to false, and dfs.support.hsync to true.
>

Agree, thanks for the thoughts.

> That way, we get enough time to fix append, and if we decide to remove it,
> then we can do that without causing major distress in 0.24.
>
> Thoughts ?
>

Sounds good.

Thanks,
Eli

> - Milind
>
> On 3/21/12 3:16 PM, "Eli Collins" <el...@cloudera.com> wrote:
>
>>On Wed, Mar 21, 2012 at 3:06 PM,  <Mi...@emc.com> wrote:
>>>
>>>>
>>>>Absolutely, I'd like to learn more about what append/truncate buys us.
>>>
>>> Indeed. Let's postpone this discussion to Q2 then.
>>>
>>
>>I'd still like to hear what other people think if they haven't chimed
>>in.  Even if we decide to remove it, I don't think we need to do so
>>next week, eg can wait to hear more about what you're working on.
>>
>>One of the reasons I raised this topic now is that in the not too
>>distant future 0.23 will become the stable release, and we'll
>>effectively lose the ability to remove append once we're stable.  Not
>>that I expect people will stabilize append before this happens, it
>>doesn't seem to be a priority for anyone, though perhaps you'll end up
>>doing that work for your project.
>>
>>Thanks,
>>Eli
>>
>>> Thanks,
>>>
>>> - milind
>>>
>>> ---
>>> Milind Bhandarkar
>>> Greenplum Labs, EMC
>>> (Disclaimer: Opinions expressed in this email are those of the author,
>>>and
>>> do not necessarily represent the views of any organization, past or
>>> present, the author might be affiliated with.)
>>>
>>>
>>>>
>>>
>>
>

Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
Eli,

If HDFS-3120 is committed to both 1.x and trunk/0.23.x, then one will be
able to disable appends (while keeping hflush) using different config
variables. By default (I.e. In hdfs-default.xml), we should set
dfs.support.append to false, and dfs.support.hsync to true.
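
For illustration only, a sketch of what those defaults might look like as an hdfs-default.xml fragment (the property names are the ones proposed above; the description text here is my paraphrase, not actual shipped documentation):

```xml
<!-- Illustrative hdfs-default.xml fragment: disable append while
     keeping hflush/hsync. Descriptions are paraphrased. -->
<property>
  <name>dfs.support.append</name>
  <value>false</value>
  <description>Disable reopening finalized files for append until
  the implementation is stabilized.</description>
</property>
<property>
  <name>dfs.support.hsync</name>
  <value>true</value>
  <description>Keep the hflush/hsync durability guarantees enabled.</description>
</property>
```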

That way, we get enough time to fix append, and if we decide to remove it,
then we can do that without causing major distress in 0.24.

Thoughts ?

- Milind

On 3/21/12 3:16 PM, "Eli Collins" <el...@cloudera.com> wrote:

>On Wed, Mar 21, 2012 at 3:06 PM,  <Mi...@emc.com> wrote:
>>
>>>
>>>Absolutely, I'd like to learn more about what append/truncate buys us.
>>
>> Indeed. Let's postpone this discussion to Q2 then.
>>
>
>I'd still like to hear what other people think if they haven't chimed
>in.  Even if we decide to remove it, I don't think we need to do so
>next week, eg can wait to hear more about what you're working on.
>
>One of the reasons I raised this topic now is that in the not too
>distant future 0.23 will become the stable release, and we'll
>effectively lose the ability to remove append once we're stable.  Not
>that I expect people will stabilize append before this happens, it
>doesn't seem to be a priority for anyone, though perhaps you'll end up
>doing that work for your project.
>
>Thanks,
>Eli
>
>> Thanks,
>>
>> - milind
>>
>> ---
>> Milind Bhandarkar
>> Greenplum Labs, EMC
>> (Disclaimer: Opinions expressed in this email are those of the author,
>>and
>> do not necessarily represent the views of any organization, past or
>> present, the author might be affiliated with.)
>>
>>
>>>
>>
>


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 3:06 PM,  <Mi...@emc.com> wrote:
>
>>
>>Absolutely, I'd like to learn more about what append/truncate buys us.
>
> Indeed. Let's postpone this discussion to Q2 then.
>

I'd still like to hear what other people think if they haven't chimed
in.  Even if we decide to remove it, I don't think we need to do so
next week, eg can wait to hear more about what you're working on.

One of the reasons I raised this topic now is that in the not too
distant future 0.23 will become the stable release, and we'll
effectively lose the ability to remove append once we're stable.  Not
that I expect people will stabilize append before this happens, it
doesn't seem to be a priority for anyone, though perhaps you'll end up
doing that work for your project.

Thanks,
Eli

> Thanks,
>
> - milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>>
>

Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
>
>Absolutely, I'd like to learn more about what append/truncate buys us.

Indeed. Let's postpone this discussion to Q2 then.

Thanks,

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)


>


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 12:48 PM,  <Mi...@emc.com> wrote:
> Eli,
>
> To clarify a little bit, I think HDFS-3120 is the right thing to do, to
> disable appends, while still enabling hsync in branch-1.
>
> But, going forward, (say 0.23+) having appends working correctly will
> definitely add value, and make HDFS more palatable for lots of other
> workloads.
>
> Of course, I have a vested interest in this, because our team is working
> on a project that requires append and truncate, and we will be testing it
> thoroughly at scale in Q2 this year. Would it be okay to wait for the
> results of this testing ?

Absolutely, I'd like to learn more about what append/truncate buys us.

Thanks,
Eli

>
> Thanks,
>
> - milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>

Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
I would also like to point to work being done on PLFS-HDFS:
http://institute.lanl.gov/isti/irhpit/presentations/PLFS-HDFS.pdf

This would be made much simpler by allowing appends.

Checkpointing in MPI is a very common use-case, and after Hamster,
PLFS-HDFS becomes an attractive way to do this.

(Section 2 of the 2009 HotCloud paper by PDL:
http://www.cs.cmu.edu/~svp/2009hotcloud-tablefs.pdf discusses the reasons
for seeking commonalities between HPC and DISC file systems.)

- Milind


On 3/21/12 12:48 PM, "Bhandarkar, Milind" <Mi...@emc.com>
wrote:

>Eli,
>
>To clarify a little bit, I think HDFS-3120 is the right thing to do, to
>disable appends, while still enabling hsync in branch-1.
>
>But, going forward, (say 0.23+) having appends working correctly will
>definitely add value, and make HDFS more palatable for lots of other
>workloads.
>
>Of course, I have a vested interest in this, because our team is working
>on a project that requires append and truncate, and we will be testing it
>thoroughly at scale in Q2 this year. Would it be okay to wait for the
>results of this testing ?
>
>Thanks,
>
>- milind
>
>---
>Milind Bhandarkar
>Greenplum Labs, EMC
>(Disclaimer: Opinions expressed in this email are those of the author, and
>do not necessarily represent the views of any organization, past or
>present, the author might be affiliated with.)
>
>


Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
Eli,

To clarify a little bit, I think HDFS-3120 is the right thing to do, to
disable appends, while still enabling hsync in branch-1.

But, going forward, (say 0.23+) having appends working correctly will
definitely add value, and make HDFS more palatable for lots of other
workloads.

Of course, I have a vested interest in this, because our team is working
on a project that requires append and truncate, and we will be testing it
thoroughly at scale in Q2 this year. Would it be okay to wait for the
results of this testing ?

Thanks,

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)


Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
>1. If the daily files are smaller than 1 block (seems unlikely)

Even at a large hdfs installation, the avg file size was < 1.5 blocks.
Bucketing causes the file sizes to drop.

>2. The small files problem (a typical NN can store 100-200M files, so
>a problem for big users)

Big users probably have enough people to write their own roll-up code to
avoid the small-files problem. It's the rest that are used to storage systems
handling billions of files.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)



>
>In which case maybe better to focus on #2 rather than work around it?
>
>Thanks,
>Eli
>
>>
>> Reducing number of files this way also makes it easy to copy, take
>> snapshots etc without having to write special parallel code to do it.
>>
>>>
>>>I assume the 2nd one refers to not having to Multi*InputFormat. And
>>>the 3rd refers to appending to an old file instead of creating a new
>>>one.
>>
>> Yes.
>>
>>>
>>>> In addition, the small-files problem in HDFS forces people to write MR
>>>> code, and causes rewrite of large datasets even if a small amount of
>>>>data
>>>> is added to it.
>>
>>
>>>
>>>Do people rewrite large datasets today just to add 1mb? I haven't
>>>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>>>customer base.  If so I would have expected people to put energy
>>>into getting append working in 1.x which know was has put energy into
>>>(I know some people feel the 20-based design is unworkable, I don't
>>>know it well enough to comment there).
>>
>> With HDFS, they do not rewrite large datasets just to add a small amount
>> of data. Instead they create new files, and use a separate
>> metadata-service (or just file numbering conventions) to make the added
>> data part of the large dataset. But with other file systems, they just
>> ">>".
>>
>> Thanks,
>>
>> - milind
>>
>>
>>>---
>>>Milind Bhandarkar
>>>Greenplum Labs, EMC
>>>(Disclaimer: Opinions expressed in this email are those of the author,
>>>and do not necessarily represent the views of any organization, past or
>>>present, the author might be affiliated with.)
>>
>


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 10:47 AM,  <Mi...@emc.com> wrote:
> Answers inline.
>
> On 3/21/12 10:32 AM, "Eli Collins" <el...@cloudera.com> wrote:
>
>>
>>Why not just write new files and use Har files, because Har files are a
>>pita?
>
> Yes, and har creation is an MR job, which is totally I/O bound, and yet
> takes up slots/containers, reducing cluster utilization.
>
>>Can you elaborate on the 1st one, how it's especially helpful for
>>archival?
>
> Say you have daily log files (consider many small job history files).
> Instead of keeping them as separate files, one appends them to monthly
> files (this in itself is a complete rewrite), but appending monthly files
> to year-to-date files should not require rewrite (because after March, it
> becomes very inefficient.)

Why not just keep the original daily files instead of continually
either rewriting (yuck) or duplicating (yuck) the data by aggregating
them into rollups?  I can think of two reasons:

1. If the daily files are smaller than 1 block (seems unlikely)
2. The small files problem (a typical NN can store 100-200M files, so
a problem for big users)

In which case maybe better to focus on #2 rather than work around it?
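
To make #2 concrete, here's a back-of-envelope sketch of the NameNode heap cost of small files. The per-object byte figures are assumptions (common rules of thumb, not numbers from this thread):

```python
# Back-of-envelope NameNode heap estimate for the small-files problem.
# Assumed figures (rules of thumb, NOT from this thread): roughly 150
# bytes of NN heap per file object and per block object.
BYTES_PER_FILE = 150
BYTES_PER_BLOCK = 150

def nn_heap_gib(num_files, blocks_per_file=1.5):
    """Approximate NameNode heap (GiB) needed to track num_files files."""
    total_bytes = num_files * BYTES_PER_FILE \
        + num_files * blocks_per_file * BYTES_PER_BLOCK
    return total_bytes / 2**30

# 200M files at ~1.5 blocks each needs on the order of tens of GiB of
# heap just for metadata; the same bytes rolled up into fewer large
# files would need proportionally less.
print(round(nn_heap_gib(200_000_000), 1))
```

Under these assumptions the metadata footprint scales with file count, not data size, which is why rollups (or append) relieve pressure on the NN.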

Thanks,
Eli

>
> Reducing number of files this way also makes it easy to copy, take
> snapshots etc without having to write special parallel code to do it.
>
>>
>>I assume the 2nd one refers to not having to Multi*InputFormat. And
>>the 3rd refers to appending to an old file instead of creating a new
>>one.
>
> Yes.
>
>>
>>> In addition, the small-files problem in HDFS forces people to write MR
>>> code, and causes rewrite of large datasets even if a small amount of
>>>data
>>> is added to it.
>
>
>>
>>Do people rewrite large datasets today just to add 1mb? I haven't
>>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>>customer base.  If so I would have expected people to put energy
>>into getting append working in 1.x which know was has put energy into
>>(I know some people feel the 20-based design is unworkable, I don't
>>know it well enough to comment there).
>
> With HDFS, they do not rewrite large datasets just to add a small amount
> of data. Instead they create new files, and use a separate
> metadata-service (or just file numbering conventions) to make the added
> data part of the large dataset. But with other file systems, they just
> ">>".
>
> Thanks,
>
> - milind
>
>
>>---
>>Milind Bhandarkar
>>Greenplum Labs, EMC
>>(Disclaimer: Opinions expressed in this email are those of the author,
>>and do not necessarily represent the views of any organization, past or
>>present, the author might be affiliated with.)
>

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
On Wed, Mar 21, 2012 at 10:32 AM, Eli Collins <el...@cloudera.com> wrote:
> Thanks for the feedback Milind, questions inline.
>
> On Wed, Mar 21, 2012 at 10:17 AM,  <Mi...@emc.com> wrote:
>> As someone who has worked with hdfs-compatible distributed file systems
>> that support append, I can vouch for its extensive usage.
>>
>> I have seen how simple it becomes to create tar archives, and later append
>> files to them, without writing special inefficient code to do so.
>>
>
> Why not just write new files and use Har files, because Har files are a pita?
>
>> I have seen it used in archiving cold data, reducing MR task launch
>> overhead without having to use a different input format, so that the same
>> code can be used for both hot and cold data.
>>
>
> Can you elaborate on the 1st one, how it's especially helpful for archival?
>
> I assume the 2nd one refers to not having to Multi*InputFormat. And
> the 3rd refers to appending to an old file instead of creating a new
> one.
>
>> In addition, the small-files problem in HDFS forces people to write MR
>> code, and causes rewrite of large datasets even if a small amount of data
>> is added to it.
>
> Do people rewrite large datasets today just to add 1mb? I haven't
> heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
> customer base.  If so I would have expected people to put energy
> into getting append working in 1.x which know was has put energy into

Arg, that should read "no one has put energy into".  </drinks coffee>

Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
Answers inline.

On 3/21/12 10:32 AM, "Eli Collins" <el...@cloudera.com> wrote:

>
>Why not just write new files and use Har files, because Har files are a
>pita?

Yes, and har creation is an MR job, which is totally I/O bound, and yet
takes up slots/containers, reducing cluster utilization.

>Can you elaborate on the 1st one, how it's especially helpful for
>archival?

Say you have daily log files (consider many small job history files).
Instead of keeping them as separate files, one appends them to monthly
files (this in itself is a complete rewrite), but appending monthly files
to year-to-date files should not require rewrite (because after March, it
becomes very inefficient.)

Reducing number of files this way also makes it easy to copy, take
snapshots etc without having to write special parallel code to do it.
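
For the sake of illustration, a minimal sketch of this rollup pattern, using local-file appends as a stand-in for an HDFS append (the helper and file names are hypothetical; with a working FileSystem.append, the destination would be opened for append rather than rewritten):

```python
import os

def roll_up(dest_path, part_paths, chunk=1 << 20):
    """Append each part file's bytes onto dest_path, then remove the part.

    Local-filesystem stand-in for the monthly -> year-to-date rollup
    described above: opening dest in append mode avoids rewriting the
    ever-growing destination file each time new parts arrive.
    """
    with open(dest_path, "ab") as dest:
        for part in part_paths:
            with open(part, "rb") as src:
                while True:
                    buf = src.read(chunk)
                    if not buf:
                        break
                    dest.write(buf)
            os.remove(part)  # the part's bytes now live in the rollup
```

Without append, the equivalent on HDFS is a full copy of the destination plus the parts into a new file, which is the rewrite cost being discussed.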

>
>I assume the 2nd one refers to not having to Multi*InputFormat. And
>the 3rd refers to appending to an old file instead of creating a new
>one.

Yes.

>
>> In addition, the small-files problem in HDFS forces people to write MR
>> code, and causes rewrite of large datasets even if a small amount of
>>data
>> is added to it.


>
>Do people rewrite large datasets today just to add 1mb? I haven't
>heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
>customer base.  If so I would have expected people to put energy
>into getting append working in 1.x which know was has put energy into
>(I know some people feel the 20-based design is unworkable, I don't
>know it well enough to comment there).

With HDFS, they do not rewrite large datasets just to add a small amount
of data. Instead they create new files, and use a separate
metadata-service (or just file numbering conventions) to make the added
data part of the large dataset. But with other file systems, they just
">>".

Thanks,

- milind


>---
>Milind Bhandarkar
>Greenplum Labs, EMC
>(Disclaimer: Opinions expressed in this email are those of the author,
>and do not necessarily represent the views of any organization, past or
>present, the author might be affiliated with.)


Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
Thanks for the feedback Milind, questions inline.

On Wed, Mar 21, 2012 at 10:17 AM,  <Mi...@emc.com> wrote:
> As someone who has worked with hdfs-compatible distributed file systems
> that support append, I can vouch for its extensive usage.
>
> I have seen how simple it becomes to create tar archives, and later append
> files to them, without writing special inefficient code to do so.
>

Why not just write new files and use Har files, because Har files are a pita?

> I have seen it used in archiving cold data, reducing MR task launch
> overhead without having to use a different input format, so that the same
> code can be used for both hot and cold data.
>

Can you elaborate on the 1st one, how it's especially helpful for archival?

I assume the 2nd one refers to not having to Multi*InputFormat. And
the 3rd refers to appending to an old file instead of creating a new
one.

> In addition, the small-files problem in HDFS forces people to write MR
> code, and causes rewrite of large datasets even if a small amount of data
> is added to it.

Do people rewrite large datasets today just to add 1mb? I haven't
heard of that from big users (Yahoo!, FB, Twitter, eBay..) or my
customer base.  If so I would have expected people to put energy
into getting append working in 1.x which know was has put energy into
(I know some people feel the 20-based design is unworkable, I don't
know it well enough to comment there).

Thanks,
Eli

>
> So, there is clearly a need for it, AFAIK.
>
> +1 on fixing it. Please let me know if you need help.
>
> - milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 3/21/12 5:36 AM, "Dave Shine" <Da...@channelintelligence.com>
> wrote:
>
>>I am not a contributor to this project, so I don't know how much weight
>>my opinion carries.  But I have been hoping to see append become stable
>>soon.  We are constantly dealing with the "small file problem", and I
>>have written M/R jobs to periodically roll up lots of small files into a
>>few small ones.  Having append would prevent me from needing to use up
>>cluster resources performing these tasks.
>>
>>Therefore, all things being equal I +1 making append work.  However, if
>>the level of complexity is as bad as Eli implies below, then I can
>>understand that perhaps it is not worth the effort. If it will cause too
>>much technical debt, then removing it makes sense.  But don't just remove
>>it because you don't believe there is a need for it.
>>
>>Thanks,
>>Dave Shine
>>
>>
>>-----Original Message-----
>>From: Eli Collins [mailto:eli@cloudera.com]
>>Sent: Tuesday, March 20, 2012 8:38 PM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: [DISCUSS] Remove append?
>>
>>Hey gang,
>>
>>I'd like to get people's thoughts on the following proposal. I think we
>>should consider removing append from HDFS.
>>
>>Where we are today.. append was added in the 0.17-19 releases
>>(HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality
>>issues. It and sync were re-designed, re-implemented, and shipped in
>>21.0 (HDFS-265). To my knowledge, there has been no real production use.
>>Anecdotally people who worked on branch-20-append have told me they think
>>the new trunk code is substantially less well-tested than the
>>branch-20-append code (at least for sync, append was never well tested).
>>It has certainly gotten way less pounding from HBase users.
>>The design however, is much improved, and people think we can get hsync
>>(and append) stabilized in trunk (mostly testing and bug fixing).
>>
>>Rationale follows..
>>
>>Append does not seem to be an important requirement, hflush was. There
>>has not been much demand for append, from users or downstream projects.
>>Because Hadoop 1.x does not have a working append implementation (see
>>HDFS-3120, the branch-20-append work was focused on sync not getting
>>append working) which is not enabled by default and downstream projects
>>will want to support Hadoop 1.x releases for years, most will not
>>introduce dependencies on append anyway. This is not to say demand does
>>not exist, just that if it does, it's been much smaller than security,
>>sync, HA, backwards compatible RPC, etc. This probably explains why, over
>>5 years after the original implementation started, we don't have a stable
>>release with append.
>>
>>Append introduces non-trivial design and code complexity, which is not
>>worth the cost if we don't have real users. Removing append means we have
>>the property that HDFS blocks, when finalized, are immutable.
>>This significantly simplifies the design and code, which significantly
>>simplifies the implementation of other features like snapshots,
>>HDFS-level caching, dedupe, etc.
>>
>>The vast majority of the HDFS-265 effort is still leveraged w/o append.
>>The new data durability and read consistency behavior was the key part.
>>
>>GFS, which HDFS' design is based on, has append (and atomic record
>>append) so obviously a workable design does not preclude append.
>>However we also should not ape the GFS feature set simply because it
>>exists. I've had conversations with people who worked on GFS that regret
>>adding record append (see also
>>http://queue.acm.org/detail.cfm?id=1594206). In short, unless append is a
>>real priority for our users I think we should focus our energy elsewhere.
>>
>>Thanks,
>>Eli
>>
>>The information contained in this email message is considered
>>confidential and proprietary to the sender and is intended solely for
>>review and use by the named recipient. Any unauthorized review, use or
>>distribution is strictly prohibited. If you have received this message in
>>error, please advise the sender by reply email and delete the message.
>>
>

RE: [DISCUSS] Remove append?

Posted by Dave Shine <Da...@channelintelligence.com>.
I never brought it up on the CDH list because I was told during my CDH training (Dec 2010) that it was already there.  When I later learned it was usable only for HBase, I just assumed it would be coming, eventually.

Dave


-----Original Message-----
From: Eli Collins [mailto:eli@cloudera.com] 
Sent: Wednesday, March 21, 2012 2:52 PM
To: hdfs-dev@hadoop.apache.org
Subject: Re: [DISCUSS] Remove append?

Good point. I thought I'd start with devs first. If you can't get it past devs there's no reason to go further.

Also, users will tell you they want everything. I'd like to root cause this, eg if they want append to solve the small files problem I'd like to know if solving the latter means we don't have to do the former.

ps - fwiw the cdh-user@ mailing list has 800 people on it and it's rarely requested. Ditto in customer conversations. However the user base continues to grow rapidly and change in makeup so the past isn't necessarily a good predictor.

Thanks,
Eli

On Wed, Mar 21, 2012 at 11:31 AM, Tim Broberg <Ti...@exar.com> wrote:
> No specific advice on this particular issue, but in general, I learned the hard way to stop asking the question, "Feature X is hard to support, is anybody really going to use this?" *Every time* I have asked this question, I get the answer I want to hear. *Every time*, they come back and ask for the feature back later and it's more work than it would have been if I had just planned for it from the beginning.
>
> YMMV, and I'm always asking marketing guys whereas you're asking developers.
>
> Ok, there's one piece of specific advice: Go find the people that will tell you what you don't want to hear. Ask hdfs-user's whether they need the feature rather than hdfs-dev's.
>
> We all have too much empathy for your position here to make you suffer.
>
>    - Tim.
>
> -----Original Message-----
> From: Eli Collins [mailto:eli@cloudera.com]
> Sent: Tuesday, March 20, 2012 8:38 PM
> To: hdfs-dev@hadoop.apache.org
> Subject: [DISCUSS] Remove append?
>
> Hey gang,
>
> I'd like to get people's thoughts on the following proposal. I think we should consider removing append from HDFS.
>
> Where we are today.. append was added in the 0.17-19 releases
> (HADOOP-1700) and subsequently disabled (HADOOP-5224) due to quality 
> issues. It and sync were re-designed, re-implemented, and shipped in
> 21.0 (HDFS-265). To my knowledge, there has been no real production use. Anecdotally people who worked on branch-20-append have told me they think the new trunk code is substantially less well-tested than the branch-20-append code (at least for sync, append was never well tested). It has certainly gotten way less pounding from HBase users.
> The design however, is much improved, and people think we can get hsync (and append) stabilized in trunk (mostly testing and bug fixing).
>
> Rationale follows..
>
> Append does not seem to be an important requirement, hflush was. There has not been much demand for append, from users or downstream projects. Because Hadoop 1.x does not have a working append implementation (see HDFS-3120, the branch-20-append work was focused on sync not getting append working) which is not enabled by default and downstream projects will want to support Hadoop 1.x releases for years, most will not introduce dependencies on append anyway. This is not to say demand does not exist, just that if it does, it's been much smaller than security, sync, HA, backwards compatible RPC, etc. This probably explains why, over 5 years after the original implementation started, we don't have a stable release with append.
>
> Append introduces non-trivial design and code complexity, which is not worth the cost if we don't have real users. Removing append means we have the property that HDFS blocks, when finalized, are immutable.
> This significantly simplifies the design and code, which significantly simplifies the implementation of other features like snapshots, HDFS-level caching, dedupe, etc.
>
> The vast majority of the HDFS-265 effort is still leveraged w/o append. The new data durability and read consistency behavior was the key part.
>
> GFS, which HDFS' design is based on, has append (and atomic record
> append) so obviously a workable design does not preclude append.
> However we also should not ape the GFS feature set simply because it exists. I've had conversations with people who worked on GFS that regret adding record append (see also http://queue.acm.org/detail.cfm?id=1594206). In short, unless append is a real priority for our users I think we should focus our energy elsewhere.
>
> Thanks,
> Eli
>
>
> The information and any attached documents contained in this message 
> may be confidential and/or legally privileged.  The message is 
> intended solely for the addressee(s).  If you are not the intended 
> recipient, you are hereby notified that any use, dissemination, or 
> reproduction is strictly prohibited and may be unlawful.  If you are 
> not the intended recipient, please contact the sender immediately by 
> return e-mail and destroy all copies of the original message.

Re: [DISCUSS] Remove append?

Posted by Eli Collins <el...@cloudera.com>.
Good point. I thought I'd start with devs first. If you can't get it
past devs there's no reason to go further.

Also, users will tell you they want everything. I'd like to root-cause
this, e.g. if they want append to solve the small-files problem I'd like
to know whether solving the latter means we don't have to do the former.

ps - fwiw the cdh-user@ mailing list has 800 people on it and it's
rarely requested. Ditto in customer conversations. However the user
base continues to grow rapidly and change in makeup so the past isn't
necessarily a good predictor.

Thanks,
Eli


RE: [DISCUSS] Remove append?

Posted by Tim Broberg <Ti...@exar.com>.
No specific advice on this particular issue, but in general, I learned the hard way to stop asking the question, "Feature X is hard to support, is anybody really going to use this?" *Every time* I have asked this question, I get the answer I want to hear. *Every time*, they come back and ask for the feature back later and it's more work than it would have been if I had just planned for it from the beginning.

YMMV, and I'm always asking marketing guys whereas you're asking developers.

Ok, there's one piece of specific advice: Go find the people that will tell you what you don't want to hear. Ask hdfs-user's whether they need the feature rather than hdfs-dev's.

We all have too much empathy for your position here to make you suffer.

    - Tim.


Re: [DISCUSS] Remove append?

Posted by Konstantin Shvachko <sh...@gmail.com>.
Hi Dave,

Your opinion is very much appreciated.

Thanks,
--Konstantin


Re: [DISCUSS] Remove append?

Posted by Mi...@emc.com.
As someone who has worked with hdfs-compatible distributed file systems
that support append, I can vouch for its extensive usage.

I have seen how simple it becomes to create tar archives, and later append
files to them, without writing special inefficient code to do so.
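The tar-archive pattern described above can be sketched with Python's tarfile module on the local filesystem, standing in for an append-capable distributed file system; the file names and helper here are illustrative, not HDFS APIs:

```python
import io
import os
import tarfile
import tempfile

# With append support, new members can be added to an existing archive
# in place, instead of rewriting the whole thing. tarfile's "a" mode
# on local disk stands in for what append would allow on HDFS.

def add_member(archive_path, name, data):
    """Append one file (name, bytes) to a tar archive, creating it if needed."""
    with tarfile.open(archive_path, "a") as tar:  # "a" = append, no rewrite
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

archive = os.path.join(tempfile.mkdtemp(), "cold-data.tar")
add_member(archive, "day1.log", b"first batch")
add_member(archive, "day2.log", b"second batch")  # added later, in place

with tarfile.open(archive) as tar:
    print(tar.getnames())  # ['day1.log', 'day2.log']
```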

I have seen it used in archiving cold data, reducing MR task launch
overhead without having to use a different input format, so that the same
code can be used for both hot and cold data.

In addition, the small-files problem in HDFS forces people to write MR
code, and causes rewrites of large datasets even when only a small amount
of data is added to them.

So, there is clearly a need for it, AFAIK.

+1 on fixing it. Please let me know if you need help.

- milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and
do not necessarily represent the views of any organization, past or
present, the author might be affiliated with.)




RE: [DISCUSS] Remove append?

Posted by Dave Shine <Da...@channelintelligence.com>.
I am not a contributor to this project, so I don't know how much weight my opinion carries.  But I have been hoping to see append become stable soon.  We are constantly dealing with the "small file problem", and I have written M/R jobs to periodically roll up lots of small files into a few large ones.  Having append would prevent me from needing to use up cluster resources performing these tasks.

Therefore, all things being equal I +1 making append work.  However, if the level of complexity is as bad as Eli implies below, then I can understand that perhaps it is not worth the effort. If it will cause too much technical debt, then removing it makes sense.  But don't just remove it because you don't believe there is a need for it.
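The roll-up workaround described above can be sketched as follows; local files and a plain loop stand in for HDFS paths and the M/R job, and all names are illustrative:

```python
import os
import tempfile

# Many small files are periodically concatenated into one large file so
# the namespace stays small. With a working append, new data could go
# straight onto the large file instead of triggering these rewrites.

def roll_up(small_paths, rolled_path):
    """Concatenate many small files into one file, deleting the originals."""
    with open(rolled_path, "ab") as out:
        for p in small_paths:
            with open(p, "rb") as f:
                out.write(f.read())
            os.remove(p)  # the small file is gone once rolled up

tmpdir = tempfile.mkdtemp()
smalls = []
for i in range(3):
    p = os.path.join(tmpdir, f"part-{i}")
    with open(p, "w") as f:
        f.write(f"record {i}\n")
    smalls.append(p)

rolled = os.path.join(tmpdir, "rolled.log")
roll_up(smalls, rolled)
with open(rolled) as f:
    print(f.read(), end="")
```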

Thanks,
Dave Shine



Re: [DISCUSS] Remove append?

Posted by Scott Carey <sc...@richrelevance.com>.

On 3/20/12 5:37 PM, "Eli Collins" <el...@cloudera.com> wrote:

>Append introduces non-trivial design and code complexity, which is not
>worth the cost if we don't have real users. Removing append means we
>have the property that HDFS blocks, when finalized, are immutable.
>This significantly simplifies the design and code, which significantly
>simplifies the implementation of other features like snapshots,
>HDFS-level caching, dedupe, etc.

The above is related to the critical design flaw in HDFS that makes it more
complicated than necessary.

Immutable files on a node can be combined with append with copy-on-write
semantics if the blocks are small enough.  But small blocks are not going
to work with this flaw.

This flaw is the definition of a block.  It is conflated, being two
things at once:
# An immutable segment of data that the file system tracks.
# The segment of data that is contiguous on an individual data node.

The first in any sane file system is a constant length.
The second need not be.  File systems like Ext4 and XFS use extents to map
ranges of blocks to contiguous regions on disk.  Then, they need only
track these extents rather than the fine-grained detail of each block.
The equivalent of a block report is then an extent report.

HDFS does not have extents, and this causes extreme pressure to have large
blocks for two well-known reasons: reduction in filesystem state data,
and larger data batches for Mappers.
With extents, both of these pressures apply to extent sizes instead of
block sizes.  Blocks can be small, extents larger.  Blocks can be
immutable with copy-on-write for appends, truncate, and even random write.
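
The block-vs-extent separation above can be sketched in a few lines; the names and sizes here are purely illustrative, not HDFS internals:

```python
from dataclasses import dataclass

# Blocks stay small, fixed-length, and immutable; an extent records a
# run of consecutive blocks that is contiguous on one node. Tracked
# state then scales with the number of extents, not blocks.

BLOCK_SIZE = 4 * 1024 * 1024  # small fixed-length blocks (4 MB)

@dataclass
class Extent:
    start_block: int   # first block index covered by this extent
    num_blocks: int    # how many consecutive blocks it covers
    node: str          # node holding the contiguous on-disk region

def extent_report(file_length, extent_blocks, node):
    """Describe a file as extents instead of individual blocks."""
    total_blocks = -(-file_length // BLOCK_SIZE)  # ceiling division
    extents = []
    b = 0
    while b < total_blocks:
        n = min(extent_blocks, total_blocks - b)
        extents.append(Extent(b, n, node))
        b += n
    return extents

# A 1 GB file: 256 small blocks, but only 4 extents of 64 blocks each.
ext = extent_report(1024 * 1024 * 1024, 64, "dn1")
print(len(ext), ext[0].num_blocks)  # 4 64
```

An append under this scheme would only touch the last extent (or start a new one), leaving all earlier blocks immutable.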

Others have already implemented the above in other distributed file
systems.  But when mentioned here in the past it seemed to be ignored or
misunderstood:

http://mail-archives.apache.org/mod_mbox/hadoop-general/201110.mbox/%3C1318437111.16477.228.camel@thinkpad%3E
The response to that was disappointing -- the extent concept did not seem
to be comprehended, and none of the good ideas from the links provided got
discussed.


I personally NEED append for some of my work and had been planning on
using it in 0.23.  However I recognize that even more than that I can't
risk losing data for my append use case.  If append is too hard and
complicated to bolt on to HDFS, perhaps a bigger re-think is required so
that such features are not so complicated and a better natural fit to the
design.