Posted to common-dev@hadoop.apache.org by Benjamin Reed <br...@yahoo-inc.com> on 2007/07/09 12:39:50 UTC

Copy on write for HDFS

I need to implement COW for HDFS for a project I'm working on. I vaguely 
remember it being discussed before, but I can't find any threads about 
it. I wanted to at least check for interest/previous work before 
proceeding. Hard links would work for me as well, but they are harder to 
implement. I was thinking of adding the following to the client protocol:

public void cow(String src, String clientName, boolean overwrite, 
LocatedBlocks blocks) throws IOException;

The call would simply create a new file and populate its contents with 
the blocks contained in the LocatedBlocks.

Apart from fast copies, it also allows fast truncations and extensions 
of existing files.
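
To make that concrete, here is a minimal sketch of how a client could build a
fast truncation on top of the proposed call. Apart from cow() itself, every
method and constructor used below (getBlockLocations, getLocatedBlocks,
getBlockSize, the LocatedBlocks constructor) is an assumption about the client
protocol, not the actual interface.

// Hypothetical client-side helper: create dst containing only the first
// block of src, i.e. a fast truncation that copies and moves no data.
void fastTruncateToFirstBlock(ClientProtocol namenode, String src,
                              String dst, String clientName) throws IOException {
    LocatedBlocks all = namenode.getBlockLocations(src, 0, Long.MAX_VALUE); // assumed accessor
    List<LocatedBlock> keep = all.getLocatedBlocks().subList(0, 1);         // keep the first block
    long newLength = keep.get(0).getBlockSize();                            // assumed accessor
    // The new file reuses the existing block; only NameNode metadata changes.
    namenode.cow(dst, clientName, false, new LocatedBlocks(newLength, keep)); // assumed constructor
}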

(This is not a hard link because it is possible that the set of blocks 
may not correspond to any other file.)

Has such a thing been discussed before?

thanx
ben

Re: Copy on write for HDFS

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Hello Ben,

Yes, I am saying that since there are no writes there are never any copies,
which is why you end up with just a grouping of blocks.
Do you want to group blocks of one file only, or of different files?
What is a potential use of that?
Are you planning to implement operations like taking a subset of the blocks
of one file, or intersections and unions of block sets?
If we had hard links, grouping could be performed on one-block files by
creating directories.
This is low level, but it works.

--Konstantin

Benjamin Reed wrote:

>I think you are saying that the copy never happens because there is never a
>write to trigger it. Copy-on-write is probably not the correct term, both
>because there is no write and because, in the API I propose, you don't really
>need to make a copy of another file; you are basically just grouping a set of
>existing blocks into a new file.
>
>The reason I call it copy-on-write is that the data structures are the same:
>a single block can be used by a separate file, and two files may both
>reference one block while another block in one of the files is not referenced
>by the other.
>
>Does anyone know a better name for this kind of semantics?
>
>Creating one-block files doesn't help at all, both because we don't want to
>deal with a bunch of one-block files and because, even if the files are one
>block each, that one block still cannot be referenced by two separate files.
>
>ben
>
>On Monday, 09 July 2007, Konstantin Shvachko wrote:
>  
>
>>Copy on write was discussed in HADOOP-334 in the context of periodic
>>checkpointing of the name-space.
>>Other than that, I remember a discussion about a file clone() operation,
>>which makes a new inode but uses the same blocks as the original file; the
>>blocks are copied once they are modified or appended.
>>That functionality would be possible only if we had at least appends. Since
>>HDFS does not support modifications, the only purpose of COW would be to
>>support grouping of blocks from different (or is it just one?) files.
>>I think it is possible, but very non-POSIX.
>>And you can always create one-block files and group them in directories
>>instead.
>>
>>--Konstantin
>>
>>Benjamin Reed wrote:
>>    
>>
>>>I need to implement COW for HDFS for a project I'm working on. I
>>>vaguely remember it being discussed before, but I can't find any
>>>threads about it. I wanted to at least check for interest/previous
>>>work before proceeding. Hard links would work for me as well, but they
>>>are harder to implement. I was thinking of adding the following to the
>>>client protocol:
>>>
>>>public void cow(String src, String clientName, boolean overwrite,
>>>LocatedBlocks blocks) throws IOException;
>>>
>>>The call would simply create a new file and populate its contents with
>>>the blocks contained in the LocatedBlocks.
>>>
>>>Apart from fast copies, it also allows fast truncations and extensions
>>>of existing files.
>>>
>>>(This is not a hard link because it is possible that the set of blocks
>>>may not correspond to any other file.)
>>>
>>>Has such a thing been discussed before?
>>>
>>>thanx
>>>ben
>>>      
>>>
>
>
>
>  
>


Re: Copy on write for HDFS

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Jul 15, 2007, at 11:08 PM, Dhruba Borthakur wrote:

> I guess what you are saying is that a block can belong to multiple  
> files.

A better name for the feature would be "clone," I think. And yes, it  
would be a file copy that is cheap since it doesn't involve moving  
any data. It only updates structures on the NameNode.

> 1. File deletion: In the current code, when a file is deleted, all  
> blocks
> belonging to that file are scheduled for deletion. This code has to  
> change
> in such a way that a block gets deleted only if it does not belong  
> to *any*
> file.

There would either need to be a ref count on the blocks or a reverse  
mapping of blocks to sets of files. And yes, you can only delete the  
block if the set of files is empty or the ref count goes to 0. A more  
invasive change is that the desired replication of the block is the  
maximum of the replications of the containing files. I assume that
means you would need to store the desired replication on each block
rather than in the file information.
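
A minimal sketch (under stated assumptions) of what that per-block
bookkeeping might look like on the NameNode. BlockRecord is hypothetical, and
INodeFile/getReplication() stand in for whatever per-file structure the
name-space actually keeps:

import java.util.HashSet;
import java.util.Set;

// Hypothetical per-block record tracking which files reference the block.
class BlockRecord {
    private final Set<INodeFile> owners = new HashSet<INodeFile>();

    void addOwner(INodeFile f)    { owners.add(f); }
    void removeOwner(INodeFile f) { owners.remove(f); }

    // A block may be reclaimed only when no file references it.
    boolean isOrphaned() { return owners.isEmpty(); }

    // Desired replication becomes a property of the block: the maximum
    // over all files that still contain it.
    short effectiveReplication() {
        short max = 0;
        for (INodeFile f : owners) {
            max = (short) Math.max(max, f.getReplication());
        }
        return max;
    }
}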

> 2. race between cow() and delete(): The client invokes cow() with a
> set of LocatedBlocks. Since there aren't any client-side locks, by the
> time the NameNode processes the cow() command, the original block(s)
> could have been deleted.

The right interface, in my opinion, is not one where you pass blocks at all,
but one that does the clone at the file level.

void cloneFile(Path source, Path destination) throws IOException

or something. Then the namespace can be locked while the data  
structures are read and modified.
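
As a rough sketch only: one way the NameNode side of such a call could look,
assuming a global namespace lock and hypothetical helpers (getINodeFile,
addINodeFile, blockRecord); none of these names come from the actual code:

// Hypothetical NameNode-side clone: runs entirely under the namespace lock,
// so the source file cannot be deleted while its block list is being copied.
public synchronized void cloneFile(String src, String dst) throws IOException {
    INodeFile srcFile = getINodeFile(src);                // assumed lookup helper
    if (srcFile == null) {
        throw new FileNotFoundException("source does not exist: " + src);
    }
    // New inode, same block list: no data moves, only NameNode metadata changes.
    INodeFile dstFile = addINodeFile(dst, srcFile.getBlocks(),
                                     srcFile.getReplication());
    for (Block b : srcFile.getBlocks()) {
        blockRecord(b).addOwner(dstFile);                 // bump the block's reference
    }
}

blockRecord(b) here would return the kind of per-block owner record sketched
above, so deleting either file later only removes one reference.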

-- Owen

RE: Copy on write for HDFS

Posted by Dhruba Borthakur <dh...@yahoo-inc.com>.
Hi Ben,

I guess what you are saying is that a block can belong to multiple files. If
so, here are some issues of interest:

1. File deletion: In the current code, when a file is deleted, all blocks
belonging to that file are scheduled for deletion. This code has to change
in such a way that a block gets deleted only if it does not belong to *any*
file.
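
For illustration only, a sketch of how that delete path might look, assuming
a hypothetical per-block reference count maintained by the NameNode
(decrementRefCount, scheduleBlockInvalidation and removeFromNamespace are
made-up helper names):

// Hypothetical revised delete path: a block's replicas are invalidated only
// when the last file referencing the block goes away.
void deleteFile(INodeFile file) {
    for (Block b : file.getBlocks()) {
        int remaining = decrementRefCount(b);   // assumed NameNode bookkeeping
        if (remaining == 0) {
            // No other file references this block; safe to schedule deletion
            // of its replicas on the DataNodes.
            scheduleBlockInvalidation(b);
        }
    }
    removeFromNamespace(file);                  // assumed helper
}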

2. race between cow() and delete(): The client invokes cow() with a set of
LocatedBlocks. Since there aren't any client-side locks, by the time the
NameNode processes the cow() command, the original block(s) could have been
deleted.

Thanks,
dhruba

-----Original Message-----
From: Benjamin Reed [mailto:breed@yahoo-inc.com] 
Sent: Saturday, July 14, 2007 3:57 AM
To: hadoop-dev@lucene.apache.org
Subject: Re: Copy on write for HDFS

I think you are saying that the copy never happens because there is never a
write to trigger it. Copy-on-write is probably not the correct term, both
because there is no write and because, in the API I propose, you don't really
need to make a copy of another file; you are basically just grouping a set of
existing blocks into a new file.

The reason I call it copy-on-write is that the data structures are the same:
a single block can be used by a separate file, and two files may both
reference one block while another block in one of the files is not referenced
by the other.

Does anyone know a better name for this kind of semantics?

Creating one-block files doesn't help at all, both because we don't want to
deal with a bunch of one-block files and because, even if the files are one
block each, that one block still cannot be referenced by two separate files.

ben

On Monday, 09 July 2007, Konstantin Shvachko wrote:
> Copy on write was discussed in HADOOP-334 in the context of periodic
> checkpointing of the name-space.
> Other than that, I remember a discussion about a file clone() operation,
> which makes a new inode but uses the same blocks as the original file; the
> blocks are copied once they are modified or appended.
> That functionality would be possible only if we had at least appends. Since
> HDFS does not support modifications, the only purpose of COW would be to
> support grouping of blocks from different (or is it just one?) files.
> I think it is possible, but very non-POSIX.
> And you can always create one-block files and group them in directories
> instead.
>
> --Konstantin
>
> Benjamin Reed wrote:
> > I need to implement COW for HDFS for a project I'm working on. I
> > vaguely remember it being discussed before, but I can't find any
> > threads about it. I wanted to at least check for interest/previous
> > work before proceeding. Hard links would work for me as well, but they
> > are harder to implement. I was thinking of adding the following to the
> > client protocol:
> >
> > public void cow(String src, String clientName, boolean overwrite,
> > LocatedBlocks blocks) throws IOException;
> >
> > The call would simply create a new file and populate its contents with
> > the blocks contained in the LocatedBlocks.
> >
> > Apart from fast copies, it also allows fast truncations and extensions
> > of existing files.
> >
> > (This is not a hard link because it is possible that the set of blocks
> > may not correspond to any other file.)
> >
> > Has such a thing been discussed before?
> >
> > thanx
> > ben




Re: Copy on write for HDFS

Posted by Benjamin Reed <br...@yahoo-inc.com>.
I think you are saying that the copy never happens because there is never a
write to trigger it. Copy-on-write is probably not the correct term, both
because there is no write and because, in the API I propose, you don't really
need to make a copy of another file; you are basically just grouping a set of
existing blocks into a new file.

The reason I call it copy-on-write is that the data structures are the same:
a single block can be used by a separate file, and two files may both
reference one block while another block in one of the files is not referenced
by the other.

Does anyone know a better name for this kind of semantics?

Creating one-block files doesn't help at all, both because we don't want to
deal with a bunch of one-block files and because, even if the files are one
block each, that one block still cannot be referenced by two separate files.

ben

On Monday, 09 July 2007, Konstantin Shvachko wrote:
> Copy on write was discussed in HADOOP-334 in the context of periodic
> checkpointing of the name-space.
> Other than that, I remember a discussion about a file clone() operation,
> which makes a new inode but uses the same blocks as the original file; the
> blocks are copied once they are modified or appended.
> That functionality would be possible only if we had at least appends. Since
> HDFS does not support modifications, the only purpose of COW would be to
> support grouping of blocks from different (or is it just one?) files.
> I think it is possible, but very non-POSIX.
> And you can always create one-block files and group them in directories
> instead.
>
> --Konstantin
>
> Benjamin Reed wrote:
> > I need to implement COW for HDFS for a project I'm working on. I
> > vaguely remember it being discussed before, but I can't find any
> > threads about it. I wanted to at least check for interest/previous
> > work before proceeding. Hard links would work for me as well, but they
> > are harder to implement. I was thinking of adding the following to the
> > client protocol:
> >
> > public void cow(String src, String clientName, boolean overwrite,
> > LocatedBlocks blocks) throws IOException;
> >
> > The call would simply create a new file and populate its contents with
> > the blocks contained in the LocatedBlocks.
> >
> > Apart from fast copies, it also allows fast truncations and extensions
> > of existing files.
> >
> > (This is not a hard link because it is possible that the set of blocks
> > may not correspond to any other file.)
> >
> > Has such a thing been discussed before?
> >
> > thanx
> > ben



Re: Copy on write for HDFS

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Copy on write was discussed in HADOOP-334 in the context of periodic
checkpointing of the name-space.
Other than that, I remember a discussion about a file clone() operation, which
makes a new inode but uses the same blocks as the original file; the blocks
are copied once they are modified or appended.
That functionality would be possible only if we had at least appends. Since
HDFS does not support modifications, the only purpose of COW would be to
support grouping of blocks from different (or is it just one?) files.
I think it is possible, but very non-POSIX.
And you can always create one-block files and group them in directories
instead.

--Konstantin

Benjamin Reed wrote:

> I need to implement COW for HDFS for a project I'm working on. I 
> vaguely remember it being discussed before, but I can't find any 
> threads about it. I wanted to at least check for interest/previous 
> work before proceeding. Hard links would work for me as well, but they 
> are harder to implement. I was thinking of adding the following to the 
> client protocol:
>
> public void cow(String src, String clientName, boolean overwrite, 
> LocatedBlocks blocks) throws IOException;
>
> The call would simply create a new file and populate its contents with 
> the blocks contained in the LocatedBlocks.
>
> Apart from fast copies, it also allows fast truncations and extensions 
> of existing files.
>
> (This is not a hard link because it is possible that the set of blocks 
> may not correspond to any other file.)
>
> Has such a thing been discussed before?
>
> thanx
> ben
>