Posted to hdfs-user@hadoop.apache.org by Antonio Barbuzzi <an...@gmail.com> on 2010/07/01 19:13:50 UTC

parallel file copy into hdfs preallocating blocks

Hi all,

I hope my question is not too silly or naive...

From my impression, the throughput when writing to HDFS from a client that
is not part of the cluster is not the maximum achievable (it should be
close to the client's disk read throughput). Maybe this is due to all the
layers involved (??).

So, I wonder whether it would be possible (or worthwhile) to write multiple
blocks of the same file to HDFS concurrently, and thus take advantage of
having multiple datanodes.

As far as I know, this is not (yet) possible.
But whenever you need to copy an existing file into HDFS, the total number
of blocks is known a priori, so if the namenode could preallocate the
blocks, a client could upload multiple blocks at the same time and improve
the upload throughput. Note that I'm not talking about concurrent writes to
the same block, but about concurrent writes to a file, with non-concurrent
access to each individual block.
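
To make the idea a bit more concrete, here is a purely hypothetical sketch
of the kind of client-side support I'm imagining. None of these calls exist
in the current HDFS API; the names and semantics are invented, only to
illustrate the request:

import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.fs.Path;

// Purely hypothetical -- nothing like this exists in the current HDFS API.
// It only sketches the kind of support described above.
public interface PreallocatedFileWriter {

    // Ask the namenode to allocate all blocks of the file up front,
    // given the final length (known a priori when copying a local file).
    void preallocate(Path file, long totalLength) throws IOException;

    // Open an output stream bound to one specific, already-allocated block.
    // Different blocks could then be written concurrently by different
    // threads (or even different client machines), with no two writers
    // ever touching the same block.
    OutputStream writeBlock(Path file, int blockIndex) throws IOException;

    // Called once, after all blocks have been written, to make the file visible.
    void commit(Path file) throws IOException;
}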

Is this approach feasible?
I think a smart client could work around this today by writing multiple
chunk-sized files, but direct support in the API would be useful.
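
For what it's worth, below is a rough, untested sketch of the "smart client"
workaround I mean, using only the existing FileSystem API: the local file is
split into chunk-sized slices and each slice is uploaded by its own thread as
a separate part-NNNNN file under a destination directory. The chunk size,
thread count, and naming scheme are just placeholders; a reader would later
open part-00000, part-00001, ... in order.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: upload a local file to HDFS as several block-sized part files,
// one uploader thread per part, instead of a single sequential stream.
public class ParallelChunkUpload {

    static final long CHUNK_SIZE = 64L * 1024 * 1024; // align with the HDFS block size
    static final int THREADS = 4;                      // number of concurrent uploads

    public static void main(String[] args) throws Exception {
        final String localFile = args[0];       // e.g. a large local file
        final Path destDir = new Path(args[1]); // e.g. an HDFS directory for the parts

        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(destDir.toUri(), conf);
        fs.mkdirs(destDir);

        final long length = new java.io.File(localFile).length();
        final int nChunks = (int) ((length + CHUNK_SIZE - 1) / CHUNK_SIZE);

        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        List<Future<?>> results = new ArrayList<Future<?>>();

        for (int i = 0; i < nChunks; i++) {
            final int chunk = i;
            results.add(pool.submit(new Callable<Void>() {
                public Void call() throws IOException {
                    long offset = chunk * CHUNK_SIZE;
                    long toCopy = Math.min(CHUNK_SIZE, length - offset);
                    Path part = new Path(destDir, String.format("part-%05d", chunk));

                    // Each thread reads its own slice of the local file...
                    RandomAccessFile in = new RandomAccessFile(localFile, "r");
                    in.seek(offset);
                    // ...and writes it through its own HDFS output stream,
                    // i.e. a separate write pipeline to the datanodes.
                    FSDataOutputStream out = fs.create(part, true);
                    try {
                        byte[] buf = new byte[64 * 1024];
                        long copied = 0;
                        while (copied < toCopy) {
                            int n = in.read(buf, 0, (int) Math.min(buf.length, toCopy - copied));
                            if (n < 0) break;
                            out.write(buf, 0, n);
                            copied += n;
                        }
                    } finally {
                        out.close();
                        in.close();
                    }
                    return null;
                }
            }));
        }
        for (Future<?> f : results) f.get(); // propagate any upload failure
        pool.shutdown();
    }
}

Of course this leaves you with a directory of parts rather than a single HDFS
file, which is exactly why direct support in the API would be nicer.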