Posted to hdfs-user@hadoop.apache.org by Antonio Barbuzzi <an...@gmail.com> on 2010/07/01 19:13:50 UTC
parallel file copy into hdfs preallocating blocks
Hi all,
I hope my question is not too silly or naive.
From what I have observed, the write throughput from a client that is not
part of the cluster to HDFS is below the maximum achievable (it should be
close to the client's disk read throughput). Perhaps this is due to all
the intermediate layers, but I am not sure.
So I wonder whether it would be possible (or worthwhile) to write multiple
blocks of the same file to HDFS concurrently, and thus take advantage of
having multiple datanodes.
As far as I know, this is not possible yet.
However, whenever you copy an existing file into HDFS, the total number of
blocks is known a priori. So, if the namenode could preallocate the blocks,
a client could upload multiple blocks at the same time, improving upload
throughput. Note that I am not talking about concurrent writes to the same
block, but about concurrent writes to one file with non-concurrent access
to each individual block.
Is this approach feasible?
I think a smart client could already emulate this by writing multiple
chunk-sized files, but direct support in the API would be useful.
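To illustrate the chunk-based workaround, here is a minimal sketch of such a "smart client": it splits a source file into fixed-size chunks and copies each chunk concurrently into its own part file, one task per chunk, so no two tasks ever touch the same byte range. This is only an assumption of how such a client might look; for brevity it "uploads" to a local directory with java.nio, whereas a real client would open each part file on HDFS via the Hadoop FileSystem API instead. The class and method names (ParallelChunkCopy, copyInChunks) are hypothetical.

```java
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelChunkCopy {

    // Copies src into dstDir as part-00000, part-00001, ..., each at most
    // chunkSize bytes, using one concurrent task per chunk.
    // Returns the number of chunk files written.
    public static int copyInChunks(Path src, Path dstDir, long chunkSize)
            throws Exception {
        long total = Files.size(src);
        int nChunks = (int) ((total + chunkSize - 1) / chunkSize);
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, Math.min(nChunks, 8)));
        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 0; i < nChunks; i++) {
            final long offset = (long) i * chunkSize;
            final long len = Math.min(chunkSize, total - offset);
            final Path dst = dstDir.resolve(String.format("part-%05d", i));
            tasks.add(pool.submit(() -> {
                // Each task reads only its own byte range of the source,
                // so the chunks are written with no concurrent access to
                // the same block of data.
                try (FileChannel in = FileChannel.open(src,
                             StandardOpenOption.READ);
                     FileChannel out = FileChannel.open(dst,
                             StandardOpenOption.CREATE,
                             StandardOpenOption.WRITE)) {
                    long done = 0;
                    while (done < len) {
                        done += in.transferTo(offset + done, len - done, out);
                    }
                }
                return null;
            }));
        }
        for (Future<?> t : tasks) {
            t.get(); // propagate any I/O error from the worker tasks
        }
        pool.shutdown();
        return nChunks;
    }
}
```

The downside, of course, is that the result is a directory of part files rather than a single HDFS file, which is exactly why namenode-side block preallocation would be the cleaner solution.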