Posted to common-user@hadoop.apache.org by Samuel Guo <gu...@gmail.com> on 2008/11/16 10:27:00 UTC

evaluate the size of the input & split them in parallel

Hi all,

When I use Hadoop to run Map/Reduce jobs over a large dataset (many
thousands of large input files), the client takes a rather long time to
initialize the job before it actually starts running. I suspect it gets
stuck fetching the metadata of those thousands of files from the NameNode
and computing their splits.
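
If I read the code correctly, the per-file work on the client side looks
roughly like the sketch below (a simplification of what
FileInputFormat.getSplits does; the real code also honors min/max split
sizes, isSplitable checks, and so on):

// Simplified sketch (my reading) of the serial per-file work the client
// does when computing splits.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.FileSplit;

public class SerialSplitSketch {
  public static List<FileSplit> computeSplits(Configuration conf, Path[] inputs)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (Path p : inputs) {
      FileStatus status = fs.getFileStatus(p);                  // one metadata call per file
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen()); // plus one for block locations
      for (BlockLocation b : blocks) {
        splits.add(new FileSplit(p, b.getOffset(), b.getLength(), b.getHosts()));
      }
    }
    return splits;  // with thousands of files this serial loop dominates job setup
  }
}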

Is there any way to evaluate the size of the input and construct the split
information in parallel? Could we run a lightweight map/reduce job to build
the split information before the main job is initialized? For jobs with many
thousands of input files, I think computing the splits in parallel would be
worthwhile.
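
For example, something along these lines might already help on the client
side, without a separate map/reduce job (an untested sketch; the class and
method names are mine, and I am assuming the shared FileSystem handle
tolerates concurrent metadata calls):

// Untested sketch: fan the per-file metadata calls out over a client-side
// thread pool instead of a separate map/reduce job.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.FileSplit;

public class ParallelSplitSketch {
  public static List<FileSplit> computeSplits(Configuration conf, Path[] inputs,
      int threads) throws IOException, InterruptedException, ExecutionException {
    final FileSystem fs = FileSystem.get(conf);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<List<FileSplit>>> futures = new ArrayList<Future<List<FileSplit>>>();
    for (final Path p : inputs) {
      futures.add(pool.submit(new Callable<List<FileSplit>>() {
        public List<FileSplit> call() throws IOException {
          FileStatus status = fs.getFileStatus(p);
          BlockLocation[] blocks =
              fs.getFileBlockLocations(status, 0, status.getLen());
          List<FileSplit> result = new ArrayList<FileSplit>();
          for (BlockLocation b : blocks) {
            result.add(new FileSplit(p, b.getOffset(), b.getLength(), b.getHosts()));
          }
          return result;
        }
      }));
    }
    List<FileSplit> splits = new ArrayList<FileSplit>();
    for (Future<List<FileSplit>> f : futures) {
      splits.addAll(f.get());   // metadata RPCs now overlap instead of running one by one
    }
    pool.shutdown();
    return splits;
  }
}

I guess a small map/reduce job would scale further still, but a thread pool
on the client avoids the overhead of launching an extra job.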

Hoping for a reply.

Regards,

Samuel