Posted to general@hadoop.apache.org by rahul rai <ra...@yahoo.in> on 2015/05/03 20:56:22 UTC

MapReduce optimization

Hi,
Can somebody help with preparing our MapReduce settings?
We recently set up a 10-node Hadoop cluster and have 10 TB of data, mostly small XML files in zipped form. This is our initial POC.

I would like to know a few things about the steps to be followed:

1. As per my understanding, we upload these 10 TB of data to the NameNode. (A sketch of how I plan to do the upload follows after this list.)
2. Since the files are very small (100 KB XML files on average), how should we combine them? Can we combine them into files of roughly 1 TB each, so that we end up with 10 files? (The driver sketch after this list shows what I had in mind.)
3. We set the block size to 128 MB.
4. Should we put these 10 files of 1 TB each into 10 separate folders (one file per folder), or all 10 files into just one folder?
5. Regarding 4 above: if it is only one folder holding the 10 files of 1 TB, we run the MapReduce over that. Is it better to run 10 MapReduce jobs, one per folder in the 10-folder case, or just one MapReduce job over the single folder?
6. If we run one MapReduce job over one folder holding the 10 files of 1 TB each, the number of map tasks I calculate is 10 * 1024 * 1024 MB / 128 MB = 81,920 mappers. Can the system sustain that many mappers?
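
For item 1, here is a minimal sketch of how I was planning to do the upload with the Hadoop FileSystem Java API (the same as "hdfs dfs -put" on the command line). Both paths are placeholders I made up, and I have not run this against our cluster yet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local directory of zipped XML files into HDFS.
        // Both paths are placeholders for illustration only.
        fs.copyFromLocalFile(new Path("/data/local/xml-batch"),
                             new Path("/user/poc/xml-input"));

        fs.close();
    }
}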
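
For items 2 and 6, here is a rough driver sketch of what I had in mind, assuming CombineTextInputFormat can pack many of our small files into each split; please treat it as an untested sketch, not something we have validated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlPocDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "xml-poc");
        job.setJarByClass(XmlPocDriver.class);

        // One split per file would mean one mapper per 100 KB file;
        // CombineTextInputFormat packs many small files into each split.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap combined splits at 128 MB to match the block size, giving
        // roughly 10 TB / 128 MB = 81,920 map tasks over the whole data set.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/reducer classes are omitted here; without them Hadoop
        // runs the identity Mapper and Reducer as a pass-through.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If that works, it seems we would not even have to pre-merge the data into 10 files of 1 TB first, since the split size rather than the file count would decide the number of mappers. Does that reasoning hold?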
Thanks,
rai