Posted to common-user@hadoop.apache.org by Alfonso Olias Sanz <al...@gmail.com> on 2008/03/15 01:05:37 UTC

[core-user] Move application to Map/Reduce architecture with Hadoop

Hi

I have just started using Hadoop and HDFS.  I have run the WordCount
test application, which takes some input files, processes them, and
generates an output file.

I have a similar application that has a million input files and has
to produce a million output files; the correspondence between input
and output files is 1:1.

This process looks suitable for a Map/Reduce approach, but I have
several doubts I hope somebody can clear up.

*** What should the Reduce function be? There is no merging of data
across the running map tasks.
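
From the examples it looks like the reduce phase can simply be
switched off, so each map writes its result straight to HDFS. Is
something like the following the right way to do it? (just a rough
sketch assuming the 0.16-era org.apache.hadoop.mapred API; MyJob and
MyMapper are placeholders for our own per-file processing)

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  JobConf conf = new JobConf(MyJob.class);
  conf.setJobName("per-file-processing");
  conf.setMapperClass(MyMapper.class);    // does the 1:1 file transformation
  conf.setNumReduceTasks(0);              // map-only job: no merge/reduce phase at all
  conf.setInputPath(new Path("input"));   // 0.16-style path setters on JobConf
  conf.setOutputPath(new Path("output"));
  JobClient.runJob(conf);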

*** I set up a 2-node cluster for the WordCount test: one node acts
as master+slave, the other as slave. Before launching the job I
copied the input files in with $HADOOP_HOME/bin/hadoop dfs
-copyFromLocal <sourceFiles> <destination>

These files were replicated on both nodes. Is there any way to avoid
the files being replicated to all the nodes and instead have them
distributed across the nodes, with no replication at all?
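
I guess one option is to lower the replication factor to 1, either in
hadoop-site.xml before copying the files or afterwards on the copied
directory, something like this (property name and command are my
guesses from the docs, so please correct me if this is not the
intended way):

  <property>
    <name>dfs.replication</name>
    <value>1</value>   <!-- keep a single copy of each block -->
  </property>

  $HADOOP_HOME/bin/hadoop dfs -setrep -R 1 <destination>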


*** During the WordCount test, 25 map tasks were launched! For our
application that is overkill. We have run several performance tests
without Hadoop and have seen that we can run one application instance
per core. Is there any way to configure the cluster to launch a given
number of tasks per core, so that the number of concurrently running
processes differs between dual-core and quad-core machines?
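
I found the property mapred.tasktracker.map.tasks.maximum, which
seems to control how many map tasks each tasktracker runs at the same
time (default 2). Would setting it per node in hadoop-site.xml, e.g.
on a quad-core machine, be the right approach? (again just my guess
from the configuration docs)

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>   <!-- one concurrent map task per core on a quad-core node -->
  </property>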


Thanks in advance.
alf.