Posted to common-user@hadoop.apache.org by Eugeny N Dzhurinsky <bo...@redwerk.com> on 2008/03/24 10:42:30 UTC

HDFS and Map/Reduce question

Hello there!

I would like to know whether it is possible to do the following things with Hadoop:

1) I need certain directories to be fully replicated between several hosts and to
remain in a consistent state - that is, these directories and all of their content
will reside on several hosts, and each of these hosts will have the complete
content of these directories. It would be great if it were possible to tell Hadoop
which hosts to use for the replication.
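
(For reference, the closest thing I found so far is raising the per-file
replication factor, roughly as in the sketch below; what I could not find is a way
to pin the replicas to particular hosts. The directory name is just made up for
illustration.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical directory that should be "fully" replicated.
        Path dir = new Path("/indexes/university-a");

        // Raise the replication factor of every file under the directory.
        // This only controls *how many* replicas exist, not *which* hosts get them.
        for (FileStatus status : fs.listStatus(dir)) {
            if (!status.isDir()) {
                fs.setReplication(status.getPath(), (short) 5);
            }
        }
    }
}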

2) I need to find out which hosts have certain file(s) available locally in their
HDFS filesystem.
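
(Again for reference, the block-location API below looks like it reports which
datanodes hold the blocks of a given file; I am not sure whether this is the
intended way to answer that question, and the path is invented for the example.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhoHasFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical file whose physical location we want to know.
        Path file = new Path("/indexes/university-a/part-00000");
        FileStatus status = fs.getFileStatus(file);

        // Ask the namenode which datanodes hold each block of the file.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            for (String host : block.getHosts()) {
                System.out.println(block.getOffset() + " -> " + host);
            }
        }
    }
}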

3) I need a way to send the same map job to several hosts (a broadcast job) and
combine the results of these mapping tasks into a single file in the reduce phase.

I need this to implement the task described below:

Imagine there is a Hadoop cluster, and some mapping tasks have produced a lot of
files in HDFS (for example, indexing all the knowledge bases in several
universities). Lucene is used for the indexing, and each index resides on the same
host as the data it indexes. To search the data, the search query is broadcast to
several hosts, each host queries its local index and returns results, and the
results are combined in the reduce part. These results include information about
which host holds the actual data source.

So we need to know which host holds which files locally in HDFS, broadcast the
same search job to all hosts in the cluster, and make sure we can replicate the
indexes for the data across several hosts, so that each of those hosts has a
complete index.
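
Roughly, the kind of job I have in mind looks like the sketch below (old mapred
API), where every map task runs the same query against whatever index is local to
its host and a single reducer collects the hits into one output file. The
searchLocalIndex() helper is a made-up placeholder standing in for the actual
Lucene lookup against the host-local index.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BroadcastSearch {

    // Placeholder for the real Lucene query against the host-local index.
    private static String[] searchLocalIndex(String query) {
        return new String[] { "dummy-hit-for:" + query };
    }

    // Each map task receives the query (one line of input) and runs it
    // against the index that lives on its local host.
    public static class SearchMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text query,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String host = java.net.InetAddress.getLocalHost().getHostName();
            for (String hit : searchLocalIndex(query.toString())) {
                out.collect(query, new Text(host + "\t" + hit));
            }
        }
    }

    // A single reducer concatenates the per-host hits into one result file.
    public static class MergeReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text query, Iterator<Text> hits,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            while (hits.hasNext()) {
                out.collect(query, hits.next());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BroadcastSearch.class);
        conf.setJobName("broadcast-search");
        conf.setMapperClass(SearchMapper.class);
        conf.setReducerClass(MergeReducer.class);
        conf.setNumReduceTasks(1);          // combine everything into one file
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // file holding the query
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

What I do not see is how to make sure such a map task is actually started on every
host that holds a piece of the index, which is why I am asking about points 1)-3)
above.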

-- 
Eugene N Dzhurinsky