Posted to user@nutch.apache.org by Frank Scholten <fr...@frankscholten.nl> on 2012/03/01 11:08:13 UTC
Distributed Indexing on MapReduce
Hi all,
I am looking into reusing some existing code for distributed indexing
in order to test a Mahout tool I am working on:
https://issues.apache.org/jira/browse/MAHOUT-944
What I want is to index the Apache Public Mail Archives dataset (200G)
via MapReduce on Hadoop.
I have been going through the Nutch and contrib/index code and from my
understanding I have to:
* Create an InputFormat / RecordReader / InputSplit class for
splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from the local filesystem to HDFS. In the same Reducer?
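The steps above could be sketched roughly as follows. This is only an illustrative skeleton with made-up class and field names (MailIndexJob, MailMapper, IndexReducer, "local-shard"), not code from Nutch or contrib/index; it assumes the e-mails arrive at the mapper as <messageId, rawMail> pairs (i.e. the custom InputFormat has already done its job) and that Lucene builds one index shard per reducer on the local disk:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MailIndexJob {

  // Emits each e-mail keyed by its message id; the partitioner then
  // decides which reducer (and therefore which shard) it lands in.
  public static class MailMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text messageId, Text rawMail, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(messageId, rawMail);
    }
  }

  // Builds one Lucene index per reducer on the local filesystem.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
      FSDirectory dir = FSDirectory.open(new java.io.File("local-shard"));
      writer = new IndexWriter(dir, new IndexWriterConfig(
          Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
    }

    @Override
    protected void reduce(Text messageId, Iterable<Text> mails, Context ctx)
        throws IOException {
      for (Text mail : mails) {
        Document doc = new Document();
        doc.add(new Field("id", messageId.toString(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", mail.toString(),
            Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
      writer.close(); // the shard is now complete on local disk
    }
  }
}
```

After cleanup() the shard still sits on the reducer's local disk, which is exactly the "final steps" question below.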
I am unsure about the final steps: how to get to the end result, a
bunch of index shards on HDFS. It seems that each Reducer needs to be
aware of the directory it eventually writes to on HDFS, but I don't
see how to get each reducer to copy its shard there.
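One way to give each reducer its own HDFS directory (sketched here with hypothetical names, not a known Nutch convention) is to derive the shard path from the reducer's partition number, which is unique per reducer and available via context.getTaskAttemptID().getTaskID().getId(). The path construction itself is trivial:

```java
public class ShardPaths {
  // Builds the HDFS target directory for one reducer's shard from the
  // job's output directory and the reducer's partition number, e.g.
  // partition 3 maps to "<outputDir>/shard-00003". Each reducer has a
  // unique partition, so the directories never collide.
  public static String shardDir(String outputDir, int partition) {
    return String.format("%s/shard-%05d", outputDir, partition);
  }

  public static void main(String[] args) {
    System.out.println(shardDir("/indexes/mail", 3));
  }
}
```

In the reducer's cleanup(), after closing the IndexWriter, the local index can then be pushed up in the same task with FileSystem.get(conf).copyFromLocalFile(new Path(localDir), new Path(shardDir)), so no separate copy job is needed.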
How do I set this up?
Cheers,
Frank