Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/08/22 18:10:05 UTC

Extracted Data Manipulation - org.apache.nutch.io, MapRed?

Hello,

I am going to manipulate extracted text that is presented as an array of
strings, and I need some advice. I need to retrieve the strings, store them
(some strings can be repeated several times in a file), sort them, calculate
statistics, store a sorted subset in another file, etc.
Which class is best suited for this?
ArrayFile
MapFile
SequenceFile - I can sort by LongWritable, but my attempt to sort by String
was unsuccessful
SetFile
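
For illustration only, the processing described above (count repeated
strings, sort them, keep a sorted subset) can be sketched in memory with
plain java.util collections. This is a minimal sketch under stated
assumptions: the class StringStats and its methods are hypothetical names, it
does not use any org.apache.nutch.io class, and it only works for data that
fits in memory on one machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: in-memory count, sort, and subset.
public class StringStats {

    /** Counts occurrences of each string; TreeMap keeps keys in lexicographic order. */
    public static TreeMap<String, Integer> countSorted(List<String> extracted) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String s : extracted) {
            counts.merge(s, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    /** Returns, in sorted order, the strings that occur at least minCount times. */
    public static List<String> frequentSubset(TreeMap<String, Integer> counts, int minCount) {
        List<String> subset = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                subset.add(e.getKey());
            }
        }
        return subset;
    }

    public static void main(String[] args) {
        List<String> extracted =
            Arrays.asList("nutch", "lucene", "nutch", "mapred", "lucene", "nutch");
        TreeMap<String, Integer> counts = countSorted(extracted);
        System.out.println(counts);                    // {lucene=2, mapred=1, nutch=3}
        System.out.println(frequentSubset(counts, 2)); // [lucene, nutch]
    }
}
```

The question is which of the file classes listed above would do the same
job on disk, for data too large for a single in-memory map.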

What is MapReduce? Could you please provide an overview?

I can't use Lucene because I don't want to analyze or tokenize the strings.
Also, I don't want to reinvent the wheel, especially for a distributed NFS.
I would rather use the power that is already there.

Thanks,
Fuad