Posted to dev@nutch.apache.org by Andrew McNabb <am...@mcnabbs.org> on 2005/11/08 20:16:03 UTC

mapreduce with large amounts of data

I'm starting some work using Nutch's MapReduce for parallel computation
unrelated to web indexing.  Over the last few days I've been getting
familiar with how the implementation works, and I've been very
impressed.

I ran some tests using the Grep demo to get a feel for how it handles
large files.  Everything worked perfectly for medium-sized files, but I
ran into trouble with a 400MB input file.  I started the job on two
machines and watched it make progress up to about 10% or 15%.  When I
came back half an hour later, the reported progress was around -700%,
and several Java processes were bogging down both machines.
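
For reference, the job is essentially the stock Grep demo.  Conceptually
the map side looks something like the sketch below -- I'm writing this
from memory, so the package names and the "grep.pattern" config key are
approximations rather than the demo's actual code:

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.nutch.io.LongWritable;
    import org.apache.nutch.io.UTF8;
    import org.apache.nutch.io.Writable;
    import org.apache.nutch.io.WritableComparable;
    import org.apache.nutch.mapred.JobConf;
    import org.apache.nutch.mapred.Mapper;
    import org.apache.nutch.mapred.OutputCollector;
    import org.apache.nutch.mapred.Reporter;

    /**
     * Rough sketch of a grep-style mapper: emit each regex match in
     * a line with a count of one, so a summing reducer can total the
     * matches.  The "grep.pattern" key is illustrative only.
     */
    public class GrepMapper implements Mapper {
      private Pattern pattern;

      public void configure(JobConf job) {
        // Compile the search pattern once per task, not per record.
        pattern = Pattern.compile(job.get("grep.pattern"));
      }

      public void map(WritableComparable key, Writable value,
                      OutputCollector output, Reporter reporter)
          throws IOException {
        // Each input value is one line of text; emit (match, 1)
        // for every occurrence of the pattern in that line.
        Matcher matcher = pattern.matcher(value.toString());
        while (matcher.find()) {
          output.collect(new UTF8(matcher.group()),
                         new LongWritable(1));
        }
      }

      public void close() { }
    }

Nothing in the map function itself should care about the total input
size, which is part of why the negative progress numbers surprised me.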

I'm happy to help track down and fix the problem.  Is this a known
issue?  Can anyone recommend the best approach to tracking down the
bug?

Thanks for everything--this is a great project.

-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868