Posted to user@nutch.apache.org by Doug Cook <na...@candiru.com> on 2006/07/20 08:34:38 UTC

Best performance approach for single MP machine?

Hi,

I've recently switched to 0.8 from 0.7, and after some initial fits and
starts, I'm past the "get it working at all" stage to the "get reasonable
performance" stage.

I've got a single machine with 4 CPUs and a lot of memory. URL fetching
works great because it's (mostly) multithreaded. But as soon as I hit the
reduce phase of fetch, it's dog slow. I'm down to running on one CPU, and
the phase can take days, leaving me vulnerable to losing everything should a
process fail.

Wait! you say. That's just what Hadoop is for! I'm all ears. I'd love some
help getting my configuration right. I've seen examples/tutorials of
configurations for multiple machines; am I just "faking" multiple machines
on my single node (will that work?) or is there a cleaner, simpler approach?

Alternatively, I was excited to get an easy improvement from -numFetchers and
run 4 fetchers simultaneously to use all my CPUs, but it looks like
-numFetchers has gone away. There was a 0.8-version patch, but at a quick
glance it doesn't seem to have made it into the mainline source, and I don't
see the value of merging it in if there's a cleaner Hadoop-based approach.

Many thanks for any help.

Doug
-- 
View this message in context: http://www.nabble.com/Best-performance-approach-for-single-MP-machine--tf1970539.html#a5409596
Sent from the Nutch - User forum at Nabble.com.


Re: Best performance approach for single MP machine?

Posted by Thomas Delnoij <di...@gmail.com>.
Hi Doug,

Is it possible you could post your hadoop-site.xml? I would like to
accomplish the same.

Rgrds. Thomas

On 7/21/06, Doug Cook <na...@candiru.com> wrote:
> [full message quoted; see Doug's reply below]

Re: Best performance approach for single MP machine?

Posted by Doug Cook <na...@candiru.com>.
Thanks, Håvard (and Doug, in the original email).

Those pointers, plus a few other tips from elsewhere, did the trick. I'm now
up and running with all CPUs.

One thing I found along the way was that if I did not set
mapred.child.heap.size, I would run out of heap space in initialization of
inject with even a small URL list. Is this normal? If so, why not have a
reasonable default for heap.size? If this is not normal, is it indicative of
something else I might have misconfigured? 

In any case, I'm running now, just curious (and would like for others to
avoid having to "discover" this).

-Doug
-- 
View this message in context: http://www.nabble.com/Best-performance-approach-for-single-MP-machine--tf1970539.html#a5430453
Sent from the Nutch - User forum at Nabble.com.
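
For anyone hitting the same out-of-memory error during inject: the property
Doug mentions controls the heap of each spawned map/reduce child JVM, and it
goes in hadoop-site.xml. A minimal sketch follows; the 512m value is only an
illustrative guess, not a recommendation from this thread, so size it to your
own machine:

```xml
<!-- Assumed fragment for hadoop-site.xml; 512m is a placeholder value. -->
<property>
  <!-- Heap size for each child JVM spawned for map/reduce tasks. The
       early-Hadoop default is small enough that even inject on a small
       URL list can exhaust it, as described above. -->
  <name>mapred.child.heap.size</name>
  <value>512m</value>
</property>
```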


Re: Best performance approach for single MP machine?

Posted by "Håvard W. Kongsgård" <h....@niap.no>.
 
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02394.html

"
Teruhiko Kurosaka wrote:

    Can I use MapReduce to run Nutch on a multi CPU system?
      

Yes.


    I want to run the index job on two (or four) CPUs
    on a single system.  I'm not trying to distribute the job
    over multiple systems.

    If the MapReduce is the way to go,
    do I just specify config parameters like these:
    mapred.tasktracker.tasks.maximum=2
    mapred.job.tracker=localhost:9001
    mapred.reduce.tasks=2 (or 1?)

    and
    bin/start-all.sh

    ?
      

That should work. You'd probably want to set the default number of map 
tasks to be a multiple of the number of CPUs, and the number of reduce 
tasks to be exactly the number of cpus.

Don't use start-all.sh, but rather just:

bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker


    Must I use NDFS for MapReduce?
      

No.

Doug

"





Doug Cook wrote:
> [original message quoted in full; see the start of this thread]