Posted to user@nutch.apache.org by Sourajit Basak <so...@gmail.com> on 2012/12/06 08:11:45 UTC

Nutch distributed on IBM BladeCenter

We are running Nutch distributed on an IBM BladeCenter setup. Each blade is
2P8C (2 processors, 8 cores each) with 4 GB RAM per core.

The Nutch Hadoop jobs will do OCR (via a plugged-in custom parser) and hence
will be memory intensive. The jobs also have a high initialization time. I am
wondering if anyone can suggest which Hadoop parameters we should tune to
utilize the blades to their fullest.

I understand that arriving at an optimized solution is subject to trials.
To start off, we have zeroed in on these params (a config sketch follows
the list).

1. mapred.tasktracker.map|reduce.tasks.maximum (see
http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum)
equal to or a multiple of the total cores per blade/node = 16n (do we leave
aside more room so as not to throttle system procs?)
2. mapred.map|reduce.child.java.opts (per my understanding, Nutch jobs do
not spawn any child JVM?) How do we specify how much memory any new job may
use?
3. mapred.job.reuse.jvm.num.tasks? Does this mean that our custom parser
will be initialized only once? If so, we need to handle parser failures and
take appropriate precautions.
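
For reference, here is a sketch of how we might set these in mapred-site.xml
(classic Hadoop 1.x property names; the slot counts and values below are
placeholders for a 16-core blade, pending trials, not recommendations):

<!-- mapred-site.xml: hypothetical starting values for a 16-core blade -->
<configuration>
  <!-- 10 map + 4 reduce slots leaves ~2 cores for the tasktracker,
       datanode and OS -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- -1 = reuse the task JVM indefinitely, so a slow parser start-up
       is paid once per JVM rather than once per task -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>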

Best,
Sourajit

Re: Nutch distributed on IBM BladeCenter

Posted by Julien Nioche <li...@gmail.com>.
Hi Sourajit

On 6 December 2012 07:11, Sourajit Basak <so...@gmail.com> wrote:

> We are running Nutch distributed on an IBM BladeCenter setup. Each blade is
> 2P8C (2 processors, 8 cores each) with 4 GB RAM per core.
>
> The Nutch Hadoop jobs will do OCR (via a plugged-in custom parser) and
> hence will be memory intensive. The jobs also have a high initialization
> time. I am wondering if anyone can suggest which Hadoop parameters we
> should tune to utilize the blades to their fullest.
>
> I understand that arriving at an optimized solution is subject to trials.
> To start off, we have zeroed in on these params.
>

See http://svn.apache.org/viewvc/nutch/trunk/src/bin/crawl?view=markup for
a starting point on Hadoop params


>
> 1. mapred.tasktracker.map|reduce.tasks.maximum (see
> http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum)
> equal to or a multiple of the total cores per blade/node = 16n (do we leave
> aside more room so as not to throttle system procs?)
>

Yes - you'll have at least the tasktracker and datanode running on each
machine, as well as the system itself, so you should leave them a bit of CPU
(e.g. on a 16-core blade, 10 map + 4 reduce slots leaves a couple of cores
spare).


> 2. mapred.map|reduce.child.java.opts (per my understanding, Nutch jobs do
> not spawn any child JVM?) How do we specify how much memory any new job may
> use?
>

Well, any Hadoop-based application will have the mappers and reducers
running as separate JVMs. The options above determine how much RAM the
mappers and reducers are allowed. See also mapred.child.java.opts, which
applies to both unless overridden.

The main memory hog is typically the parsing step, even more so if you do
OCR, I expect.
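
For example, the per-task heaps could be set like this (a sketch, not a
recommendation: the -Xmx values are assumptions, and the total has to fit
the slot count, e.g. 10 maps x 4 GB + 4 reduces x 2 GB stays under the
64 GB per blade):

<property>
  <name>mapred.map.child.java.opts</name>
  <!-- parsing/OCR runs in the mappers, so they get the bigger heap -->
  <value>-Xmx4096m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>

These can also be overridden per job with -Dproperty=value on the command
line wherever the job is launched through ToolRunner.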



> 3. mapred.job.reuse.jvm.num.tasks? Does this mean that our custom parser
> will be initialized only once? If so, we need to handle parser failures and
> take appropriate precautions.
>

If I recall correctly, the mapper or reducer instances will be reused but
reinitialised. A bit of experimentation will tell you, but you've got the
idea here, and it is the right approach if the initialisation is slow.
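
One way to make the slow initialisation survive reuse regardless of whether
the instances are reinitialised is to hold the OCR engine in a static field
of the parser, so it lives for the whole task JVM. A hypothetical sketch
(OcrEngine is a stand-in for whatever library your parser wraps, not a real
Nutch or OCR API):

// Sketch: cache the expensive OCR engine once per task JVM, so that
// mapred.job.reuse.jvm.num.tasks > 1 amortises the start-up cost.
public class OcrParser {

    // Stand-in for the real OCR library; imagine slow model loading here.
    static class OcrEngine {
        OcrEngine() { /* expensive initialisation */ }
        String recognize(byte[] image) { return ""; }
    }

    // One engine per JVM; volatile + double-checked locking keeps the
    // lazy initialisation thread-safe.
    private static volatile OcrEngine engine;

    private static OcrEngine getEngine() {
        if (engine == null) {
            synchronized (OcrParser.class) {
                if (engine == null) {
                    engine = new OcrEngine();
                }
            }
        }
        return engine;
    }

    public String parse(byte[] image) {
        try {
            return getEngine().recognize(image);
        } catch (Exception e) {
            // a failure on one document must not poison the cached engine
            // or kill the task: swallow it and return empty text
            return "";
        }
    }
}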

HTH

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: upgrade nutch 1.4 to 2.x

Posted by kiran chitturi <ch...@gmail.com>.
I am not sure how the migration works. There have been changes in the
architecture from 1.4 to 2.x.

On Thu, Dec 6, 2012 at 1:32 PM, kaveh minooie <ka...@plutoz.com> wrote:

> Hi everybody,
>
> I was wondering if it is possible to upgrade (by that I mean the crawldb
> and linkdb) from 1.4 to a 2.x version? Has anyone done this?
>
> --
> Kaveh Minooie
>
> www.plutoz.com
>



-- 
Kiran Chitturi

upgrade nutch 1.4 to 2.x

Posted by kaveh minooie <ka...@plutoz.com>.
Hi everybody,

I was wondering if it is possible to upgrade (by that I mean the crawldb
and linkdb) from 1.4 to a 2.x version? Has anyone done this?

-- 
Kaveh Minooie

www.plutoz.com