Posted to user@nutch.apache.org by Magnús Skúlason <ma...@gmail.com> on 2012/03/14 11:38:19 UTC

Configuring nutch to run on hadoop

Hi,

I have set up Nutch to run on a pseudo-distributed Hadoop cluster (i.e.
all running on one machine).

Everything works fine, except that only two map and reduce tasks run
at the same time. The machine I am running on has 4 quad-core CPUs,
so I should benefit from running more tasks at a time.

In HADOOP_HOME/conf/mapred-site.xml I have set the following options:
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
    <property>
        <name>mapred.job.tracker.map.tasks.maximum</name>
        <value>4</value>
    </property>
    <property>
        <name>mapred.job.tracker.reduce.tasks.maximum</name>
        <value>4</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
    </property>
    <property>
        <name>mapred.map.tasks</name>
        <value>4</value>
    </property>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>4</value>
    </property>
</configuration>

I have restarted Hadoop, but it still only spawns two map and
reduce tasks at a time. Are there other parameters I should be
setting? Should I maybe omit the .tasks.maximum parameters?

Another question, regarding this setting:
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
How many fetcher threads should I be able to run with this memory
setting? I have noticed that with a high fetcher thread setting
(fetcher.threads.fetch in nutch-site.xml) I get out-of-memory errors
in the fetching step. Is there a rule of thumb for the number of
threads per 100 MB of memory?
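
For reference, a minimal sketch of how that thread count is set in nutch-site.xml. The value 10 is only a placeholder (assumed here to match the stock nutch-default.xml), not a recommendation for a 512 MB heap. The fetcher threads of one fetch task all run inside a single child JVM, so they share the heap given by mapred.child.java.opts:

```xml
<configuration>
  <!-- Fetcher threads per fetch task; 10 is a placeholder, not a tuned value -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
</configuration>
```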

best regards,
Magnus

Re: Configuring nutch to run on hadoop

Posted by Magnús Skúlason <ma...@gmail.com>.

Thanks,

This solved the problem :) I must have picked up the other parameter
names from outdated documentation, or simply misspelled them.

best regards,
Magnus

Re: Configuring nutch to run on hadoop

Posted by Rafael Pappert <rp...@fwpsystems.com>.

Oh, Ferdy wrote that already... sorry!



Re: Configuring nutch to run on hadoop

Posted by Rafael Pappert <rp...@fwpsystems.com>.

Hello,

The correct property names are:

mapred.tasktracker.reduce.tasks.maximum
mapred.tasktracker.map.tasks.maximum

cheers,
Rafael.



Re: Configuring nutch to run on hadoop

Posted by Ferdy Galema <fe...@kalooga.com>.

Same goes for "mapred.job.tracker.reduce.tasks.maximum". It does not exist.

Re: Configuring nutch to run on hadoop

Posted by Ferdy Galema <fe...@kalooga.com>.

Hi,

mapred.map.tasks and mapred.reduce.tasks define the total number of tasks.

mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum define the maximum number of
tasks running at the same time, per tasktracker.

It seems you made a mistake with the property
"mapred.job.tracker.map.tasks.maximum"; it does not seem to exist.
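
As a sketch only, the mapred-site.xml from the original post with the two misnamed properties corrected might look like this. It assumes a Hadoop version that still uses the classic JobTracker/TaskTracker property names, where both limits default to 2 when unset (which would match the two-task behaviour described). The value 4 is carried over from the original post rather than tuned, and the tasktracker needs a restart to pick up the change:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Max map tasks running concurrently on this tasktracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Max reduce tasks running concurrently on this tasktracker -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Heap size for each child task JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <!-- Total number of map/reduce tasks per job (a hint, not a concurrency limit) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
</configuration>
```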

Ferdy
