Posted to user@nutch.apache.org by Magnús Skúlason <ma...@gmail.com> on 2012/03/14 11:38:19 UTC
Configuring nutch to run on hadoop
Hi,
I have set up Nutch to run on a pseudo-distributed Hadoop cluster (i.e.
all running on one machine).
Everything works fine, except that only two map and reduce tasks run
at the same time. The machine I am running on has 4 quad-core CPUs,
so I should benefit from running more tasks at a time.
In HADOOP_HOME/conf/mapred-site.xml I have set the following options:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapred.job.tracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.job.tracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>4</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>4</value>
</property>
</configuration>
I have restarted Hadoop, but it still only spawns two map and
reduce tasks at a time. Are there other parameters I should be
setting? Should I maybe omit the .tasks.maximum parameters?
Another question regarding the setting:
<name>mapred.child.java.opts</name>
<value>-Xmx512m</value>
How many fetcher threads should I be able to run with this memory
setting? I have noticed that with a high fetcher thread setting
(fetcher.threads.fetch in nutch-site.xml) I get out-of-memory errors
in the fetching step. Is there any rule of thumb for the number of
threads per 100 MB of memory?
best regards,
Magnus
Re: Configuring nutch to run on hadoop
Posted by Magnús Skúlason <ma...@gmail.com>.
Thanks,
This solved the problem :) I must have picked up the other parameter
names from outdated documentation or simply misspelled them.
best regards,
Magnus
On Thu, Mar 15, 2012 at 10:05 AM, Rafael Pappert <rp...@fwpsystems.com> wrote:
> Oh, Ferdy wrote that already... sorry!
Re: Configuring nutch to run on hadoop
Posted by Rafael Pappert <rp...@fwpsystems.com>.
Oh, Ferdy wrote that already... sorry!
On 15/Mar/ 2012, at 11:01 , Rafael Pappert wrote:
> Hello,
>
> The correct property names are:
>
> mapred.tasktracker.reduce.tasks.maximum
> mapred.tasktracker.map.tasks.maximum
>
> cheers,
> Rafael.
Re: Configuring nutch to run on hadoop
Posted by Rafael Pappert <rp...@fwpsystems.com>.
Hello,
The correct property names are:
mapred.tasktracker.reduce.tasks.maximum
mapred.tasktracker.map.tasks.maximum
cheers,
Rafael.
On 15/Mar/ 2012, at 08:36 , Ferdy Galema wrote:
> Same goes for "mapred.job.tracker.reduce.tasks.maximum". It does not exist.
Re: Configuring nutch to run on hadoop
Posted by Ferdy Galema <fe...@kalooga.com>.
Same goes for "mapred.job.tracker.reduce.tasks.maximum". It does not exist.
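Putting both corrections together, the relevant part of mapred-site.xml
would look something like the sketch below. It assumes the value 4 from
the original post is kept as it is (it is not a tuning recommendation)
and that the other properties stay unchanged; only the two property
names differ.

```xml
<configuration>
  <!-- Maximum number of map tasks one tasktracker runs at the same time.
       The value 4 is carried over from the original post. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of reduce tasks one tasktracker runs at the same time. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Other properties from the original post (mapred.job.tracker,
       mapred.child.java.opts, mapred.map.tasks, mapred.reduce.tasks)
       are omitted here and would stay as they were. -->
</configuration>
```

As far as I know, the tasktracker reads these limits when it starts, so
it needs a restart after the change.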
On Thu, Mar 15, 2012 at 8:35 AM, Ferdy Galema <fe...@kalooga.com> wrote:
> Hi,
>
> mapred.map.tasks and mapred.reduce.tasks define the total number of tasks.
>
> mapred.tasktracker.map.tasks.maximum
> and mapred.tasktracker.reduce.tasks.maximum define the maximum number of
> tasks running at the same time, per tasktracker.
>
> It seems you made a mistake with the property
> "mapred.job.tracker.map.tasks.maximum"; it does not seem to exist.
>
> Ferdy
Re: Configuring nutch to run on hadoop
Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,
mapred.map.tasks and mapred.reduce.tasks define the total number of tasks.
mapred.tasktracker.map.tasks.maximum
and mapred.tasktracker.reduce.tasks.maximum define the maximum number of
tasks running at the same time, per tasktracker.
It seems you made a mistake with the property
"mapred.job.tracker.map.tasks.maximum"; it does not seem to exist.
Ferdy