Posted to user@nutch.apache.org by Matei Zaharia <ma...@eecs.berkeley.edu> on 2007/11/10 20:57:19 UTC
Fetching many pages off LAN
Hi,
I am using Nutch to index about 1 million static HTML pages on a
single server on my LAN, using a cluster of ~20 machines. However,
whenever I perform a fetch, Nutch only uses two map workers despite
the fact that there are 20 in the cluster and ends up giving 90% of
the pages to one of them. For example, I created a fetchlist of 10,000
pages and ended up with one mapper fetching 175 of them and one
fetching 9000. What can I do to use more mappers and partition the
load more evenly? My web server should be able to handle more
connections at once.
Thanks,
Matei Zaharia
Re: Fetching many pages off LAN
Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
> What kind of patterns do the URLs have?
>
> My wild guess: the complete set of URLs covers only a limited set of
> domains (surely fewer than 20).
> HashPartitioner, which partitions the URLs by domain, is the
> class to look at.
> If that is the case, you will have to write a custom Partitioner.
Yes, this is indeed the case. Thanks for the tip.
Re: Fetching many pages off LAN
Posted by Sagar Naik <sa...@visvo.com>.
Hey,
What kind of patterns do the URLs have?
My wild guess: the complete set of URLs covers only a limited set of
domains (surely fewer than 20).
HashPartitioner, which partitions the URLs by domain, is the
class to look at.
If that is the case, you will have to write a custom Partitioner.
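To illustrate why keying on the host skews the load, here is a minimal sketch that needs no Hadoop or Nutch on the classpath. It is not Nutch's actual partitioner: the class PartitionSketch, its methods, and the host intranet.example are hypothetical names used only for illustration. It assumes the usual partitioner contract of mapping a key to a partition index as a non-negative hash modulo the number of partitions.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {
    // Host-based key: every URL on the same host maps to the same partition.
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Whole-URL key: pages on a single host spread across partitions.
    static int byUrl(String url, int numPartitions) {
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many URLs each partition receives under the chosen key.
    static int[] histogram(List<String> urls, int numPartitions, boolean useHost) {
        int[] counts = new int[numPartitions];
        for (String url : urls) {
            int p = useHost ? byHost(url, numPartitions) : byUrl(url, numPartitions);
            counts[p]++;
        }
        return counts;
    }

    // Number of partitions that received at least one URL.
    static int nonEmpty(int[] counts) {
        int n = 0;
        for (int c : counts) {
            if (c > 0) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical fetchlist: 10,000 pages, all on one LAN host.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            urls.add("http://intranet.example/page" + i + ".html");
        }
        System.out.println("by host: " + Arrays.toString(histogram(urls, 10, true)));
        System.out.println("by URL:  " + Arrays.toString(histogram(urls, 10, false)));
    }
}
```

With a single host, the host-based key sends every page to one partition no matter how many partitions are requested, while the whole-URL key spreads them out. The trade-off is that pages from one host then reach several fetcher tasks at once, so any per-host politeness limit that each task enforces on its own no longer caps the total load on that server.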
Re: Fetching many pages off LAN
Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
An update: I noticed that I hadn't specified -numFetchers as a
command-line argument, so I tried setting it to 10, but even then I end up
with 100-200 pages for most fetchers and 9000 for one of them.
Re: Fetching many pages off LAN
Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
I have my threads per fetcher, threads per host, and fetcher delay set
as follows:
<property>
<name>fetcher.threads.fetch</name>
<value>25</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>150</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
However, no matter what I do, I only get 2 mappers.
Matei
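As a rough check on what these settings allow, the sketch below works out the upper bound on simultaneous connections to one host, using only the two property descriptions quoted above. The class and method names are hypothetical, and it assumes that each fetcher task enforces its thread limits independently of the others; that is an assumption, not something this thread confirms.

```java
public class FetchConcurrency {
    // One fetcher task can never use more threads than fetcher.threads.fetch,
    // nor more per host than fetcher.threads.per.host.
    static int maxConnectionsPerHost(int threadsFetch, int threadsPerHost) {
        return Math.min(threadsFetch, threadsPerHost);
    }

    // Upper bound if every fetcher task targets the same host
    // (assumes limits are enforced per task, not cluster-wide).
    static int clusterMax(int fetcherTasks, int threadsFetch, int threadsPerHost) {
        return fetcherTasks * maxConnectionsPerHost(threadsFetch, threadsPerHost);
    }

    public static void main(String[] args) {
        int threadsFetch = 25;     // fetcher.threads.fetch from the config above
        int threadsPerHost = 150;  // fetcher.threads.per.host from the config above
        System.out.println("per task:    " + maxConnectionsPerHost(threadsFetch, threadsPerHost));
        System.out.println("2 mappers:   " + clusterMax(2, threadsFetch, threadsPerHost));
        System.out.println("20 mappers:  " + clusterMax(20, threadsFetch, threadsPerHost));
    }
}
```

Under these assumptions the per-host value of 150 is never the binding limit, because each task has only 25 threads. The number of fetcher tasks that actually receive pages is what bounds the total.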
Re: Fetching many pages off LAN
Posted by Sebastian Steinmetz <s....@mederi-research.de>.
Well,
without knowing your configuration it's a bit hard to tell, but I
think you may have set "fetcher.threads.per.host" too low (2, maybe?).
Hope it helps,
Sebastian Steinmetz