Posted to user@nutch.apache.org by Matei Zaharia <ma...@eecs.berkeley.edu> on 2007/11/10 20:57:19 UTC

Fetching many pages off LAN

Hi,

I am using Nutch to index about 1 million static HTML pages on a  
single server on my LAN, using a cluster of ~20 machines. However,  
whenever I perform a fetch, Nutch only uses two map workers despite  
the fact that there are 20 in the cluster and ends up giving 90% of  
the pages to one of them. For example, I created a fetchlist of 10,000  
pages and ended up with one mapper fetching 175 of them and one  
fetching 9000. What can I do to use more mappers and partition the  
load more evenly? My web server should be able to handle more  
connections at once.

Thanks,

Matei Zaharia

Re: Fetching many pages off LAN

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
> What kind of patterns do the url have ??
>
> This is my wild guess:  you have a limited (surely less than 20) set  
> of domains for the complete set of urls.
> HashPartitioner , which partitions the urls based on domains is the  
> class to look at.
> And if this is true, you will have to write a custom Partitioner

Yes, this is indeed the case. Thanks for the tip.


Re: Fetching many pages off LAN

Posted by Sagar Naik <sa...@visvo.com>.
Matei Zaharia wrote:
> An update.. I noticed that I hadn't specified -numFetchers as a 
> command line argument, so I tried setting that to 10, but even then I 
> end up with 100-200 pages for most fetchers and 9000 for one of them.
>
Hey,

What kind of patterns do the URLs have?

This is my wild guess: you have a limited set of domains (surely fewer 
than 20) for the complete set of URLs.
HashPartitioner, which partitions the URLs based on domain, is the 
class to look at.
And if this is true, you will have to write a custom Partitioner.
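
To illustrate the guess above, here is a minimal, dependency-free sketch of why hashing on the host sends every page of a single-host crawl to the same partition, while hashing on the full URL spreads the pages out. It does not use Nutch's or Hadoop's actual classes: the class name PartitionSketch, its methods, and the host intranet.example are all hypothetical, and the hash-modulo formula is only assumed to resemble what a hash-based partitioner does.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Nutch's or Hadoop's real Partitioner API.
public class PartitionSketch {

    // Non-negative hash of the key, reduced modulo the number of partitions.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Partition by host: all URLs on one host map to one partition.
    static int byHost(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return partitionOf(host, numPartitions);
    }

    // Partition by the whole URL: pages on one host spread across partitions.
    static int byUrl(String url, int numPartitions) {
        return partitionOf(url, numPartitions);
    }

    // Count how many URLs land in each partition under the chosen scheme.
    static int[] histogram(List<String> urls, int numPartitions, boolean useHost) {
        int[] counts = new int[numPartitions];
        for (String url : urls) {
            int p = useHost ? byHost(url, numPartitions) : byUrl(url, numPartitions);
            counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10,000 pages on a single (made-up) LAN host, split over 10 fetchers.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            urls.add("http://intranet.example/page" + i + ".html");
        }
        System.out.println("by host: " + Arrays.toString(histogram(urls, 10, true)));
        System.out.println("by URL:  " + Arrays.toString(histogram(urls, 10, false)));
    }
}
```

Note the trade-off: keeping one host in one partition lets a single fetcher enforce the per-host delay and thread limits, so a URL-based scheme like the one sketched here only makes sense when, as in this thread, the server is your own and can take the extra load.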






Re: Fetching many pages off LAN

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
An update: I noticed that I hadn't specified -numFetchers as a command-line 
argument, so I tried setting it to 10, but even then I end up 
with 100-200 pages for most fetchers and 9000 for one of them.



Re: Fetching many pages off LAN

Posted by Matei Zaharia <ma...@eecs.berkeley.edu>.
I have my threads per fetcher, threads per host, and fetcher delay set  
as follows:

<property>
   <name>fetcher.threads.fetch</name>
   <value>25</value>
   <description>The number of FetcherThreads the fetcher should use.
     This also determines the maximum number of requests that are
     made at once (each FetcherThread handles one connection).</description>
</property>

<property>
   <name>fetcher.threads.per.host</name>
   <value>150</value>
   <description>This number is the maximum number of threads that
     should be allowed to access a host at one time.</description>
</property>

<property>
   <name>fetcher.server.delay</name>
   <value>1.0</value>
   <description>The number of seconds the fetcher will delay between
    successive requests to the same server.</description>
</property>


However, no matter what I do, I only get 2 mappers.

Matei

On Nov 10, 2007, at 12:18 PM, Sebastian Steinmetz wrote:

> Well,
>
> without knowing your configuration it's a bit hard to tell, but i  
> think, you may have set "fetcher.threads.per.host" too low (2 maybe?)
>
> hope it helps,
> 	Sebastian Steinmetz
>
>


Re: Fetching many pages off LAN

Posted by Sebastian Steinmetz <s....@mederi-research.de>.
Well,

without knowing your configuration it's a bit hard to tell, but I  
think you may have set "fetcher.threads.per.host" too low (2, maybe?).

Hope it helps,
	Sebastian Steinmetz


On 10.11.2007 at 20:57, Matei Zaharia wrote:

> Hi,
>
> I am using Nutch to index about 1 million static HTML pages on a  
> single server on my LAN, using a cluster of ~20 machines. However,  
> whenever I perform a fetch, Nutch only uses two map workers despite  
> the fact that there are 20 in the cluster and ends up giving 90% of  
> the pages to one of them. For example, I created a fetchlist of  
> 10,000 pages and ended up with one mapper fetching 175 of them and  
> one fetching 9000. What can I do to use more mappers and partition  
> the load more evenly? My web server should be able to handle more  
> connections at once.
>
> Thanks,
>
> Matei Zaharia