Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2011/02/07 22:32:17 UTC

Re: Performance Configuration on Focused Web Crawl

Hi Hannes,

I'm curious whether you got this configuration running, what issues you ran
into, and what performance you saw.

Thanks,

-- Ken


On Nov 20, 2010, at 10:52am, Hannes Carl Meyer wrote:

> Ken, thanks, I guess that's a good hint!
>
> I'm using the simple org.apache.nutch.crawl.Crawl class to perform the
> crawl, so I guess the configuration of the map-reduce job is pretty basic.
>
> @Andrzej, could you give me a hint where to configure the number of reduce
> tasks in Nutch 0.9? (running on a single machine)
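>
> I suppose the knob is Hadoop's mapred.reduce.tasks property, e.g. in
> conf/hadoop-site.xml (just a guess on my side, with an arbitrary example
> value):
>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>4</value> <!-- example value only -->
>   </property>
>
> Or does the local job runner always use a single reducer anyway?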
>
> Regards,
>
> Hannes
>
> On Sat, Nov 20, 2010 at 7:06 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>
>>
>> On Nov 20, 2010, at 7:51am, Hannes Carl Meyer wrote:
>>
>>> Thank you for sharing your experiences!
>>>
>>> In my case the web servers are pretty stable and we are allowed to perform
>>> intensive crawling, which makes it easy to increase the threads per host.
>>>
>>> IMHO the fetch process isn't really the bottleneck. It is the phase
>>> between fetches, when merging and updating the crawldb.
>>>
>>> We are using 16-core hardware; during the fetch process CPU usage is
>>> around 1000%, but in between fetches it is always around 90-100% on a
>>> single core.
>>>
>>
>> In regular Hadoop map-reduce jobs you get this situation if the job has
>> been configured to use a single reducer, and thus only one core is active.
>>
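>> For illustration, pinning a job to a single reducer would just be something
>> like this in the job setup (a hypothetical sketch, not the actual Nutch code):
>>
>>   // assumes org.apache.hadoop.mapred.JobConf and org.apache.nutch.util.NutchConfiguration
>>   JobConf job = new JobConf(NutchConfiguration.create());
>>   job.setNumReduceTasks(1);  // one reducer => only one core busy in reduce
>>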
>> Though it would surprise me if the crawlDB update job was configured this
>> way, as I don't see a reason why the crawlDB has to be a single file in HDFS.
>>
>> Andrzej and others would know best, of course.
>>
>> -- Ken
>>
>>
>>
>>
>>> On Sat, Nov 20, 2010 at 11:33 AM, Ye T Thet <ye...@gmail.com> wrote:
>>>
>>>> Hannes,
>>>>
>>>> I guess it would depend on the situation:
>>>> - your server specs (where the crawler is running), and
>>>> - the host specs
>>>>
>>>> Anyway, I have been crawling around 50 hosts. I tweaked a few settings to
>>>> get it right for my situation.
>>>>
>>>> Currently I am using 500 threads, and 10 threads per host.
>>>>
>>>> In my opinion, the number of threads for the crawler does not matter
>>>> much, because the crawler does not take many resources (memory and CPU).
>>>> As long as your server's network bandwidth can handle it, it should be
>>>> fine.
>>>>
>>>> In my case, the number of threads per host matters, because some of my
>>>> servers cannot handle that much bandwidth.
>>>>
>>>> Not sure if it helps, but I had to adjust fetcher.server.delay,
>>>> fetcher.server.min.delay and fetcher.max.crawl.delay, because my hosts
>>>> sometimes cannot handle that many threads.
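>>>>
>>>> I.e. something along these lines in nutch-site.xml (example values only,
>>>> not my exact ones - tune them to what your hosts can take):
>>>>
>>>>   <property>
>>>>     <name>fetcher.server.delay</name>
>>>>     <value>2.0</value> <!-- seconds between requests to the same host -->
>>>>   </property>
>>>>   <property>
>>>>     <name>fetcher.server.min.delay</name>
>>>>     <value>0.5</value> <!-- floor applied when threads.per.host > 1 -->
>>>>   </property>
>>>>   <property>
>>>>     <name>fetcher.max.crawl.delay</name>
>>>>     <value>30</value> <!-- skip pages whose robots.txt Crawl-Delay exceeds this (seconds) -->
>>>>   </property>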
>>>>
>>>>
>>>> Warm Regards,
>>>>
>>>> Y.T. Thet
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Nov 18, 2010 at 11:06 PM, Hannes Carl Meyer <hannescarl@googlemail.com> wrote:
>>>>
>>>>> Hi Ken,
>>>>>
>>>>> Our crawler is allowed to hit those hosts frequently at night, so we are
>>>>> not getting a penalty ;-)
>>>>>
>>>>> Could you imagine running Nutch in this case with about 400 threads,
>>>>> with 1 thread per host and a delay of 1.0?
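>>>>>
>>>>> That is, roughly this configuration (a sketch of the setup I have in mind):
>>>>>
>>>>>   <property>
>>>>>     <name>fetcher.threads.fetch</name>
>>>>>     <value>400</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fetcher.threads.per.host</name>
>>>>>     <value>1</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fetcher.server.delay</name>
>>>>>     <value>1.0</value>
>>>>>   </property>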
>>>>>
>>>>> I tried it that way but experienced some really long idle times... My
>>>>> idea was one thread per host; that would mean adding another host
>>>>> requires adding another thread.
>>>>>
>>>>> Regards
>>>>>
>>>>> Hannes
>>>>>
>>>>> On Thu, Nov 18, 2010 at 3:36 PM, Ken Krugler <kkrugler_lists@transpac.com> wrote:
>>>>>
>>>>>> If you're hitting each host with 45 threads, you'd better be on really
>>>>>> good terms with those webmasters :)
>>>>>> terms with those webmasters :)
>>>>>>
>>>>>> With 90 total threads (at 45 per host), that means as few as 2 hosts
>>>>>> are active at any time, yes?
>>>>>>
>>>>>> -- Ken
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Nov 18, 2010, at 3:51am, Hannes Carl Meyer wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm using Nutch 0.9 to crawl about 400 hosts with an average of 600
>>>>>>> pages each. That makes a volume of 240,000 fetched pages, and I want
>>>>>>> to get all of them.
>>>>>>>
>>>>>>> Can anyone give me advice on the right threads/delay/per-host
>>>>>>> configuration in this environment?
>>>>>>>
>>>>>>> My current conf:
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>fetcher.server.delay</name>
>>>>>>>    <value>1.0</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>fetcher.threads.fetch</name>
>>>>>>>    <value>90</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>fetcher.threads.per.host</name>
>>>>>>>    <value>45</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> <property>
>>>>>>>    <name>fetcher.threads.per.host.by.ip</name>
>>>>>>>    <value>false</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> The total runtime is about 5 hours.
>>>>>>>
>>>>>>> How can performance be improved? (I still have enough CPU and
>>>>>>> bandwidth.)
>>>>>>>
>>>>>>> Note: This runs on a single machine; distribution to other machines is
>>>>>>> not planned.
>>>>>>>
>>>>>>> Thanks and Regards
>>>>>>>
>>>>>>> Hannes
>>>>>>>
>>>>>>>
>>>>>> --------------------------
>>>>>> Ken Krugler
>>>>>> +1 530-210-6378
>>>>>> http://bixolabs.com
>>>>>> e l a s t i c   w e b   m i n i n g
>>>>>>
>>>>>
>>>>
>>>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> e l a s t i c   w e b   m i n i n g
>>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g