Posted to user@nutch.apache.org by Michael Nebel <mi...@nebel.de> on 2006/02/01 10:59:25 UTC

misconfigured http.robots.agents (was Re: mapred: config parameters)

Hi,

as I expected: the error was sitting in front of the computer. :-(

I changed http.agent.name and added the new name to http.robots.agents. So 
far so good, but my mistake was that I did not put the new name in the first 
position. What finally tipped me off was the SEVERE error in the 
tasktracker log. After fixing the ordering, everything works really fine!
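
For the record, the relevant part of my nutch-site.xml now looks roughly 
like this (the agent name is just an example; the important bit is that it 
comes first in http.robots.agents, with the default * last):

  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>

  <property>
    <name>http.robots.agents</name>
    <!-- comma-separated, in decreasing precedence; the value of
         http.agent.name must come first, the default * stays last -->
    <value>MyCrawler,*</value>
  </property>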

Lesson learned: if the developers log a SEVERE error, don't ignore it 
  - fix it!

Regards

	Michael



Gal Nitzan wrote:

> Hi Michael,
> 
> this question should be asked in the nutch-users list.
> 
> Take a look at the thread "So many Unfetched Pages using MapReduce".
> 
> G.
> 
> On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:
> 
>>Hi,
>>
>>over the last few days I gave the mapred branch a try, and I was impressed!
>>
>>But I still have a problem with incremental crawling. My setup: I 
>>have 4 boxes (1x namenode/jobtracker, 3x datanode/tasktracker). Running 
>>one round of crawling consists of the following steps (roughly the 
>>commands sketched below the list):
>>
>>- generate (I set a limit of "-topN 10000000")
>>- fetch
>>- update
>>- index
>>- invertlinks
>>
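>>Roughly, with example paths, one round is something like this (the exact 
>>arguments may differ depending on your checkout of the mapred branch):
>>
>>  bin/nutch generate crawl/crawldb crawl/segments -topN 10000000
>>  SEG=crawl/segments/`ls crawl/segments | tail -1`  # segment just created
>>  bin/nutch fetch $SEG
>>  bin/nutch updatedb crawl/crawldb $SEG
>>  bin/nutch invertlinks crawl/linkdb $SEG   # index needs the linkdb
>>  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $SEG
>>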
>>For the first round, I injected a list of about 20,000 websites. When 
>>running Nutch, I expected the fetcher to be pretty busy, so I went for a 
>>coffee. OK, maybe someone talked to my wife and decided I should not drink 
>>so much coffee - but I think I made a mistake somewhere: after about 100 
>>URLs the fetcher stopped working.
>>
>>After some tweaking I got the installation to fetch about 10,000 pages, 
>>but this is still not what I expect. My first guess was the URL filter, 
>>but I can see the URLs in the tasktracker log. I searched the mailing 
>>list and got many ideas, but I only got more confused.
>>
>>I think the following parameters influence the number of pages fetched 
>>(the values I selected are in brackets; a config sketch follows the list):
>>
>>- mapred.map.tasks                  (100)
>>- mapred.reduce.tasks               (3)
>>- mapred.task.timeout               (3600000 [another question])
>>- mapred.tasktracker.tasks.maximum  (10)
>>- fetcher.threads.fetch             (100)
>>- fetcher.server.delay              (5.0)
>>- fetcher.threads.per.host          (10)
>>- generate.max.per.host             (1000)
>>- http.content.limit                (2000000)
>>
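>>(At least the fetcher.* and generate.* ones live in nutch-site.xml like 
>>any other property; trimmed to two examples with the values above:)
>>
>>  <property>
>>    <name>fetcher.threads.fetch</name>
>>    <value>100</value>
>>  </property>
>>
>>  <property>
>>    <name>generate.max.per.host</name>
>>    <value>1000</value>
>>  </property>
>>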
>>I don't like these parameters, but with them I got the most results. 
>>Looking at the jobtracker, each map task fetched between 70 and 100 pages. 
>>With 100 map tasks that gives roughly 8,000 new pages fetched in the end, 
>>which is close to the number the crawldb reports too.
>>
>>Which parameter influences the number of pages ONE task fetches? From my 
>>observations I would guess it is "fetcher.threads.fetch", but increasing 
>>that number further just blasts the load onto the tasktrackers. So there 
>>must be another problem.
>>
>>Any help appreciated!
>>
>>Regards
>>
>>	Michael
>>
> 
> 


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/