Posted to user@nutch.apache.org by Michael Nebel <mi...@nebel.de> on 2006/02/01 10:59:25 UTC
misconfigured http.robots.agents (was Re: mapred: config parameters)
Hi,
As I expected: the error sat in front of the computer. :-(
I had changed http.agent.name and added the new name to
http.robots.agents. So far so good, but my mistake was that I did not
put the new name in the first position. Eventually the SEVERE error in
the tasktracker log pointed me to it. After fixing that, everything
works really fine!
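For anyone who runs into the same thing, this is roughly what the fixed
part of my conf/nutch-site.xml looks like now (the agent name
"MyCrawler" is just a placeholder, not my real one):

  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <!-- the value of http.agent.name has to come first,
         otherwise the robots parser logs a SEVERE error -->
    <value>MyCrawler,*</value>
  </property>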
Lesson learned: if the developers log something at SEVERE level, don't
ignore it - fix it!
Regards
Michael
Gal Nitzan wrote:
> Hi Michael,
>
> this question should be asked in the nutch-users list.
>
> Take a look at the thread "So many Unfetched Pages using MapReduce".
>
> G.
>
> On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:
>
>>Hi,
>>
>>Over the last few days I gave the mapred branch a try, and I was impressed!
>>
>>But I still have a problem with incremental crawling. My setup: I
>>have 4 boxes (1x namenode/jobtracker, 3x datanode/tasktracker). One
>>round of crawling consists of the following steps (sketched as
>>commands after the list):
>>
>>- generate (I set a limit of "-topN 10000000")
>>- fetch
>>- update
>>- index
>>- invertlinks
>>
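>>A minimal sketch of one such round with the standard command-line
>>tools (the paths are examples, not my real layout; note that
>>invertlinks has to run before index, since indexing reads the linkdb):
>>
>>  # generate a fetchlist, fetch it, fold the results back in
>>  bin/nutch generate crawl/crawldb crawl/segments -topN 10000000
>>  s=`ls -d crawl/segments/* | tail -1`   # the newly generated segment
>>  bin/nutch fetch $s
>>  bin/nutch updatedb crawl/crawldb $s
>>  # build the link database, then index everything
>>  bin/nutch invertlinks crawl/linkdb crawl/segments/*
>>  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
>>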
>>For the first round, I injected a list of about 20,000 websites. When
>>I ran Nutch, I expected the fetcher to be pretty busy, so I went for
>>a coffee. OK: perhaps someone talked to my wife and decided I should
>>not drink so much coffee. But I think I made a mistake: after about
>>100 URLs the fetcher stopped working.
>>
>>After some tweaking I got the installation to fetch about 10,000
>>pages, but that is still not what I expected. My first guess was the
>>URL filter, but I can see the URLs in the tasktracker log. I searched
>>the mailing list and got many ideas, but only became more confused.
>>
>>I think the following parameters influence the number of pages
>>fetched (the values I selected are in brackets; a config sketch
>>follows the list):
>>
>>- mapred.map.tasks (100)
>>- mapred.reduce.tasks (3)
>>- mapred.task.timeout (3600000 [another question])
>>- mapred.tasktracker.tasks.maximum (10)
>>- fetcher.threads.fetch (100)
>>- fetcher.server.delay (5.0)
>>- fetcher.threads.per.host (10)
>>- generate.max.per.host (1000)
>>- http.content.limit (2000000)
>>
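>>For reference, these are ordinary Hadoop/Nutch-style properties; a
>>trimmed sketch of how two of them look in my conf/nutch-site.xml
>>(same pattern for the rest):
>>
>>  <property>
>>    <name>mapred.map.tasks</name>
>>    <value>100</value>
>>  </property>
>>  <property>
>>    <name>fetcher.threads.fetch</name>
>>    <value>100</value>
>>  </property>
>>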
>>I don't like these parameters, but with them I got the most results.
>>Looking at the jobtracker, each map task fetched between 70 and 100
>>pages. With 100 map tasks that comes to roughly 8,000 new pages
>>fetched in the end, which is close to the number the crawldb reports.
>>
>>Which parameter influences the number of pages ONE task fetches? From
>>my observations I would guess "fetcher.threads.fetch", but increasing
>>that number further would just blast the load onto the tasktrackers.
>>So there must be another problem.
>>
>>Any help appreciated!
>>
>>Regards
>>
>> Michael
>>
>
>
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/