Posted to dev@nutch.apache.org by Michael Nebel <mi...@nebel.de> on 2006/01/31 15:52:20 UTC

mapred: config parameters

Hi,

over the last few days I gave the mapred branch a try, and I was impressed!

But I still have a problem with incremental crawling. My setup: I have 
4 boxes (1x namenode/jobtracker, 3x datanode/tasktracker). One round of 
"crawling" consists of the following steps (see the command sketch after 
the list):

- generate (I set a limit of "-topN 10000000")
- fetch
- update
- index
- invertlinks
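
For reference, one round of this cycle corresponds roughly to the 
commands below. This is only a sketch: the crawl/crawldb, crawl/segments 
and crawl/linkdb paths and the segment name are placeholders, and the 
exact syntax may differ slightly in your mapred checkout.

    # generate a fetch list from the crawldb (topN as in my setup)
    bin/nutch generate crawl/crawldb crawl/segments -topN 10000000

    # fetch the newly generated segment (segment name is an example)
    s=crawl/segments/20060131155220
    bin/nutch fetch $s

    # update the crawldb with the fetch results
    bin/nutch updatedb crawl/crawldb $s

    # build the linkdb, then index (indexing reads the linkdb,
    # so invertlinks has to run before index)
    bin/nutch invertlinks crawl/linkdb $s
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s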

For the first round, I injected a list of about 20,000 websites. When 
running Nutch, I expected the fetcher to be pretty busy, so I went for a 
coffee. OK, perhaps someone talked to my wife and decided I should not 
drink so much coffee - but I think I made a mistake: after about 100 
URLs the fetcher stopped working.

After some tweaking I got the installation to fetch about 10,000 pages, 
but this is still not what I expected. My first guess was the URL 
filter, but I can see the URLs in the tasktracker log. I searched the 
mailing list and found many ideas, but only got more confused.

I think the following parameters influence the number of pages fetched 
(the values I selected are in brackets; a config sketch follows the list):

- mapred.map.tasks                  (100)
- mapred.reduce.tasks               (3)
- mapred.task.timeout               (3600000 ms [another question])
- mapred.tasktracker.tasks.maximum  (10)
- fetcher.threads.fetch             (100)
- fetcher.server.delay              (5.0)
- fetcher.threads.per.host          (10)
- generate.max.per.host             (1000)
- http.content.limit                (2000000)
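
For reference, these properties go into conf/nutch-site.xml (depending 
on your checkout, the mapred.* ones may live in the Hadoop site file 
instead). A minimal sketch showing two of the values above; the layout 
is the usual Nutch/Hadoop property format, the values are simply the 
ones I chose:

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml (sketch; only two properties shown) -->
    <configuration>
      <property>
        <name>mapred.map.tasks</name>
        <value>100</value>
      </property>
      <property>
        <!-- threads used by each fetch (map) task -->
        <name>fetcher.threads.fetch</name>
        <value>100</value>
      </property>
    </configuration>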

I don't like these parameters, but this is how I got the most results. 
Looking at the jobtracker, each map task fetched between 70 and 100 
pages. With 100 map tasks that makes roughly 8,000 new pages fetched in 
the end (100 tasks x ~80 pages), which is close to the number the 
crawldb reports, too.

Which parameter influences the number of pages ONE task fetches? From my 
observations I would guess it's "fetcher.threads.fetch", but increasing 
that number further would just blast the load on the tasktrackers. So 
there must be another problem.

Any help appreciated!

Regards

	Michael

-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


misconfigured http.robots.agents (was Re: mapred: config parameters)

Posted by Michael Nebel <mi...@nebel.de>.
Hi,

as I expected, the error sat in front of my computer. :-(

I had changed http.agent.name and added the new name to 
http.robots.agents. So far so good, but my mistake was that I did not 
put the new name in the first position. What finally tipped me off was 
the SEVERE error in the tasktracker log. After fixing this problem, 
everything works really well! (See the config sketch below.)
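
For anyone running into the same SEVERE error, here is a sketch of the 
two properties in conf/nutch-site.xml. The agent name "Netluchs" is only 
an example; the point is that the value of http.agent.name must appear 
FIRST in the comma-separated http.robots.agents list:

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml (sketch; "Netluchs" is an example name) -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>Netluchs</value>
      </property>
      <property>
        <!-- the advertised agent name must come first in this list -->
        <name>http.robots.agents</name>
        <value>Netluchs,*</value>
      </property>
    </configuration>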

Lesson learned: if the developers log a SEVERE error, don't ignore it 
  - fix it!

Regards

	Michael



Gal Nitzan wrote:

> Hi Michael,
> 
> This question should be asked on the nutch-users list.
> 
> Take a look at the thread "So many Unfetched Pages using MapReduce".
> 
> G.
> 
> On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:
> [... full original message quoted; snipped here ...]


-- 
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/

Re: mapred: config parameters

Posted by Gal Nitzan <gn...@usa.net>.
Hi Michael,

This question should be asked on the nutch-users list.

Take a look at the thread "So many Unfetched Pages using MapReduce".

G.

On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:
> [... full original message quoted; snipped here ...]