Posted to user@nutch.apache.org by djames <dj...@supinfo.com> on 2006/12/27 10:08:55 UTC

Common Nutch administration tasks

Hello dear experts,

I'm new at my company, and they were looking for a solution to crawl a
selection of 1,000,000 URLs.
Naturally my choice was Nutch, for its scalability and its Java code base.
I began working with Nutch three weeks ago and appreciate many things about it,
but I have some questions I can't answer:

How can I refresh my crawl? The script on the wiki page continues the
previous crawl. I tried to start a new crawl, but that is not possible because
the crawl directory already exists.
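
To be concrete, here is the kind of cycle I run against the existing crawl
directory, adapted from the wiki script (the crawl/ paths and the -topN value
are my own choices, so treat this only as a sketch of what I am doing):

    # one incremental round against the existing crawl directory
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
    bin/nutch invertlinks crawl/linkdb $segment
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $segment

This reuses the old crawldb, so pages only come back when they are due for
refetch; what I don't see is how to start over cleanly.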

If I want to add new URLs to my index, must I create a new index and then
merge it with the existing one?
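
For example, is something like this the intended procedure? (This is only my
guess from the command-line help; the directory names are mine.)

    # inject the new URLs into the existing crawldb...
    bin/nutch inject crawl/crawldb new_urls/
    # ...generate/fetch/updatedb/index them as in the cycle above...
    # ...then merge the new index with the old one?
    bin/nutch merge crawl/index_merged crawl/indexes crawl/indexes_new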

I have heard about subcollections, but I don't understand what they are.

And finally, when I set up a MapReduce configuration following the how-to on
the wiki page and launch a crawl, I get this error in the Hadoop logs:

dfs.DataNode - Failed to transfer blk_-4263222254813988872 to
tasktracker:50010
java.net.UnknownHostException: tasktracker
	at java.net.PlainSocketImpl.connect(Unknown Source)
	at java.net.SocksSocketImpl.connect(Unknown Source)
	at java.net.Socket.connect(Unknown Source)
	at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:797)
	at java.lang.Thread.run(Unknown Source)

Even though the box "tasktracker" was reachable before the crawl was launched,
in the administration interface it appears unreachable afterwards. Why?
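
My guess is a name resolution problem on the datanode. What I plan to try next
is declaring every node in /etc/hosts on every machine, roughly like this (the
IP addresses are placeholders for my real ones):

    # on every node (master and slaves), make all hostnames resolvable
    cat >> /etc/hosts <<EOF
    192.168.0.1   master
    192.168.0.2   tasktracker
    EOF
    # then verify from the datanode that reported the error
    ping -c 1 tasktracker

Is that the right direction, or is something else going on?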

Thanks a lot for your help, and congratulations to the developers and on the
wiki.