Posted to user@nutch.apache.org by Ian Reardon <ir...@gmail.com> on 2005/05/04 19:15:10 UTC

Some Nutch Questions

I would like to build a search engine based on a handful of
hand-picked sites from a specific domain. I have a few questions.

How many documents can I fit on a single-server installation (dual-CPU
Xeon)? Assuming disk space is not a constraint, approximately how many
documents can a single node hold while still giving respectable search
performance?

My idea is to have a handful of sites that I judge for quality and
to index them on a regular basis, maybe once a month. I would like to
add new sites over time. Does this sound feasible with Nutch?

What method would be best for this type of application? I set up Nutch
and crawled a very small sample using method 1 in the tutorial,
"Intranet crawl", but I was unable to get the whole-web crawl to work.
What is that -dmozfile flag? I don't want to base this on DMOZ. If
anyone could point me to documentation or a tutorial that better
explains whole-web crawling, I would appreciate it. Thanks a lot.