Posted to user@nutch.apache.org by mengel <me...@163.com> on 2009/12/11 11:42:48 UTC

nutch's design document

Hello,
   I am new to Nutch and want to learn it, but I can't find a design document that covers things such as the architecture. Can you give me some advice on how to learn Nutch? Thank you very much.

                                                             Mengel


Re: nutch's design document

Posted by MilleBii <mi...@gmail.com>.
Welcome !!!

Nutch is different from anything else I have seen before, but it's
great and also difficult, so expect to spend some time.

The best way to learn is to practice until you understand what it does.

1. Front-End (search): a web site that wraps a Lucene-based
index. If you are not familiar with Lucene you can buy yourself the
book Lucene in Action, but it is not really necessary. You can also
use Solr as a more sophisticated front end.

2. Back-End (crawling to indexing)

Crawling is done in a number of steps (read the wiki) and uses two
critical databases, crawldb and linkdb, to maintain a graph of where the
engine has gone.
It will fetch, parse, and index pages.
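
To make the loop above a bit more concrete, here is a toy sketch in Python.
It is not Nutch code: FAKE_WEB, crawl() and the status strings are made up
for illustration, the "web" is an in-memory dict so nothing is downloaded,
and indexing is left out. It only shows the idea of a crawldb (URL to fetch
status) and a linkdb (URL to the pages linking to it) growing round by round.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> outgoing links (stands in for fetching).
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}


def crawl(seeds, rounds):
    """Run a few generate/fetch/parse/update rounds over FAKE_WEB."""
    crawldb = {url: "unfetched" for url in seeds}  # URL -> fetch status
    linkdb = defaultdict(set)                      # URL -> URLs linking to it
    for _ in range(rounds):
        # "generate": select the URLs that have not been fetched yet
        fetchlist = sorted(u for u, s in crawldb.items() if s == "unfetched")
        for url in fetchlist:
            # "fetch" + "parse": look the page up and read its outlinks
            outlinks = FAKE_WEB.get(url)
            if outlinks is None:
                crawldb[url] = "gone"
                continue
            crawldb[url] = "fetched"
            # "update": record newly discovered URLs and the incoming links
            for target in outlinks:
                crawldb.setdefault(target, "unfetched")
                linkdb[target].add(url)
    return crawldb, dict(linkdb)


if __name__ == "__main__":
    crawldb, linkdb = crawl(["http://a.example/"], rounds=2)
    for url, status in sorted(crawldb.items()):
        print(url, status, sorted(linkdb.get(url, [])))
```

Each round only fetches what earlier rounds discovered, which is why a
deeper crawl needs more rounds.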

3. Cluster / Cloud computing
Nutch is based on Hadoop and uses the map/reduce parallel processing
technique for the different steps.
There is a Hadoop book you can buy.
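
If map/reduce is new to you, the following small Python sketch may help.
It does not use the Hadoop API and is not how Nutch is written; the function
names are invented for this example. It shows the pattern on one job that
fits a crawler well: turning "page -> outgoing links" into "page -> incoming
links". The map step emits key/value pairs, the framework groups them by key,
and the reduce step combines the values for each key.

```python
from collections import defaultdict


def map_outlinks(page):
    """Map step: emit one (target, source) pair per outgoing link."""
    source, outlinks = page
    return [(target, source) for target in outlinks]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_inlinks(target, sources):
    """Reduce step: merge all sources of one target into a sorted list."""
    return target, sorted(set(sources))


def invert_links(pages):
    """Run map, shuffle and reduce in one process (a cluster would split them)."""
    pairs = [pair for page in pages for pair in map_outlinks(page)]
    return dict(reduce_inlinks(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    pages = [("a", ["b", "c"]), ("b", ["c"])]
    print(invert_links(pages))
```

Because each map call looks at one page only and each reduce call looks at
one key only, the work can be spread over many machines.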

Good luck and see you on the mailing list.



-- 
-MilleBii-