Posted to user@nutch.apache.org by Amna Waqar <am...@gmail.com> on 2011/02/01 10:48:14 UTC

Help : Nutch indexing mechanism

Hello everybody,
I want to know how Nutch actually does indexing. What are the steps involved
in indexing?
Thanks in advance.
Regards,
Amna Waqar

RE: Help : Nutch indexing mechanism

Posted by a a <mb...@msn.com>.

Hi,

How Nutch works:

   1. Create a new WebDB (admin db -create).
   2. Inject root URLs into the WebDB (inject).
   3. Generate a fetchlist from the WebDB in a new segment (generate).
   4. Fetch content from URLs in the fetchlist (fetch).
   5. Update the WebDB with links from fetched pages (updatedb).
   6. Repeat steps 3-5 until the required depth is reached.
   7. Update segments with scores and links from the WebDB (updatesegs).
   8. Index the fetched pages (index).
   9. Eliminate duplicate content (and duplicate URLs) from the indexes (dedup).
  10. Merge the indexes into a single index for searching (merge).
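
The ordering of the steps above, including the repeated generate/fetch/updatedb round in step 6, can be sketched in a few lines of Python. This is only an illustration of the sequence, not a Nutch API or a wrapper around it: the subcommand names are the ones given in the list, the helper crawl_plan and its depth parameter are hypothetical, and command arguments are left out on purpose because they differ between Nutch versions.

```python
from typing import List

# Subcommand names come from the list above; arguments are deliberately
# omitted because they vary between Nutch versions.
SETUP = ["admin db -create", "inject"]                # steps 1-2
CRAWL_ROUND = ["generate", "fetch", "updatedb"]       # steps 3-5, repeated
FINISH = ["updatesegs", "index", "dedup", "merge"]    # steps 7-10


def crawl_plan(depth: int) -> List[str]:
    """Return the ordered step names for a crawl of the given depth.

    Hypothetical helper: it only models the order of the steps and does
    not invoke Nutch.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    plan = list(SETUP)
    for _ in range(depth):
        # Step 6: one generate/fetch/updatedb round per level of depth.
        plan.extend(CRAWL_ROUND)
    plan.extend(FINISH)
    return plan


if __name__ == "__main__":
    for number, step in enumerate(crawl_plan(depth=2), start=1):
        print(f"{number:2d}. {step}")
```

With depth=2 the plan contains two generate/fetch/updatedb rounds between the setup steps and the final updatesegs, index, dedup and merge steps. Indexing itself happens only once, after all the fetch rounds are finished.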



Mehdi

> Date: Tue, 1 Feb 2011 04:48:14 -0500
> Subject: Help : Nutch indexing mechanism
> From: amna.waqar.ee@gmail.com
> To: user@nutch.apache.org