Posted to user@nutch.apache.org by Bruno Thiel <br...@objectconsulting.com.au> on 2006/10/18 02:21:07 UTC

Indexing the file system / best approach

All,

I want to get Nutch to index the file system. My first approach was to
NFS-mount the file system and let Nutch crawl through the hierarchy over
HTTP/Apache. This turned out to be fairly slow, at ~3,000 fetches per hour.
The next approach was to go via file:/// and to generate a file list
to be crawled. This file list is fairly big, ~200,000 entries, and with the
current 0.8.1 release of Nutch the fetcher just freezes right at the end of
a crawl. Other strategies, such as splitting the file list into smaller parts
of ~20,000 entries and subsequently merging the indexes, still fail for the
same reason.
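
For illustration only, the file-list approach described above (walk the tree, emit one file: URL per file, split the list into parts of ~20,000) could be sketched roughly as follows. This is not the script actually used here; the function names, the seed-NNN.txt naming, and the output directory are all assumptions, and nothing in it calls Nutch itself.

```python
import os
from pathlib import Path


def list_file_urls(root):
    """Yield a file:// URL for every file found below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # resolve() makes the path absolute, which as_uri() requires
            yield Path(dirpath, name).resolve().as_uri()


def _flush(out_dir, index, chunk):
    """Write one chunk of URLs to a numbered seed file (hypothetical naming)."""
    path = out_dir / f"seed-{index:03d}.txt"
    path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
    return path


def write_seed_lists(urls, out_dir, chunk_size=20000):
    """Split urls into seed files of at most chunk_size lines each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, chunk = [], []
    for url in urls:
        chunk.append(url)
        if len(chunk) == chunk_size:
            written.append(_flush(out, len(written), chunk))
            chunk = []
    if chunk:  # remaining URLs that did not fill a whole chunk
        written.append(_flush(out, len(written), chunk))
    return written
```

Something like write_seed_lists(list_file_urls("/mnt/docs"), "/tmp/seeds") would then produce the per-part lists; both paths are placeholders, and the output directory should sit outside the tree being walked so the seed files are not listed themselves.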

Is anybody in the community doing an extensive crawl of the file system
with Nutch? If so, what's your setup?

Cheers, Bruno

Re: Indexing the file system / best approach

Posted by Sami Siren <ss...@gmail.com>.

Bruno Thiel wrote:
> All,
>
> I want to get Nutch to index the file system. My first approach was to
> NFS-mount the file system and let Nutch crawl through the hierarchy over
> HTTP/Apache. This turned out to be fairly slow, at ~3,000 fetches per hour.
> The next approach was to go via file:/// and to generate a file list
> to be crawled. This file list is fairly big, ~200,000 entries, and with the
> current 0.8.1 release of Nutch the fetcher just freezes right at the end of
> a crawl.
What exactly happens when your fetcher freezes? 200 000 entries is not a
big list to be fetched.

--
 Sami Siren