Posted to user@nutch.apache.org by Saravanaraj Duraisamy <sa...@gmail.com> on 2006/02/06 04:33:38 UTC

Problem indexing Files

Hi, I am using Nutch to index files on the local file system and over FTP.

My filter file is:

-^(http|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^file:/E:/Index Samples/
-^file:/E:/Index Samples/Index/

But Nutch also crawls the forbidden folders. Is there something like the
web DB for files as well? Is it possible to make Nutch index files based on
their last-modified date?

Can anybody suggest a data structure for a web DB (file DB?) for files? It
would be good to group files and create separate segments for each group, so
that if some files change, only those segments need to be replaced.

Rgds,
D.Saravanaraj

Re: Problem indexing Files

Posted by Gal Nitzan <gn...@usa.net>.
Make sure you add a final line containing -. at the end of your regex
filter file to disallow anything else.
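
The behaviour described in the question can be reproduced with a small
sketch. It assumes that the regex URL filter checks rules from top to bottom
and that the first matching rule decides, as the comments in Nutch's default
regex-urlfilter.txt describe. Under that assumption, the
+^file:/E:/Index Samples/ rule matches before the more specific
-^file:/E:/Index Samples/Index/ rule is ever reached, so the order of the two
lines matters as well as the trailing -. rule. The helper names load_rules
and accepts are hypothetical and are not part of Nutch; only a subset of the
posted rules is used, and Python's re.search stands in for the Java matcher.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # A leading '+' means accept; anything else is treated as reject here.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected

# Subset of the rules from the question, in the order they were posted.
ORIGINAL_ORDER = r"""
-^(http|ftp|mailto):
+^file:/E:/Index Samples/
-^file:/E:/Index Samples/Index/
"""

# Same rules with the specific exclusion first and a catch-all '-.' last.
REORDERED = r"""
-^(http|ftp|mailto):
-^file:/E:/Index Samples/Index/
+^file:/E:/Index Samples/
-.
"""

url_ok = "file:/E:/Index Samples/docs/a.txt"
url_forbidden = "file:/E:/Index Samples/Index/b.txt"

print(accepts(load_rules(ORIGINAL_ORDER), url_forbidden))  # True
print(accepts(load_rules(REORDERED), url_forbidden))       # False
print(accepts(load_rules(REORDERED), url_ok))              # True
```

In this sketch the forbidden folder is accepted with the posted order and
rejected once the exclusion is listed before the broader include rule.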

On Mon, 2006-02-06 at 09:03 +0530, Saravanaraj Duraisamy wrote:
> Hi, I am using Nutch to index files on the local file system and over FTP.
> 
> My filter file is:
> 
> -^(http|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|png|PNG|jar)$
> -[?*!@=]
> -.*(/.+?)/.*?\1/.*?\1/
> +^file:/E:/Index Samples/
> -^file:/E:/Index Samples/Index/
> 
> But Nutch also crawls the forbidden folders. Is there something like the
> web DB for files as well? Is it possible to make Nutch index files based on
> their last-modified date?
> 
> Can anybody suggest a data structure for a web DB (file DB?) for files? It
> would be good to group files and create separate segments for each group, so
> that if some files change, only those segments need to be replaced.
> 
> Rgds,
> D.Saravanaraj