Posted to user@nutch.apache.org by Hannu Väisänen <hv...@joyx.joensuu.fi> on 2009/04/03 06:35:37 UTC

Nutch can't find all files

I am using Nutch to index my hard disk.

Nutch is skipping some files. They do not show up in the Nutch logs (as
fetching file:...), and it is as if Nutch does not notice that they
exist.

But when I moved one file that Nutch did not notice to a test
directory that had only a few files and indexed only that directory,
Nutch did index the file.

Any ideas on how I can debug the problem?

Re: Nutch can't find all files

Posted by yanky young <ya...@gmail.com>.
Hi:

Of course you can look into the code and add some debug lines for your case.
Just look at the protocol-file plugin, which is supposed to process the
file:// scheme. You can find the plugin code in
${nutch_home}/src/plugin/protocol-file

As for the Nutch fetch list, you can dump the crawldb with the nutch readdb
command.
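
Once the crawldb has been dumped to text, the dump can be compared with
what is actually on disk. The following is only a rough sketch, not
Nutch code: the class and method names are made up, the sample lines in
main are invented placeholders, and it assumes that each record in the
dump starts with the URL followed by whitespace and that local files
appear as file:/absolute/path or file:///absolute/path.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DumpDiff {

    // Reduce "file:/a/b", "file:///a/b" and plain "/a/b" to "/a/b".
    static String normalize(String urlOrPath) {
        String s = urlOrPath.startsWith("file:") ? urlOrPath.substring(5) : urlOrPath;
        return s.replaceFirst("^/+", "/");
    }

    // Collect the URL at the start of every dump line that begins with "file:".
    // ASSUMPTION: a record starts with its URL, followed by whitespace.
    static Set<String> urlsInDump(List<String> dumpLines) {
        Set<String> seen = new HashSet<>();
        for (String line : dumpLines) {
            if (line.startsWith("file:")) {
                seen.add(normalize(line.split("\\s+", 2)[0]));
            }
        }
        return seen;
    }

    // Paths on disk that never show up in the dump.
    static List<String> missing(List<String> dumpLines, List<String> diskPaths) {
        Set<String> seen = urlsInDump(dumpLines);
        List<String> result = new ArrayList<>();
        for (String p : diskPaths) {
            if (!seen.contains(normalize(p))) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented placeholder lines, not real readdb output.
        List<String> dump = List.of(
            "file:/home/me/a.txt\t(record details)",
            "file:///home/me/b.txt\t(record details)");
        List<String> disk = List.of("/home/me/a.txt", "/home/me/b.txt", "/home/me/c.txt");
        System.out.println(missing(dump, disk)); // prints [/home/me/c.txt]
    }
}
```

Paths that contain characters which are escaped in URLs (such as
whitespace) would need extra handling that this sketch leaves out.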

good luck


2009/4/9 Hannu Väisänen <hv...@joyx.joensuu.fi>

> On Wed, Apr 08, 2009 at 08:54:37AM +0200, Andrzej Bialecki wrote:
> > Most likely this is related to the setting db.max.outlinks.per.page. The
> > default is 1000. In case of file:// URLs this means that directory
> > listings with more than 1000 entries will be truncated. Solution: simply
> > increase the limit.
>
> That helped a little. Now Nutch is fetching more files but it is still
> skipping files.
>
> I have more questions.
>
> How does Nutch select the files it fetches?
>
> Is it reading every file name in a directory and then selecting what it
> fetches?
>
> Is it possible to output the file names Nutch considers for fetching?
>
> Where do I look in the code? (-:
>

Re: Nutch can't find all files

Posted by Hannu Väisänen <hv...@joyx.joensuu.fi>.
On Wed, Apr 08, 2009 at 08:54:37AM +0200, Andrzej Bialecki wrote:
> Most likely this is related to the setting db.max.outlinks.per.page. The  
> default is 1000. In case of file:// URLs this means that directory  
> listings with more than 1000 entries will be truncated. Solution: simply  
> increase the limit.

That helped a little. Now Nutch is fetching more files but it is still
skipping files.

I have more questions.

How does Nutch select the files it fetches?

Is it reading every file name in a directory and then selecting what it
fetches?

Is it possible to output the file names Nutch considers for fetching?

Where do I look in the code? (-:

Re: Nutch can't find all files

Posted by Andrzej Bialecki <ab...@getopt.org>.
Hannu Väisänen wrote:
> On Mon, Apr 06, 2009 at 11:18:59PM +0800, yanky young wrote:
>> Maybe it is about Windows path names and file names.
>> In Windows, path names and file names can have whitespace.
> 
> I am running Linux and I have no whitespace in my file names.
> 
> 
>> log4j.logger.org.apache.nutch.protocol.file=DEBUG,cmdstdout
> 
> This did not show the files Nutch is skipping.

Most likely this is related to the setting db.max.outlinks.per.page. The 
default is 1000. In case of file:// URLs this means that directory 
listings with more than 1000 entries will be truncated. Solution: simply 
increase the limit.
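
To check whether a given disk is affected, a small standalone program
can list the directories that hold more entries than the limit. This is
only a sketch and not part of Nutch: the class and method names are
made up for illustration, the fallback of 1000 is simply the figure
quoted above, and treating a negative limit as "no limit" is an
assumption of the sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

public class OversizedDirs {

    // Number of entries that would be dropped if a listing with `entries`
    // items were cut off after `limit` links.
    // ASSUMPTION: a negative limit means "no limit".
    static long dropped(long entries, int limit) {
        return (limit < 0 || entries <= limit) ? 0 : entries - limit;
    }

    // Walk the tree under `root` and return every directory whose listing
    // has more than `limit` entries, mapped to its entry count.
    static Map<Path, Long> oversized(Path root, int limit) {
        Map<Path, Long> result = new TreeMap<>();
        try (Stream<Path> dirs = Files.walk(root)) {
            dirs.filter(Files::isDirectory).forEach(dir -> {
                try (Stream<Path> entries = Files.list(dir)) {
                    long n = entries.count();
                    if (dropped(n, limit) > 0) {
                        result.put(dir, n);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        int limit = args.length > 1 ? Integer.parseInt(args[1]) : 1000;
        oversized(root, limit).forEach((dir, n) ->
            System.out.println(dir + ": " + n + " entries, "
                + dropped(n, limit) + " over the limit"));
    }
}
```

Run it with the directory being indexed and the current value of
db.max.outlinks.per.page as arguments. Any directory it prints is one
whose listing would be cut short under the explanation above.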

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch can't find all files

Posted by Hannu Väisänen <hv...@joyx.joensuu.fi>.
On Mon, Apr 06, 2009 at 11:18:59PM +0800, yanky young wrote:
> Maybe it is about Windows path names and file names.
> In Windows, path names and file names can have whitespace.

I am running Linux and I have no whitespace in my file names.


> log4j.logger.org.apache.nutch.protocol.file=DEBUG,cmdstdout

This did not show the files Nutch is skipping.


Re: Nutch can't find all files

Posted by yanky young <ya...@gmail.com>.
Maybe it is about Windows path names and file names.

In Windows, path names and file names can contain whitespace, but Nutch
does not handle this correctly. At least Nutch 0.9 has a problem with
this. You can enable debug logging for the protocol-file plugin in the
log4j.properties file as follows to see what happens:

log4j.logger.org.apache.nutch.protocol.file=DEBUG,cmdstdout

If that is the case, here is a workaround:

In FileResponse, find these lines:

// url.toURI() is only in j2se 1.5.0
//java.io.File f = new java.io.File(url.toURI());
java.io.File f = new java.io.File(path);

Change them to:

java.io.File f = new java.io.File(url.toURI());
//java.io.File f = new java.io.File(path);

Then run ant compile.
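
Why the two constructors behave differently can be seen in a small
standalone program. This is a sketch, not the Nutch source: the class
and method names are made up, and it assumes that the path variable in
FileResponse holds the undecoded path of the URL, as url.getPath()
would return it.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class FileUrlPaths {

    // Turn a URL string into a java.net.URL without checked exceptions.
    static URL parse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // File built from the URL's path text; "%20" is NOT decoded.
    // ASSUMPTION: this mirrors new java.io.File(path) in the workaround.
    static String rawPath(String fileUrl) {
        return new File(parse(fileUrl).getPath()).getPath();
    }

    // File built from the URI; File(URI) decodes "%20" into a space.
    static String decodedPath(String fileUrl) {
        try {
            return new File(parse(fileUrl).toURI()).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String url = "file:///tmp/my%20file.txt";
        System.out.println(rawPath(url));     // path still contains "%20"
        System.out.println(decodedPath(url)); // path contains a real space
    }
}
```

Only the second form names the file as it exists on disk when the URL
contains escaped characters, which would explain why such files cannot
be opened with the first form.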

good luck


