Posted to user@nutch.apache.org by vs...@free.fr on 2009/04/30 11:14:30 UTC

Is it possible to prevent Nutch 1.0 from indexing local directories?

Hello, 

I am crawling my local file system using Nutch 1.0.

In my URL file, I put:
file://localhost/data/

During the crawl, Nutch parses the directories in order to find outlinks, which is fine:

2009-04-29 15:17:56,468 INFO  fetcher.OldFetcher - fetching file://localhost/data/
2009-04-29 15:17:56,793 DEBUG file.File - fetching file://localhost/data/
2009-04-29 15:18:18,501 DEBUG parse.ParseUtil - Parsing [file://localhost/data/] with [org.apache.nutch.parse.html.HtmlParser@e53220]
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,33% confidence)
2009-04-29 15:18:18,511 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,32% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,29% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,23% confidence)
2009-04-29 15:18:18,512 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,20% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,19% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 16% confidence)
2009-04-29 15:18:18,513 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,16% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-9 (detect, 14% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,514 DEBUG util.EncodingDetector - file://localhost/data/: charset windows-1252 (detect,13% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 11% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset big5 (detect, 10% confidence)
2009-04-29 15:18:18,515 DEBUG util.EncodingDetector - file://localhost/data/: charset x-windows-949 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset euc-jp (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset gb18030 (detect, 10% confidence)
2009-04-29 15:18:18,516 DEBUG util.EncodingDetector - file://localhost/data/: charset shift_jis (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset utf-8 (detect, 10% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 8% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: charset iso-8859-2 (detect, 4% confidence)
2009-04-29 15:18:18,517 DEBUG util.EncodingDetector - file://localhost/data/: Choosing encoding: utf-8 (default)
2009-04-29 15:18:18,553 DEBUG parse.html - Meta tags for file://localhost/data/: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
2009-04-29 15:18:18,560 DEBUG parse.html - found 6 outlinks in file://localhost/data/

However, Nutch also tries to index the directory itself:

2009-04-29 15:20:27,283 DEBUG indexer.Indexer - Indexing [file://localhost/data/] with analyzer org.apache.nutch.analysis.en.EnglishAnalyzer@17c2891 (en)

Is there a way to tell Nutch to find outlinks from directories without trying to index them?

Any help would be greatly appreciated.
Regards, 
Vincent

Re: Is it possible to prevent Nutch 1.0 from indexing local directories?

Posted by Dennis Kubes <ku...@apache.org>.
Without fetching, no.  Without indexing, yes.  You can run the fetcher on 
these directories, then use the webgraph tools to find just the inlinks or 
outlinks.

From the log you posted, it looks like you are using the crawl command, 
which performs the entire stack from fetching and parsing through indexing.  
You can run the commands individually to avoid indexing if you like.

Dennis
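
As an illustration of running the steps individually, here is a minimal sketch that only prints the per-step commands and leaves out the indexing step. It assumes the usual Nutch 1.x subcommand names (inject, generate, fetch, updatedb, invertlinks). The directory names "crawl" and "urls" and the "SEGMENT" placeholder are hypothetical, and the helper function is not part of Nutch. Check the usage output of bin/nutch for your version before running anything, and note that the webgraph tools mentioned above are not covered here.

```python
import shlex


def crawl_steps_without_indexing(crawl_dir="crawl", seed_dir="urls",
                                 segment="SEGMENT"):
    """Return the per-step commands as argv lists, stopping before indexing.

    All paths are illustrative; "SEGMENT" stands for the timestamped
    segment directory that the generate step creates.
    """
    nutch = "bin/nutch"
    crawldb = f"{crawl_dir}/crawldb"
    segments = f"{crawl_dir}/segments"
    linkdb = f"{crawl_dir}/linkdb"
    segment_path = f"{segments}/{segment}"
    return [
        [nutch, "inject", crawldb, seed_dir],             # seed the crawl db
        [nutch, "generate", crawldb, segments],           # create a fetch list
        [nutch, "fetch", segment_path],                   # fetch the segment
        [nutch, "updatedb", crawldb, segment_path],       # record new outlinks
        [nutch, "invertlinks", linkdb, "-dir", segments], # build inlink data
        # The indexing step is deliberately left out.
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in crawl_steps_without_indexing():
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Repeating the generate, fetch and updatedb steps once per crawl depth should roughly mirror what the crawl command does, minus the indexing.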
