Posted to user@nutch.apache.org by oh...@cox.net on 2009/07/16 19:36:40 UTC

Problem crawling local filesystem

Hi,

I'm trying to set up a test using Nutch to crawl the local file system.  This is on a Red Hat system.  I'm basically following the procedure in these links:

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

http://markmail.org/message/pnmqd7ypguh7qtit

Here's my command line:

bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log
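
(As an aside, the crawl db produced by a run like this can be inspected afterwards with the readdb tool; the paths below just reuse the -dir value from the command above, and the dump directory name is arbitrary:)

# Show counts of fetched / unfetched / gone URLs in the crawl db
bin/nutch readdb acrawlfs1.test/crawldb -stats

# Dump the whole crawl db as text into a new directory for a closer look
bin/nutch readdb acrawlfs1.test/crawldb -dump acrawlfs1.dump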


Here's what I get in the Nutch log file:

[root@nssdemo nutch-1.0]# cat acrawlfs1.log
crawl started in: acrawlfs1.test
rootUrlDir = fs-urls
threads = 10
depth = 4
Injector: starting
Injector: crawlDb: acrawlfs1.test/crawldb
Injector: urlDir: fs-urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101523
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: acrawlfs1.test/segments/20090716101523
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching file:///testfiles/ <file:///testfiles/>
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
        at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:///testfiles/ <file:///testfiles/> failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: acrawlfs1.test/crawldb
CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101532
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: acrawlfs1.test/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: acrawlfs1.test/indexes
Dedup: done
merging indexes to: acrawlfs1.test/index
Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
done merging
crawl finished: acrawlfs1.test

Here's my conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>


<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>file.content.limit</name> <value>-1</value>
</property>

</configuration>


and here's my crawl-urlfilter.txt:

[root@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
#skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):

-^(http|ftp|mailto):

#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

#accept anything else
+.*
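
(In case it's useful for testing the rules above: if I remember right, the regex URL filter can be exercised from the command line by piping candidate URLs to the plugin, and it echoes each URL back prefixed with + or - depending on the first matching rule. As far as I recall it reads whatever rules file urlfilter.regex.file points at, which may be regex-urlfilter.txt rather than crawl-urlfilter.txt, so treat this as approximate:)

# Feed a seed URL through the regex URL filter and see whether it comes back accepted (+) or rejected (-)
echo "file:///testfiles/" | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter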

And in the fs-urls directory, my urls file contains:

file:///testfiles/ <file:///testfiles/>

#file:///data/readings/semanticweb/

For this test, I have a /testfiles directory, with a bunch of .txt files under two directories /testfiles/Content1 and /testfiles/Content2.
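
(For reference, the test tree is nothing more elaborate than what something like this would produce; the .txt file names here are just placeholders:)

# Create the two content directories and drop a couple of plain-text files into them
mkdir -p /testfiles/Content1 /testfiles/Content2
echo "some sample text" > /testfiles/Content1/doc1.txt
echo "some more sample text" > /testfiles/Content2/doc2.txt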

It looks like the crawl runs to the end and creates the directories and files under acrawlfs1.test, but when I run Luke on the index directory, I get an error: a popup window with just "0" in it.

Is the problem because of that 404 error in the log?  If so, why am I getting that 404 error?

Thanks,
Jim

Re: Problem crawling local filesystem

Posted by oh...@cox.net.
Hi,

I think that I've found my problem.

It looks like that line in the "urls" file:

file:///testfiles/ <file:///testfiles/>

should have been:

file:///testfiles/

I originally had the bad one because I was following the link that I sent, but I don't know why the author included the "<...>" part :(...

I'm re-running it now.  I don't have the patch/fix to prevent crawling the parent directory yet, so it's taking a while.
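
(For anyone else who copies a seed list from a web page, a quick way to spot and strip that stray "<...>" part, assuming the seed file is fs-urls/urls, is something like:)

# List any seed lines that still carry a pasted "<...>" suffix
grep -n '<' fs-urls/urls

# Strip everything from the first " <" onwards, keeping a .bak backup of the original
sed -i.bak 's/ <.*$//' fs-urls/urls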

Jim


