Posted to user@nutch.apache.org by oh...@cox.net on 2009/07/16 19:36:40 UTC
Problem crawling local filesystem
Hi,
I'm trying to set up a test using Nutch to crawl the local file system. This is on a Red Hat system. I'm basically following the procedure in these links:
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
http://markmail.org/message/pnmqd7ypguh7qtit
Here's my command line:
bin/nutch crawl fs-urls -dir acrawlfs1.test -depth 4 >& acrawlfs1.log
Here's what I get in the Nutch log file:
[root@nssdemo nutch-1.0]# cat acrawlfs1.log
crawl started in: acrawlfs1.test
rootUrlDir = fs-urls
threads = 10
depth = 4
Injector: starting
Injector: crawlDb: acrawlfs1.test/crawldb
Injector: urlDir: fs-urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101523
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: acrawlfs1.test/segments/20090716101523
Fetcher: threads: 10
QueueFeeder finished: total 1 records.
fetching file:///testfiles/ <file:///testfiles/>
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
fetch of file:///testfiles/ <file:///testfiles/> failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: acrawlfs1.test/crawldb
CrawlDb update: segments: [acrawlfs1.test/segments/20090716101523]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: acrawlfs1.test/segments/20090716101532
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Bad protocol in url:
Bad protocol in url: #file:///data/readings/semanticweb/
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: acrawlfs1.test/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/opt/nutch-1.0/acrawlfs1.test/segments/20090716101523
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: acrawlfs1.test/indexes
Dedup: done
merging indexes to: acrawlfs1.test/index
Adding file:/opt/nutch-1.0/acrawlfs1.test/indexes/part-00000
done merging
crawl finished: acrawlfs1.test
Here's my conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>file.content.limit</name> <value>-1</value>
</property>
</configuration>
and here's my crawl-urlfilter.txt:
[root@nssdemo nutch-1.0]# cat conf/crawl-urlfilter.txt
#skip http:, ftp:, & mailto: urls
##-^(file|ftp|mailto):
-^(http|ftp|mailto):
#skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
#skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
#accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#accept anything else
+.*
And in fs-urls, I have a urls file with:
file:///testfiles/ <file:///testfiles/>
#file:///data/readings/semanticweb/
For this test, I have a /testfiles directory, with a bunch of .txt files under two directories /testfiles/Content1 and /testfiles/Content2.
It looks like the crawl runs to the end and creates the directories and files under acrawlfs1.test, but when I run Luke on the index directory, I get an error: a popup window with just "0" in it.
Is the problem because of that 404 error in the log? If so, why am I getting that 404 error?
Thanks,
Jim
Re: Problem crawling local filesystem
Posted by oh...@cox.net.
Hi,
I think that I've found my problem.
It looks like that line in the "urls" file:
file:///testfiles/ <file:///testfiles/>
should have been:
file:///testfiles/
I originally had the bad one because I copied it from the link I posted, but I don't know why the author included the "<...>" part :(...
I'm re-running it now. I don't have the patch/fix to prevent crawling the parent directory yet, so it's taking a while.
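As a side note, a quick way to catch stray characters like that pasted "<...>" fragment before a crawl is a grep over the seed file. This is just an illustrative sketch, assuming the seed file is fs-urls/urls as in the setup above; the pattern simply accepts comment lines, blank lines, and bare file:// URLs, and flags everything else:

```shell
# Reproduce the bad seed line from this thread in a scratch seed file.
mkdir -p fs-urls
printf '%s\n' \
    'file:///testfiles/ <file:///testfiles/>' \
    '#file:///data/readings/semanticweb/' \
    > fs-urls/urls

# Count lines that are NOT a comment, a blank line, or a clean file:// URL.
# Spaces or angle brackets in a URL line (like the pasted "<...>") are flagged.
bad=$(grep -vcE '^(#|file://[^ <>]*$|$)' fs-urls/urls)
echo "bad seed lines: $bad"
```

Run against the seed file from this thread, it reports one bad line (the one with the trailing "<file:///testfiles/>"); the commented-out URL passes as a comment.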
Jim
---- ohaya@cox.net wrote:
> [...]