Posted to user@nutch.apache.org by Ever <ev...@gmx.de> on 2007/05/21 19:09:37 UTC
Crawling Local file System
Hi there,
I have a problem getting the local filesystem crawled by Nutch. My current
setup is as follows: a Nutch trunk version, which compiled cleanly without
any errors; in particular, protocol-file builds fine. I also tried putting
protocol-file.jar on the lib path, but got the same bad result. Is there
anything else I can look for?
Thank you in advance !
regards
========== Log Output====================
bash-3.2$ ./nutch crawl urls.txt -dir crawl -threads 1
crawl started in: crawl
rootUrlDir = urls.txt
threads = 1
depth = 5
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070521184833
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070521184833
Fetcher: threads: 1
fetching file:///C:/temp/test/
fetch of file:///C:/temp/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
....
=====================================
My Configuration:
<configuration>
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>
</description>
</property>
</configuration>
==================
my crawl-urlfilter.txt
# skip ftp: & mailto: urls; accept file: and smb:
-^(ftp|mailto):
+^(file|smb):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*apache.org/
# skip everything else
-.
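Note that Nutch's RegexURLFilter applies these rules top-down and the first match wins, so the `+^(file|smb):` accept rule must come before the catch-all `-.`. A rough sanity check of a URL against that accept pattern with plain grep (not Nutch's actual filter, which uses Java regexes):

```shell
# The URL we want Nutch to accept, taken from urls.txt.
url='file:///C:/temp/test/'
# Emulate the accept rule +^(file|smb): with an extended regex.
if printf '%s\n' "$url" | grep -Eq '^(file|smb):'; then
  echo accepted
else
  echo rejected
fi
```

This prints `accepted`, so the filter file is not the problem here; the failure happens earlier, at protocol lookup.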
==========
My command-line args (obtained by echoing the last line of ./nutch):
C:\Programme\Java_jdk1.5.0_04/bin/java -Xmx1000m
-Dhadoop.log.dir=c:\Dipl\Nutch\logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=c:\Dipl\Nutch\lib\native\Windows_XP-x86-32
-Djava.protocol.handler.pkgs=jcifs -classpath
c:\Dipl\Nutch\conf;C;C:\Programme\Java_jdk1.5.0_04\lib\tools.jar;c:\Dipl\Nutch\build;c:\Dipl\Nutch\build\nutch-1.0-dev.job;c:\Dipl\Nutch\build\test\classes;c:\Dipl\Nutch\nutch-*.job;c:\Dipl\Nutch\lib\commons-cli-2.0-SNAPSHOT.jar;c:\Dipl\Nutch\lib\commons-codec-1.3.jar;c:\Dipl\Nutch\lib\commons-httpclient-3.0.1.jar;c:\Dipl\Nutch\lib\commons-lang-2.1.jar;c:\Dipl\Nutch\lib\commons-logging-1.0.4.jar;c:\Dipl\Nutch\lib\commons-logging-api-1.0.4.jar;c:\Dipl\Nutch\lib\hadoop-0.12.2-core.jar;c:\Dipl\Nutch\lib\jakarta-oro-2.0.7.jar;c:\Dipl\Nutch\lib\jets3t-0.5.0.jar;c:\Dipl\Nutch\lib\jetty-5.1.4.jar;c:\Dipl\Nutch\lib\junit-3.8.1.jar;c:\Dipl\Nutch\lib\log4j-1.2.13.jar;c:\Dipl\Nutch\lib\lucene-core-2.1.0.jar;c:\Dipl\Nutch\lib\lucene-misc-2.1.0.jar;c:\Dipl\Nutch\lib\servlet-api.jar;c:\Dipl\Nutch\lib\taglibs-i18n.jar;c:\Dipl\Nutch\lib\xerces-2_6_2-apis.jar;c:\Dipl\Nutch\lib\xerces-2_6_2.jar;c:\Dipl\Nutch\lib\jetty-ext\ant.jar;c:\Dipl\Nutch\lib\jetty-ext\commons-el.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-compiler.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-runtime.jar;c:\Dipl\Nutch\lib\jetty-ext\jsp-api.jar;C:\Dipl\Nutch\build\protocol-file\protocol-file.jar
org.apache.nutch.crawl.Crawl urls.txt -dir crawl
=========
My Urls.txt
file:///C:/temp/test/
--
View this message in context: http://www.nabble.com/Crawling-Local-file-System-tf3791589.html#a10722948
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Crawling Local file System
Posted by Ever <ev...@gmx.de>.
OK, I've got it.
I was also using the protocol-smb plugin, and there was an error in its
plugin.xml: the plugin id was set to "protocol-file" instead of
protocol-smb, so there were two plugins registered as protocol-file.
Check out
https://issues.apache.org/jira/browse/NUTCH-427
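A quick way to spot this kind of id clash is to count the declared ids across every plugin.xml. This is plain shell, not a Nutch command; against a real checkout you would point it at the plugins (or build/plugins) directory instead of the scratch fixture built here:

```shell
# Recreate the clash in a scratch directory: two plugins, same declared id.
demo=$(mktemp -d)
mkdir -p "$demo/plugins/protocol-file" "$demo/plugins/protocol-smb"
echo '<plugin id="protocol-file" name="File Protocol Plug-in">' > "$demo/plugins/protocol-file/plugin.xml"
echo '<plugin id="protocol-file" name="SMB Protocol Plug-in">'  > "$demo/plugins/protocol-smb/plugin.xml"
# Any id with a count above 1 means two plugins compete for one registration.
grep -ho 'id="[^"]*"' "$demo"/plugins/*/plugin.xml | sort | uniq -c
```

Here the count 2 next to id="protocol-file" flags the duplicate that shadowed the real file protocol plugin.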
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Document : plugin.xml
     Created on : 03 January 2007, 10:41
     Author : Armel T. Nene
     Description:
     This file is used by Nutch to configure the SMB protocol
-->
<plugin id="protocol-file" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">
  <runtime>
    <library name="protocol-smb.jar">
      <export name="*" />
    </library>
    <library name="jcifs-1.2.12.jar" />
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="SMB" />
    </implementation>
  </extension>
</plugin>
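Per the fix described in NUTCH-427, the plugin just needs its own id; the corrected opening tag would presumably look like:

```xml
<!-- The id must be unique across plugins; "protocol-file" is already
     taken by the stock file protocol plugin. -->
<plugin id="protocol-smb" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">
```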
Ever wrote:
> [snip]
--
View this message in context: http://www.nabble.com/Crawling-Local-file-System-tf3791589.html#a10737391
Sent from the Nutch - User mailing list archive at Nabble.com.