Posted to user@nutch.apache.org by Ever <ev...@gmx.de> on 2007/05/21 19:09:37 UTC

Crawling Local file System

Hi there,
I have a problem getting the local filesystem crawled by Nutch. My current
setup is as follows: a Nutch trunk checkout that compiled cleanly without
any errors; in particular, protocol-file builds fine. I also tried adding
protocol-file.jar to the lib path, but got the same bad result. Is there
anything else I can check?

Thank you in advance!

regards 

========== Log Output====================
bash-3.2$ ./nutch crawl urls.txt -dir crawl -threads 1
crawl started in: crawl
rootUrlDir = urls.txt
threads = 1
depth = 5
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070521184833
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070521184833
Fetcher: threads: 1
fetching file:///C:/temp/test/
fetch of file:///C:/temp/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
Fetcher: done
CrawlDb update: starting
....

=====================================

My Configuration:

<configuration>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>

</configuration>
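
For what it's worth, plugin.includes is treated as a regular expression that
plugin names have to match, and a quick standalone check (my own sketch, not
Nutch code) confirms that both protocol plugins are matched by the value
above, so the configuration itself does ask for protocol-file:

import java.util.regex.Pattern;

// Standalone check: plugin.includes is a regex over plugin names. Both
// protocol plugins match, so protocol-file should be loaded from config.
public class PluginIncludesCheck {
    public static void main(String[] args) {
        Pattern includes = Pattern.compile(
            "protocol-file|protocol-smb|urlfilter-crawl|parse-(text|html|js|pdf)"
            + "|index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
            + "|urlnormalizer-(pass|regex|basic)");
        for (String id : new String[] { "protocol-file", "protocol-smb" }) {
            // matches() requires the whole name to match the expression
            System.out.println(id + " included: " + includes.matcher(id).matches());
        }
    }
}

Both lines print true, so the ProtocolNotFound error is not a
plugin.includes problem.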


==================

My crawl-urlfilter.txt:

# skip ftp: and mailto: urls; accept file: and smb:
-^(ftp|mailto):
+^(file|smb):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.
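
For reference, the seed URL does get past these rules. Below is a tiny
standalone sketch of how such a filter file is evaluated (a simplified model,
not the real Nutch implementation): each line is a sign, + for accept and -
for reject, followed by a regex, and the first matching rule decides.

import java.util.regex.Pattern;

// Simplified model of a crawl-urlfilter.txt evaluator (not the actual
// Nutch code): rules are tried in order and the first match decides.
public class UrlFilterSketch {
    public static void main(String[] args) {
        String[] rules = {
            "-^(ftp|mailto):",
            "+^(file|smb):",
            "-\\.(gif|GIF|jpg|JPG|png|PNG)$", // suffix rule, abbreviated here
            "-[?*!@=]",
            "-."                              // skip everything else
        };
        String url = "file:///C:/temp/test/";
        for (String rule : rules) {
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                System.out.println((rule.charAt(0) == '+' ? "accepted by "
                                                          : "rejected by ") + rule);
                return; // first match wins
            }
        }
        System.out.println("no rule matched; rejected by default");
    }
}

This prints "accepted by +^(file|smb):", so the filter passes
file:///C:/temp/test/ and the failure must happen later, at protocol
resolution.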

==========


My command-line arguments (obtained by echoing the last line of ./nutch):
C:\Programme\Java_jdk1.5.0_04/bin/java -Xmx1000m
-Dhadoop.log.dir=c:\Dipl\Nutch\logs -Dhadoop.log.file=hadoop.log
-Djava.library.path=c:\Dipl\Nutch\lib\native\Windows_XP-x86-32
-Djava.protocol.handler.pkgs=jcifs -classpath
c:\Dipl\Nutch\conf;C;C:\Programme\Java_jdk1.5.0_04\lib\tools.jar;c:\Dipl\Nutch\build;c:\Dipl\Nutch\build\nutch-1.0-dev.job;c:\Dipl\Nutch\build\test\classes;c:\Dipl\Nutch\nutch-*.job;c:\Dipl\Nutch\lib\commons-cli-2.0-SNAPSHOT.jar;c:\Dipl\Nutch\lib\commons-codec-1.3.jar;c:\Dipl\Nutch\lib\commons-httpclient-3.0.1.jar;c:\Dipl\Nutch\lib\commons-lang-2.1.jar;c:\Dipl\Nutch\lib\commons-logging-1.0.4.jar;c:\Dipl\Nutch\lib\commons-logging-api-1.0.4.jar;c:\Dipl\Nutch\lib\hadoop-0.12.2-core.jar;c:\Dipl\Nutch\lib\jakarta-oro-2.0.7.jar;c:\Dipl\Nutch\lib\jets3t-0.5.0.jar;c:\Dipl\Nutch\lib\jetty-5.1.4.jar;c:\Dipl\Nutch\lib\junit-3.8.1.jar;c:\Dipl\Nutch\lib\log4j-1.2.13.jar;c:\Dipl\Nutch\lib\lucene-core-2.1.0.jar;c:\Dipl\Nutch\lib\lucene-misc-2.1.0.jar;c:\Dipl\Nutch\lib\servlet-api.jar;c:\Dipl\Nutch\lib\taglibs-i18n.jar;c:\Dipl\Nutch\lib\xerces-2_6_2-apis.jar;c:\Dipl\Nutch\lib\xerces-2_6_2.jar;c:\Dipl\Nutch\lib\jetty-ext\ant.jar;c:\Dipl\Nutch\lib\jetty-ext\commons-el.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-compiler.jar;c:\Dipl\Nutch\lib\jetty-ext\jasper-runtime.jar;c:\Dipl\Nutch\lib\jetty-ext\jsp-api.jar;C:\Dipl\Nutch\build\protocol-file\protocol-file.jar
org.apache.nutch.crawl.Crawl urls.txt -dir crawl


=========
My urls.txt:

file:///C:/temp/test/




Re: Crawling Local file System

Posted by Ever <ev...@gmx.de>.
OK, I got it.

I was also using the protocol-smb plugin, and there was an error in its
plugin.xml: the plugin id was set to "protocol-file" instead of
"protocol-smb", so there were two plugins with the id protocol-file.

Check out

https://issues.apache.org/jira/browse/NUTCH-427
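
To see why a duplicate id makes the file protocol disappear, here is a
minimal self-contained sketch (my own illustration, not the actual Nutch
source) of a registry keyed by plugin id:

import java.util.HashMap;
import java.util.Map;

// Illustration only (not the Nutch source): when two plugin descriptors
// declare the same id, a registry keyed by id keeps only one of them.
public class DuplicateIdSketch {
    public static void main(String[] args) {
        // plugin id -> URL scheme the plugin serves
        Map<String, String> registry = new HashMap<String, String>();
        registry.put("protocol-file", "file");
        // protocol-smb's plugin.xml wrongly declares id="protocol-file",
        // so this put() overwrites the real protocol-file entry:
        registry.put("protocol-file", "smb");

        // Resolving the "file" scheme now fails, which surfaces as
        // org.apache.nutch.protocol.ProtocolNotFound: ... url=file
        System.out.println(registry.containsValue("file")
                ? "file protocol found"
                : "protocol not found for url=file");
    }
}

The plugin.xml that shipped with the SMB plugin, with the wrong id, looked
like this: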


<?xml version="1.0" encoding="UTF-8" ?>
<!--
    Document   : plugin.xml
    Created on : 03 January 2007, 10:41
    Author     : Armel T. Nene
    Description:
        This file is used by Nutch to configure the SMB protocol
-->
<plugin id="protocol-file" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">
  <runtime>
    <library name="protocol-smb.jar">
      <export name="*" />
    </library>
    <library name="jcifs-1.2.12.jar" />
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="SMB" />
    </implementation>
  </extension>
</plugin>
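
The fix is simply to give the SMB plugin its own id, so the opening tag
presumably should read:

<plugin id="protocol-smb" name="SMB Protocol Plug-in" version="1.0.0"
        provider-name="iDNA Solutions LTD">

With distinct ids, both protocol-file and protocol-smb register, and file:
URLs resolve again.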


