Posted to user@nutch.apache.org by opoole <op...@pascall.co.uk> on 2007/05/24 15:08:11 UTC

WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Hi All, I hope you can help as I am becoming rather depressed with Nutch on
Windows.

Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from cygwin
site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0

I cannot stop Nutch from crawling parent directories. I have looked at other
threads, but none of the suggestions seem to work.

I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting for
Java syntax corrections.

Below I have listed my configurations along with the command I type in
cygwin for jcifs:

CRAWL-URLFILTER
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip http:, ftp:, & mailto: urls
-^(http|ftp|mailto):
+^(file|smb):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/

# skip everything else
-.

NUTCH-SITE

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<nutch-conf>

<property>
 <name>http.agent.name</name>
 <value>pascall</value>
 <description></description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
truncated;
  otherwise, no truncation at all.
  </description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>By default the crawler is not restricted to the directories
    specified in the urls file; it also jumps into the parent directories.
    For your own crawls you can change this behavior (set to false) so that
    only directories beneath the ones you specify get crawled.</description>
</property>

<property>
<name>plugin.includes</name> 
<value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property> 

</nutch-conf>
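
The crawl log further down repeatedly reports "bad conf file: top-level element not <configuration>", while the file above uses <nutch-conf> as its root element. The following is a minimal sketch for checking the root element of a conf file before starting a crawl. The class name ConfRootCheck and the shortened sample XML are illustrative assumptions, not part of Nutch; the expected name "configuration" is taken from the log message only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Nutch: reports the root element of an XML conf file.
class ConfRootCheck {
    /** Returns the tag name of the top-level element of the given XML text. */
    static String rootElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the posted nutch-site.xml.
        String posted = "<?xml version=\"1.0\"?><nutch-conf><property>"
                + "<name>file.crawl.parent</name><value>false</value>"
                + "</property></nutch-conf>";
        String root = rootElement(posted);
        // "configuration" is the name the log message asks for.
        System.out.println("root element: " + root
                + ("configuration".equals(root) ? " (ok)" : " (log expects <configuration>)"));
    }
}
```

Whether the properties are still applied despite the FATAL message is not something this check can tell; it only confirms which root element the file actually has.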

CYGWIN

Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\

java -Djava.protocol.handler.pkgs=jcifs

When I press return the cygwin shell displays a list of java commands as
though I am using incorrect syntax.
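
The usage listing is expected here: -D only sets a system property, and the java launcher still needs a main class or a jar to run, so the option has to be part of the command that actually starts the program. As documented for java.net.URL, the property java.protocol.handler.pkgs is a '|'-separated list of package prefixes, and for a scheme such as smb the JVM tries to load <package>.smb.Handler. The sketch below lists those candidate class names and checks whether they can be loaded; the class name HandlerLookup is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows which handler classes the property points at.
class HandlerLookup {
    /** Class names the JVM would try for a scheme, given a '|'-separated package list. */
    static List<String> candidates(String pkgs, String scheme) {
        List<String> names = new ArrayList<>();
        for (String pkg : pkgs.split("\\|")) {
            if (!pkg.trim().isEmpty()) {
                names.add(pkg.trim() + "." + scheme + ".Handler");
            }
        }
        return names;
    }

    /** True if the named class can be loaded from the current classpath. */
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, HandlerLookup.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Normally supplied on the command line as -Djava.protocol.handler.pkgs=jcifs
        String pkgs = System.getProperty("java.protocol.handler.pkgs", "jcifs");
        for (String name : candidates(pkgs, "smb")) {
            System.out.println(name + " loadable: " + onClasspath(name));
        }
    }
}
```

Run it the same way the property would be passed to any other program, for example: java -Djava.protocol.handler.pkgs=jcifs HandlerLookup. If the class is reported as not loadable, the jcifs jar is probably missing from the classpath.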

Dump of Crawl from Cygwin:

2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered Plugins:
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
extension points (nutch-extensionpoints)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint Parse
Plug-in (parse-mspowerpoint)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query Filter
(query-basic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
Filter (index-basic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse Plug-in
(parse-html)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
(parse-pdf)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query Filter
(query-site)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI - Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse Plug-in
(parse-text)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
Plug-in (parse-msword)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
Plug-in (protocol-smb)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
Plug-in (parse-msexcel)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
Plug-in (scoring-opic)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
Parser (lib-nekohtml)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
Plug-in (protocol-file)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query Filter
(query-url)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS Documents
Framework (lib-parsems)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
Extension-Points:
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
Parser (org.apache.nutch.parse.Parser)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
Loader (org.apache.nutch.ontology.Ontology)
2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
crawl/segments/20070524140420
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering: false
2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN: 2147483647
2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
'smb://sql1/Sales/DATA/'
2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
'smb://sql1/Sales/DATA/'
2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
crawl/segments/20070524140420
2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
smb://sql1/Sales/DATA/
2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
smb://sql1/Sales/DATA/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException:
unknown protocol: smb
2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
file:///C:/Policies/
2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
crawl/crawldb
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
[crawl/segments/20070524140420]
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
allowed: true
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
normalizing: true
2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL filtering:
true
2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
segment data into db.
2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file: top-level
element not <configuration>
2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch-0.9\plugins
2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]


Thank you for reading my post, hope you can help.

Regards,

Oli
-- 
View this message in context: http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10783382
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by bikram <bi...@yahoo.com>.
Hi Vadim B,

I am getting the same error:

org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

Were you able to fix this error?

If so, could you please tell me what you did to clear it?

I have already posted all the details here:

http://www.nabble.com/Windows-Share-Crawling---searching-tf4277499.html#a12175266

I am using Linux, not Cygwin on Windows.

Thanks,
Bikram


Hi,

I am working on the same issue as you. So far I can crawl file:///C:/*, but
I am stuck on the smb part. It looks to me as though this plugin isn't working
properly and needs to be fixed for the newer version of Nutch.

The error I get differs a bit from yours:

2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
smb://mobidick/test/
2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
smb://mobidick/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

I will dig into the protocol-smb plugin and try to narrow down the problem.
Maybe we can work together to get a quick solution.



---SNIP---

# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/ <<-- Why did you put this here? It has no effect,
because the +^(file|smb): line above already matches, so this rule is never
reached.
---SNIP ---
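
The point about rule order can be sketched in a few lines. The filter file's own comments say the first matching pattern decides and that a URL matching nothing is ignored. The code below models only that rule and is not the actual Nutch filter implementation; the class name FirstMatchFilter, the use of an unanchored regex search, and the "narrowed" rule set are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical model of first-match-wins URL filtering; not the Nutch implementation.
class FirstMatchFilter {
    /** Rules in file order: a leading '+' accepts, a leading '-' rejects. */
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            // Unanchored search (assumed); the rules carry their own ^ and $ anchors.
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept; // the first matching rule decides
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    public static void main(String[] args) {
        // Rule order as posted in the thread (suffix and loop rules omitted).
        String[] posted = { "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-." };
        // Illustrative variant: file URLs are only accepted under C:/Policies/.
        String[] narrowed = { "-^(http|ftp|mailto):", "+^smb:", "+^file:///C:/Policies/", "-." };
        String parent = "file:///C:/";
        System.out.println("posted rules accept " + parent + ": " + accepts(posted, parent));
        System.out.println("narrowed rules accept " + parent + ": " + accepts(narrowed, parent));
    }
}
```

With the posted order, the parent directory is accepted by the broad +^(file|smb): rule before the Policies rule is ever consulted. In the narrowed variant the parent falls through to the final "-." rule. Whether that alone stops the parent crawl in Nutch has not been confirmed in this thread.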

---SNIP ---
2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
'smb://sql1/Sales/DATA/' 
// Did you quote the URL yourself, or is it displayed like this in the logs? I
// don't get this error.
---SNIP ---

Try this in the class org.apache.nutch.crawl.Crawl:

  public static void main(String args[]) throws Exception {
    System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new
    LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs")); // new
    LOG.info("SMB Info: " + new java.util.PropertyPermission(
        "java.protocol.handler.pkgs", "read,write").toString()); // new
    if (args.length < 1) {
      System.out.println
        ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
      return;
    }
---SNIP---
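
To see outside of Nutch whether the property takes effect, a small stand-alone probe can help. It sets the same system property and then asks java.net.URL to parse an smb URL. Without a loadable handler class this fails with MalformedURLException ("unknown protocol: smb"), the same exception that appears in the fetch log earlier in the thread. The class name SchemeProbe is an illustrative assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical stand-alone probe; not part of Nutch or jcifs.
class SchemeProbe {
    /** False if java.net.URL rejects the spec, e.g. because no handler exists for its scheme. */
    @SuppressWarnings("deprecation") // the URL(String) constructor is deprecated in recent JDKs
    static boolean hasHandler(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Same property as in the snippet above; it must be set before the first smb URL is built.
        System.setProperty("java.protocol.handler.pkgs", "jcifs");
        String[] urls = { "file:///C:/Policies/", "smb://sql1/Sales/DATA/" };
        for (String u : urls) {
            System.out.println(u + " -> handler found: " + hasHandler(u));
        }
    }
}
```

If the smb line still reports no handler with the jcifs jar on the classpath, the problem is more likely in how the classpath is assembled than in the property itself.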

Check this out:
http://java.sun.com/developer/onlineTraining/protocolhandlers/







Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by opoole <op...@pascall.co.uk>.
Hi,

Thanks for your help with this. Someone emailed me to say that this is fixed
by a new version of the jcifs implementation:

https://issues.apache.org/jira/browse/NUTCH-427

Give it a go and let me know if it works ;)


Vadim B wrote:
> 
> Hi,
> 
> I am working on the same issue as you. So far I can crawl file:///C:/*, but
> I am stuck on the smb part. It looks to me as though this plugin isn't
> working properly and needs to be fixed for the newer version of Nutch.
> 
> The error I get differs a bit from yours:
> 
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
> smb://mobidick/test/
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
> smb://mobidick/test/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
> 
> I will dive into the plugin-smb and try out to narrow the problem Maybe we
> can work together to get a quick solution.
> 
> 
> 
> ---SNIP---
> 
> # accept hosts in MY.DOMAIN.NAME
> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- why you put it here it doesn't make sense
> because the +^(file|smb) line above is already fitting so this will be
> skipped 
> ---SNIP ---
> 
> ---SNIP ---
> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/' 
> //did you cuoted the url or is it displayed in the logs like this? I dont
> get this error 
> ---SNIP ---
> 
> try this  in package org.apache.nutch.crawl.Crawl
> 
>   public static void main(String args[]) throws Exception {
> 	  System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new 
> 	  LOG.info("SMB Info: " +
> System.getProperty("java.protocol.handler.pkgs")); //new 
> 	  LOG.info("SMB Info: " +  new
> java.util.PropertyPermission("java.protocol.handler.pkgs","read,
> write").toString());//new 
> 	  if (args.length < 1) {
>       System.out.println
>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN
> N]");
>       return;
>     }
> ---SNIP---
> 
> check out this:
> http://java.sun.com/developer/onlineTraining/protocolhandlers/
> 
> 
> 
> 
> 
> opoole wrote:
>> 
>> Hi All, I hope you can help as I am becomming rather depressed with Nutch
>> on Windows.
>> 
>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>> 
>> I cannot stop Nutch from crawling parent directories, I have looked at
>> other threads and none seem to work.
>> 
>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>> for Java syntax corrections.
>> 
>> Below I have listed my configurations along with the command I type in
>> cygwin for jcifs:
>> 
>> CRAWL-URLFILTER
>> # The url filter file used by the crawl command.
>> 
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>> 
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'.  The first matching pattern in the file
>> # determines whether a URL is included or ignored.  If no pattern
>> # matches, the URL is ignored.
>> 
>> # skip file:, ftp:, & mailto: urls
>> -^(http|ftp|mailto):
>> +^(file|smb):
>> 
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> 
>> # skip URLs containing certain characters as probable queries, etc.
>> 
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break
>> loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> 
>> # accept hosts in MY.DOMAIN.NAME
>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- why you put it here it doesnt make sese 
>> because the +^(file|smb) is already fitting !
>> 
>> # skip everything else
>> -.
>> 
>> NUTCH-SITE
>> 
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> 
>> <nutch-conf>
>> 
>> <property>
>>  <name>http.agent.name</name>
>>  <value>pascall</value>
>>  <description></description>
>> </property>
>> 
>> <property>
>>   <name>file.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content, in bytes.
>>   If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>>   otherwise, no truncation at all.
>>   </description>
>> </property>
>> 
>> <property>
>>   <name>file.crawl.parent</name>
>>   <value>false</value>
>>   <description>The crawler is not restricted to the directories that you
>> specified in the
>>     Urls file but it is jumping into the parent directories as well. For
>> your own crawlings you can
>>     change this bahavior (set to false) the way that only directories
>> beneath the directories that you specify get
>>     crawled.</description>
>> </property>
>> 
>> <property>
>> <name>plugin.includes</name> 
>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>> </property> 
>> 
>> </nutch-conf>
>> 
>> CYGWIN
>> 
>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>> 
>> java -Djava.protocol.handler.pkgs=jcifs
>> 
>> When I press return the cygwin shell displays a list of java commands as
>> though I am using incorrect syntax.
>> 
>> Dump of Crawl from Cygwin:
>> 
>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>> crawl/crawldb
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>> injected urls to crawl db entries.
>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>> best-scoring urls due for fetch.
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>> false
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>> 2147483647
>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>> smb://sql1/Sales/DATA/
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>> smb://sql1/Sales/DATA/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound:
>> java.net.MalformedURLException: unknown protocol: smb
>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>> file:///C:/Policies/
>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>> impl: org.apache.nutch.crawl.MD5Signature
>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>> crawl/crawldb
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>> [crawl/segments/20070524140420]
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>> allowed: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> normalizing: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> filtering: true
>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 
>> 
>> Thank you for reading my post, hope you can help.
>> 
>> Regards,
>> 
>> Oli
>> 
> 
> 

-- 
View this message in context: http://www.nabble.com/WIN-XP-PRO--Djava.protocol*-file%3A---c%3A-folder--Crawling-Parents-tf3809966.html#a10851108
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by Vadim B <Ma...@unterderbruecke.de>.
OK, try this.

As you can see in the attached files, the two filters have the same entry. I
don't know exactly why it has to be in both when one should be enough, but
this keeps me from crawling the parent dir as well.
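
The filter file comments quoted further down say that the first matching
pattern decides whether a URL is included. A minimal sketch of that rule is
below. It is not Nutch's own filter class; the class name FirstMatchUrlFilter
and the use of find() for matching are assumptions made for illustration. It
shows why a broad +^(file|smb): line placed early lets file:///C:/ through no
matter what follows it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of "first matching pattern wins"; not Nutch code.
class FirstMatchUrlFilter {

    /** One filter line: '+' (accept) or '-' (reject) followed by a regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Builds the filter from lines as they would appear in a filter file. */
    FirstMatchUrlFilter(String... lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comments carry no rule
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rule order as posted: the broad file|smb line comes before the folder line.
        FirstMatchUrlFilter posted = new FirstMatchUrlFilter(
            "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Same file without the broad line, so only the folder line can accept.
        FirstMatchUrlFilter restricted = new FirstMatchUrlFilter(
            "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println("posted accepts parent:     " + posted.accepts("file:///C:/"));
        System.out.println("restricted accepts parent: " + restricted.accepts("file:///C:/"));
        System.out.println("restricted accepts folder: " + restricted.accepts("file:///C:/Policies/"));
    }
}
```

Note that dropping the broad line also rejects smb: URLs, so an smb crawl
would need its own narrow accept line.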

Check the nutch-site.xml: if I put .* there it isn't working in my case, so I
have to list the plugins I really need.
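
While checking nutch-site.xml, note that the quoted crawl log keeps reporting
"bad conf file: top-level element not <configuration>", and the posted file
uses <nutch-conf> as its root element. A minimal sketch of such a root-element
check, using only the JDK's XML parser, is below. The class ConfRootCheck and
its methods are hypothetical names and this is not the code Nutch itself runs;
it only shows how to confirm which root element a config file has.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper; checks the top-level element name of a config document.
class ConfRootCheck {

    /** Returns the tag name of the document's top-level element. */
    static String rootName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            // Parser setup, I/O and malformed XML all end up here.
            throw new IllegalArgumentException("not parseable as XML", e);
        }
    }

    /** True when the top-level element is <configuration>, as the log message expects. */
    static boolean hasConfigurationRoot(String xml) {
        return "configuration".equals(rootName(xml));
    }

    public static void main(String[] args) {
        String posted = "<nutch-conf><property><name>http.agent.name</name>"
            + "<value>pascall</value></property></nutch-conf>";
        String renamed = posted.replace("nutch-conf", "configuration");

        System.out.println("posted root:  " + rootName(posted));
        System.out.println("posted ok:    " + hasConfigurationRoot(posted));
        System.out.println("renamed ok:   " + hasConfigurationRoot(renamed));
    }
}
```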

Also check out my new SMB protocol plugin (attached below).


-Djava stuff

Copy jcifs to
C:\Program Files\Java_jdk1.6.0_01\jre\lib\ext (in my case)

Add the following to the main method of Crawl.java (the "// new" lines):

  /* Perform complete crawling and indexing given a set of root urls. */
  public static void main(String args[]) throws Exception {
    // new: register the jcifs package as a URL protocol handler package
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    // new: log the property to confirm it was set
    LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
    // new: log the matching PropertyPermission
    LOG.info("SMB Info: " + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write").toString());
    if (args.length < 1) {
...and so on....

Then you don't need to set the -Djava.. properties before starting the app.
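
One caveat with a plain System.setProperty call is that it replaces whatever
value the property already had. The JDK treats java.protocol.handler.pkgs as a
"|"-separated list of packages, so a slightly more careful sketch appends to
the list instead. The class HandlerPkgs and its method names are hypothetical,
and smb://host/share/ is a placeholder URL; whether the smb handler is actually
found depends on jcifs being visible to the JVM as described above.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper; appends a package to java.protocol.handler.pkgs.
class HandlerPkgs {

    static final String KEY = "java.protocol.handler.pkgs";

    /** Returns the '|'-separated list with pkg added, unless it is already present. */
    static String withPackage(String existing, String pkg) {
        if (existing == null || existing.trim().isEmpty()) {
            return pkg;
        }
        for (String entry : existing.split("\\|")) {
            if (entry.trim().equals(pkg)) {
                return existing; // already registered, leave the list untouched
            }
        }
        return existing + "|" + pkg;
    }

    /** Updates the system property in place, keeping any packages already listed. */
    static void register(String pkg) {
        System.setProperty(KEY, withPackage(System.getProperty(KEY), pkg));
    }

    public static void main(String[] args) {
        register("jcifs");
        System.out.println(KEY + " = " + System.getProperty(KEY));
        try {
            // Constructing the URL triggers the handler lookup for the "smb" scheme.
            new URL("smb://host/share/");
            System.out.println("smb handler found");
        } catch (MalformedURLException e) {
            // Same kind of failure as "unknown protocol: smb" in the quoted log.
            System.out.println("smb handler not found: " + e.getMessage());
        }
    }
}
```

Running it once with and once without jcifs available is a quick way to see
whether the handler is picked up at all, before involving Nutch.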


good luck 




http://www.nabble.com/file/p11047384/protocol-smb.zip protocol-smb.zip 

http://www.nabble.com/file/p11047384/regex-urlfilter.txt regex-urlfilter.txt 
http://www.nabble.com/file/p11047384/crawl-urlfilter.txt crawl-urlfilter.txt 
http://www.nabble.com/file/p11047384/nutch-site.xml nutch-site.xml 

opoole wrote:
> 
> Hi Vadim,
> 
> To be honest I am somewhat behind you as my problem is that I cannot get
> the SMB protocol setup, I am unable to get the -djava bit to do anything,
> I am using cygwin and entering the command from within sun\java etc.
> 
> As for crawl speed, I'd love to get that far.
> 
> Also I noticed that you were crawling from the root of C:\ whereas I want
> to crawl a specific folder and the parent directory issue crops up, I
> cannot get it to stop crawling the parent.  One thing I had noticed is
> that I did not have a URLFILTER entry in my nutch-config.xml and that
> makes a difference in that if I try to set it up as in the tutorial it
> won't crawl a thing??!!
> 
> Sorry I cannot be of help but I feel somewhat behind you in terms of Nutch
> dev, I am thinking of trying Nutch using ver 8 instead of 9 as there is
> more documented on it although I have read that it is slow, half the speed
> of ver 9 in terms of crawl speed, are you using ver 8?
> 
> Regards,
> 
> Oli
> 
> 
> Vadim B wrote:
>> 
>> Could you solve the problem? 
>> 
>> I get about 800kb/s as transfer speed, which is not fast enough to use in
>> a production environment. What about you?
>> 
>> 
>> 
>> opoole wrote:
>>> 
>>> Sorry Vadim,
>>> 
>>> I did not realise you had sent me the email [Doh!].
>>> 
>>> 
>>> Vadim B wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I am working on the same issue as you. So far I could crawl
>>>> file:///C:/* but I am stuck on the smb part. It looks to me like this
>>>> plugin isn't working properly, so it needs to be fixed for the newer
>>>> version of Nutch.
>>>> 
>>>> The error I get differs a bit from yours. It is:
>>>> 
>>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
>>>> smb://mobidick/test/
>>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
>>>> smb://mobidick/test/ failed with:
>>>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>>>> url=smb
>>>> 
>>>> I will dive into the plugin-smb and try to narrow down the problem.
>>>> Maybe we can work together to get a quick solution.
>>>> 
>>>> 
>>>> 
>>>> ---SNIP---
>>>> 
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>> +^file:///C:/Policies/ <<-- why did you put it here? It doesn't make
>>>> sense, because the +^(file|smb): line above already matches, so this
>>>> line is never reached
>>>> ---SNIP ---
>>>> 
>>>> ---SNIP ---
>>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/' 
>>>> // did you quote the URL, or is it displayed in the logs like this? I
>>>> don't get this error
>>>> ---SNIP ---
>>>> 
>>>> Try this in class org.apache.nutch.crawl.Crawl
>>>> 
>>>>   public static void main(String args[]) throws Exception {
>>>>     System.setProperty("java.protocol.handler.pkgs", "jcifs"); // new
>>>>     LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs")); // new
>>>>     LOG.info("SMB Info: " + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write").toString()); // new
>>>>     if (args.length < 1) {
>>>>       System.out.println
>>>>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>>>       return;
>>>>     }
>>>> ---SNIP---
>>>> 
>>>> check out this:
>>>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> opoole wrote:
>>>>> 
>>>>> Hi All, I hope you can help as I am becoming rather depressed with
>>>>> Nutch on Windows.
>>>>> 
>>>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>>>> 
>>>>> I cannot stop Nutch from crawling parent directories, I have looked at
>>>>> other threads and none seem to work.
>>>>> 
>>>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps
>>>>> prompting for Java syntax corrections.
>>>>> 
>>>>> Below I have listed my configurations along with the command I type in
>>>>> cygwin for jcifs:
>>>>> 
>>>>> CRAWL-URLFILTER
>>>>> # The url filter file used by the crawl command.
>>>>> 
>>>>> # Better for intranet crawling.
>>>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>>> 
>>>>> # Each non-comment, non-blank line contains a regular expression
>>>>> # prefixed by '+' or '-'.  The first matching pattern in the file
>>>>> # determines whether a URL is included or ignored.  If no pattern
>>>>> # matches, the URL is ignored.
>>>>> 
>>>>> # skip http:, ftp:, & mailto: urls; accept file: and smb: urls
>>>>> -^(http|ftp|mailto):
>>>>> +^(file|smb):
>>>>> 
>>>>> # skip image and other suffixes we can't yet parse
>>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>>>> 
>>>>> # skip URLs containing certain characters as probable queries, etc.
>>>>> 
>>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to
>>>>> # break loops
>>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>> 
>>>>> # accept hosts in MY.DOMAIN.NAME
>>>>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>>> +^file:///C:/Policies/ <<-- why did you put it here? It doesn't make
>>>>> sense, because the +^(file|smb): line above already matches!
>>>>> 
>>>>> # skip everything else
>>>>> -.
>>>>> 
>>>>> NUTCH-SITE
>>>>> 
>>>>> <?xml version="1.0"?>
>>>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>>>> <!-- Put site-specific property overrides in this file. -->
>>>>> 
>>>>> <nutch-conf>
>>>>> 
>>>>> <property>
>>>>>  <name>http.agent.name</name>
>>>>>  <value>pascall</value>
>>>>>  <description></description>
>>>>> </property>
>>>>> 
>>>>> <property>
>>>>>   <name>file.content.limit</name>
>>>>>   <value>-1</value>
>>>>>   <description>The length limit for downloaded content, in bytes.
>>>>>   If this value is nonnegative (>=0), content longer than it will be
>>>>> truncated;
>>>>>   otherwise, no truncation at all.
>>>>>   </description>
>>>>> </property>
>>>>> 
>>>>> <property>
>>>>>   <name>file.crawl.parent</name>
>>>>>   <value>false</value>
>>>>>   <description>The crawler is not restricted to the directories that
>>>>> you specified in the
>>>>>     Urls file but it is jumping into the parent directories as well.
>>>>> For your own crawlings you can
>>>>>     change this behavior (set to false) the way that only directories
>>>>> beneath the directories that you specify get
>>>>>     crawled.</description>
>>>>> </property>
>>>>> 
>>>>> <property>
>>>>> <name>plugin.includes</name> 
>>>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>>>> </property> 
>>>>> 
>>>>> </nutch-conf>
>>>>> 
>>>>> CYGWIN
>>>>> 
>>>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>>>> 
>>>>> java -Djava.protocol.handler.pkgs=jcifs
>>>>> 
>>>>> When I press return the cygwin shell displays a list of java commands
>>>>> as though I am using incorrect syntax.
>>>>> 
>>>>> Dump of Crawl from Cygwin:
>>>>> 
>>>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>>>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>>>>> crawl/crawldb
>>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir:
>>>>> urls.txt
>>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>>>>> injected urls to crawl db entries.
>>>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging
>>>>> injected urls into crawl db.
>>>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>> where applicable
>>>>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>>>>> best-scoring urls due for fetch.
>>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>>>>> crawl/segments/20070524140420
>>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>>>>> false
>>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>>>>> 2147483647
>>>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker
>>>>> is 'local', generating exactly one partition.
>>>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed
>>>>> URL: 'smb://sql1/Sales/DATA/'
>>>>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator:
>>>>> Partitioning selected urls by host, for politeness.
>>>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed
>>>>> URL: 'smb://sql1/Sales/DATA/'
>>>>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>>>>> crawl/segments/20070524140420
>>>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>>>>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>>>>> smb://sql1/Sales/DATA/
>>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>>>>> smb://sql1/Sales/DATA/ failed with:
>>>>> org.apache.nutch.protocol.ProtocolNotFound:
>>>>> java.net.MalformedURLException: unknown protocol: smb
>>>>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>>>>> file:///C:/Policies/
>>>>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>>>>> impl: org.apache.nutch.crawl.MD5Signature
>>>>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>>> Plugins:
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch
>>>>> core extension points (nutch-extensionpoints)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>>> Parse Plug-in (parse-mspowerpoint)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
>>>>> Filter (query-basic)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic
>>>>> Indexing Filter (index-basic)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
>>>>> Plug-in (parse-html)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse
>>>>> Plug-in (parse-pdf)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query
>>>>> Filter (query-site)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
>>>>> Plug-in (parse-text)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
>>>>> Plug-in (parse-msword)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
>>>>> Plug-in (protocol-smb)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>>> Plug-in (parse-msexcel)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>>> Plug-in (scoring-opic)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko
>>>>> HTML Parser (lib-nekohtml)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j
>>>>> (lib-log4j)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
>>>>> Plug-in (protocol-file)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query
>>>>> Filter (query-url)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
>>>>> Documents Framework (lib-parsems)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>>> Extension-Points:
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>>>> Protocol (org.apache.nutch.protocol.Protocol)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>>>> Analysis (org.apache.nutch.analysis.NutchAnalyzer)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>>>> Filter (org.apache.nutch.net.URLFilter)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>>>> Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
>>>>> Search Results Clustering Plugin
>>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse
>>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
>>>>> Parser (org.apache.nutch.parse.Parser)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
>>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology
>>>>> Model Loader (org.apache.nutch.ontology.Ontology)
>>>>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>>>>> crawl/crawldb
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update:
>>>>> segments: [crawl/segments/20070524140420]
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update:
>>>>> additions allowed: true
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>>> normalizing: true
>>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>>> filtering: true
>>>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>>>> segment data into db.
>>>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>>>> top-level element not <configuration>
>>>>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins:
>>>>> looking in: C:\nutch-0.9\plugins
>>>>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>>>>> Auto-activation mode: [true]
>>>>> 
>>>>> 
>>>>> Thank you for reading my post, hope you can help.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Oli
>>>>> 



Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by opoole <op...@pascall.co.uk>.
Hi Vadim,

To be honest I am somewhat behind you: my problem is that I cannot get the
SMB protocol set up at all. I cannot get the -Djava.protocol.handler.pkgs
option to do anything. I am using Cygwin and entering the command from
within the Sun\Java directory.
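
A note on the -D option: it is an argument to the java launcher, so typing
"java -Djava.protocol.handler.pkgs=jcifs" with no class or jar after it only
prints the launcher's usage text. The property matters only inside the JVM
that actually runs the crawl, and the jcifs classes have to be loadable by
that JVM as well. The following is a minimal sketch for checking whether a
given JVM can resolve smb:// URLs. The class name ProtocolHandlerCheck and
the helper isSupported are made up for illustration; they are not part of
Nutch or jcifs.

```java
import java.net.MalformedURLException;
import java.net.URI;

public class ProtocolHandlerCheck {

    /** Returns true if this JVM can find a URL stream handler for the URL's scheme. */
    static boolean isSupported(String spec) {
        try {
            // toURL() looks up a protocol handler and fails if none is registered.
            URI.create(spec).toURL();
            return true;
        } catch (MalformedURLException e) {
            // Raised with "unknown protocol: <scheme>" when no handler is found.
            return false;
        }
    }

    public static void main(String[] args) {
        // Same effect as starting the JVM with -Djava.protocol.handler.pkgs=jcifs.
        // It only helps if the jcifs classes are on this JVM's classpath.
        System.setProperty("java.protocol.handler.pkgs", "jcifs");

        String[] specs = {"file:///C:/Policies/", "smb://sql1/Sales/DATA/"};
        for (String spec : specs) {
            System.out.println(spec + " -> "
                + (isSupported(spec) ? "handler found" : "unknown protocol"));
        }
    }
}
```

If jcifs is not on the classpath, the smb:// URL is reported as an unknown
protocol even with the property set, which is the same MalformedURLException
the fetcher logs.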

As for crawl speed, I'd love to get that far.

Also, I noticed that you were crawling from the root of C:\, whereas I want
to crawl a specific folder, and that is where the parent directory issue
crops up: I cannot get it to stop crawling the parent. One thing I noticed is
that I did not have a URLFILTER entry in my nutch-config.xml, and that makes
a difference: if I set it up as in the tutorial, it won't crawl a thing.
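
The filter file quoted below says that the first matching pattern decides
whether a URL is kept, so rule order matters for the parent directory
problem. Here is a minimal sketch of that evaluation, assuming each pattern
is applied as an unanchored regex search. The class FirstMatchFilter and the
rule lists POSTED and NARROWED are illustrative only; this is not Nutch's
own filter code.

```java
import java.util.regex.Pattern;

public class FirstMatchFilter {

    // Rules in file order, as posted: the broad +^(file|smb): rule comes first.
    static final String[] POSTED = {
        "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-."
    };

    // Hypothetical variant without the broad rule (this also drops smb: URLs).
    static final String[] NARROWED = {
        "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-."
    };

    /** Applies rules in order; the first rule whose regex matches decides. */
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false; // no pattern matched: the URL is ignored
    }

    public static void main(String[] args) {
        String parent = "file:///C:/";
        String child = "file:///C:/Policies/handbook.doc";
        System.out.println("posted,   parent: " + accepts(POSTED, parent));
        System.out.println("narrowed, parent: " + accepts(NARROWED, parent));
        System.out.println("narrowed, child:  " + accepts(NARROWED, child));
    }
}
```

With the posted order, the parent URL is accepted by +^(file|smb): before the
C:/Policies/ rule is ever consulted. With the narrowed list, the parent falls
through to the final "-." rule and is rejected.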

Sorry I cannot be of help, but I feel somewhat behind you in terms of Nutch
development. I am thinking of trying Nutch 0.8 instead of 0.9, as there is
more documentation for it, although I have read that it is slow, at half the
crawl speed of 0.9. Are you using 0.8?

Regards,

Oli


Vadim B wrote:
> 
> Were you able to solve the problem?
> 
> I get about 800kb/s transfer speed, which is not fast enough to use in a
> production environment. What about you?
> 
> 
> 
> opoole wrote:
>> 
>> Sorry Vadim,
>> 
>> I did not realise you had sent me the email [Doh!].
>> 
>> 
>> Vadim B wrote:
>>> 
>>> Hi,
>>> 
>>> I am working on the same issue as you. So far I can crawl file:///C:/*
>>> but I am stuck on the smb part. It looks to me as though this plugin isn't
>>> working properly, so it needs to be fixed for the newer version of Nutch.
>>> 
>>> The error I get differs a bit from yours:
>>> 
>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
>>> smb://mobidick/test/
>>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
>>> smb://mobidick/test/ failed with:
>>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>>> url=smb
>>> 
>>> I will dive into plugin-smb and try to narrow down the problem. Maybe
>>> we can work together to get a quick solution.
>>> 
>>> 
>>> 
>>> ---SNIP---
>>> 
>>> # accept hosts in MY.DOMAIN.NAME
>>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>> +^file:///C:/Policies/ <<-- why did you put this here? It has no effect,
>>> because the +^(file|smb): line above already matches, so this rule is
>>> never reached.
>>> ---SNIP ---
>>> 
>>> ---SNIP ---
>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/' 
>>> // Did you quote the URL, or is it displayed like this in the logs? I
>>> don't get this error.
>>> ---SNIP ---
>>> 
>>> Try this in the class org.apache.nutch.crawl.Crawl:
>>> 
>>>   public static void main(String args[]) throws Exception {
>>>     // new: register jcifs as a protocol handler package before any smb:// URL is parsed
>>>     System.setProperty("java.protocol.handler.pkgs", "jcifs");
>>>     // new: log the property to confirm that it is set
>>>     LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
>>>     // new: log the permission needed to read and write the property
>>>     LOG.info("SMB Info: " + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write").toString());
>>>     if (args.length < 1) {
>>>       System.out.println
>>>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>>       return;
>>>     }
>>> ---SNIP---
>>> 
>>> check out this:
>>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>>> 
>>> 
>>> 
>>> 
>>> 
>>> opoole wrote:
>>>> 
>>>> Hi All, I hope you can help as I am becomming rather depressed with
>>>> Nutch on Windows.
>>>> 
>>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>>> 
>>>> I cannot stop Nutch from crawling parent directories, I have looked at
>>>> other threads and none seem to work.
>>>> 
>>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>>>> for Java syntax corrections.
>>>> 
>>>> Below I have listed my configurations along with the command I type in
>>>> cygwin for jcifs:
>>>> 
>>>> CRAWL-URLFILTER
>>>> # The url filter file used by the crawl command.
>>>> 
>>>> # Better for intranet crawling.
>>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>>> 
>>>> # Each non-comment, non-blank line contains a regular expression
>>>> # prefixed by '+' or '-'.  The first matching pattern in the file
>>>> # determines whether a URL is included or ignored.  If no pattern
>>>> # matches, the URL is ignored.
>>>> 
>>>> # skip file:, ftp:, & mailto: urls
>>>> -^(http|ftp|mailto):
>>>> +^(file|smb):
>>>> 
>>>> # skip image and other suffixes we can't yet parse
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>>> 
>>>> # skip URLs containing certain characters as probable queries, etc.
>>>> 
>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>> 
>>>> # accept hosts in MY.DOMAIN.NAME
>>>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>>> +^file:///C:/Policies/ <<-- why did you put this here? It has no effect,
>>>> because the +^(file|smb): line above already matches!
>>>> 
>>>> # skip everything else
>>>> -.
>>>> 
>>>> NUTCH-SITE
>>>> 
>>>> <?xml version="1.0"?>
>>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>>> <!-- Put site-specific property overrides in this file. -->
>>>> 
>>>> <nutch-conf>
>>>> 
>>>> <property>
>>>>  <name>http.agent.name</name>
>>>>  <value>pascall</value>
>>>>  <description></description>
>>>> </property>
>>>> 
>>>> <property>
>>>>   <name>file.content.limit</name>
>>>>   <value>-1</value>
>>>>   <description>The length limit for downloaded content, in bytes.
>>>>   If this value is nonnegative (>=0), content longer than it will be
>>>> truncated;
>>>>   otherwise, no truncation at all.
>>>>   </description>
>>>> </property>
>>>> 
>>>> <property>
>>>>   <name>file.crawl.parent</name>
>>>>   <value>false</value>
>>>>   <description>The crawler is not restricted to the directories that
>>>> you specified in the
>>>>     Urls file but it is jumping into the parent directories as well.
>>>> For your own crawlings you can
>>>>     change this bahavior (set to false) the way that only directories
>>>> beneath the directories that you specify get
>>>>     crawled.</description>
>>>> </property>
>>>> 
>>>> <property>
>>>> <name>plugin.includes</name> 
>>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>>> </property> 
>>>> 
>>>> </nutch-conf>
>>>> 
>>>> CYGWIN
>>>> 
>>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>>> 
>>>> java -Djava.protocol.handler.pkgs=jcifs
>>>> 
>>>> When I press return the cygwin shell displays a list of java commands
>>>> as though I am using incorrect syntax.
>>>> 
>>>> Dump of Crawl from Cygwin:
>>>> 
>>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir:
>>>> urls.txt
>>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>>>> injected urls to crawl db entries.
>>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging
>>>> injected urls into crawl db.
>>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>>> where applicable
>>>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>>>> best-scoring urls due for fetch.
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>>>> false
>>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>>>> 2147483647
>>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker
>>>> is 'local', generating exactly one partition.
>>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>>>> selected urls by host, for politeness.
>>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>>> 'smb://sql1/Sales/DATA/'
>>>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>>>> crawl/segments/20070524140420
>>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>>>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>>>> smb://sql1/Sales/DATA/
>>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>>>> smb://sql1/Sales/DATA/ failed with:
>>>> org.apache.nutch.protocol.ProtocolNotFound:
>>>> java.net.MalformedURLException: unknown protocol: smb
>>>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>>>> file:///C:/Policies/
>>>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>>>> impl: org.apache.nutch.crawl.MD5Signature
>>>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>> Plugins:
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch core
>>>> extension points (nutch-extensionpoints)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
>>>> Parse Plug-in (parse-mspowerpoint)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
>>>> Filter (query-basic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Indexing
>>>> Filter (index-basic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
>>>> Plug-in (parse-html)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse
>>>> Plug-in (parse-pdf)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query
>>>> Filter (query-site)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
>>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
>>>> Plug-in (parse-text)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
>>>> Plug-in (parse-msword)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
>>>> Plug-in (protocol-smb)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
>>>> Plug-in (parse-msexcel)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
>>>> Plug-in (scoring-opic)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>>> Parser (lib-nekohtml)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j
>>>> (lib-log4j)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
>>>> Plug-in (protocol-file)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query
>>>> Filter (query-url)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
>>>> Documents Framework (lib-parsems)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>>> Extension-Points:
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Protocol
>>>> (org.apache.nutch.protocol.Protocol)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Analysis
>>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>>> Filter (org.apache.nutch.net.URLFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Indexing
>>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
>>>> Search Results Clustering Plugin
>>>> (org.apache.nutch.clustering.OnlineClusterer)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse
>>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
>>>> Parser (org.apache.nutch.parse.Parser)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
>>>> (org.apache.nutch.scoring.ScoringFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
>>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology Model
>>>> Loader (org.apache.nutch.ontology.Ontology)
>>>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>>>> crawl/crawldb
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>>>> [crawl/segments/20070524140420]
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>>>> allowed: true
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>> normalizing: true
>>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>>> filtering: true
>>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>>> segment data into db.
>>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>>> top-level element not <configuration>
>>>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins:
>>>> looking in: C:\nutch-0.9\plugins
>>>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>>>> Auto-activation mode: [true]
>>>> 
>>>> 
>>>> Thank you for reading my post, hope you can help.
>>>> 
>>>> Regards,
>>>> 
>>>> Oli



Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by Vadim B <Ma...@unterderbruecke.de>.
Were you able to solve the problem?

I get a transfer speed of about 800kb/s, which is not fast enough for use in a
production environment. What about you?
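
For anyone following the smb part: the "unknown protocol: smb" message in the
quoted log appears to come from java.net.URL itself, which rejects any scheme
it has no stream handler for. A minimal sketch that shows this outside Nutch,
assuming a plain JVM with no jcifs on the classpath and
java.protocol.handler.pkgs unset (the class name ProtocolCheck is made up for
illustration):

```java
import java.net.MalformedURLException;
import java.net.URL;

class ProtocolCheck {
    /** Returns null if java.net.URL accepts the string, otherwise the exception message. */
    static String urlError(String spec) {
        try {
            new URL(spec); // triggers the JVM's protocol handler lookup for the scheme
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // file: has a built-in handler; smb: does not unless one is registered
        System.out.println("file: " + urlError("file:///C:/Policies/"));
        System.out.println("smb:  " + urlError("smb://sql1/Sales/DATA/"));
    }
}
```

Note that typing `java -Djava.protocol.handler.pkgs=jcifs` on its own only
prints the launcher's usage text, because no main class follows it; the option
has to be part of the command line that actually starts the JVM running the
crawl.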


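On the parent-directory question, the filter file's own comment says the first
matching pattern decides. A rough sketch of that rule order, not Nutch's actual
filter code; the class FirstMatchFilter and the shortened rule lists are
assumptions for illustration (the suffix and loop rules are left out):

```java
import java.util.regex.Pattern;

class FirstMatchFilter {
    /** Rules are in file order; each starts with '+' (accept) or '-' (reject). */
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            if (Pattern.compile(rule.substring(1)).matcher(url).find()) {
                return accept; // the first matching rule decides
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    public static void main(String[] args) {
        String[] asPosted = {"-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-."};
        String[] narrowed = {"-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-."};

        String parent = "file:///C:/";
        String child = "file:///C:/Policies/handbook.doc";

        System.out.println("as posted, parent: " + accepts(asPosted, parent));
        System.out.println("narrowed,  parent: " + accepts(narrowed, parent));
        System.out.println("narrowed,  child:  " + accepts(narrowed, child));
    }
}
```

With the rules as posted, +^(file|smb): accepts every file: URL, including the
parent C:/, before the Policies rule is consulted. Whether dropping that line
is enough on its own to stop the parent crawl is not confirmed in this thread.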

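The repeated FATAL line "bad conf file: top-level element not <configuration>"
in the quoted log points at the root element of the posted nutch-site.xml,
which is <nutch-conf>. A small sketch of that kind of root-element check using
the JDK's DOM parser; this is not the Hadoop Configuration code, and
ConfRootCheck is a made-up name:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ConfRootCheck {
    /** Parses the XML text and returns the tag name of its top-level element. */
    static String rootElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** True when the top-level element is the one the log message asks for. */
    static boolean hasExpectedRoot(String xml) {
        return "configuration".equals(rootElement(xml));
    }

    public static void main(String[] args) {
        // Shortened stand-in for the posted nutch-site.xml
        String posted = "<?xml version=\"1.0\"?><nutch-conf><property>"
                + "<name>file.crawl.parent</name><value>false</value>"
                + "</property></nutch-conf>";
        // Same content with the root element renamed
        String renamed = posted.replace("nutch-conf>", "configuration>");

        System.out.println(rootElement(posted) + " ok? " + hasExpectedRoot(posted));
        System.out.println(rootElement(renamed) + " ok? " + hasExpectedRoot(renamed));
    }
}
```

Renaming <nutch-conf> to <configuration> in nutch-site.xml would presumably
remove that message; whether it also changes the crawl behaviour is an open
question here.
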
opoole wrote:
> 
> Sorry Vadim,
> 
> I did not realise you had sent me the email [Doh!].
> 
> 
> Vadim B wrote:
>> 
>> Hi,
>> 
>> I am working on the same issue as you. So far I can crawl file:///C:/*,
>> but I am stuck on the smb part. It looks to me as though this plugin isn't
>> working properly, so it needs to be fixed for the newer version of Nutch.
>> 
>> The error I get differs a bit from yours. It is:
>> 
>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
>> smb://mobidick/test/
>> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
>> smb://mobidick/test/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for
>> url=smb
>> 
>> I will dive into the protocol-smb plugin and try to narrow down the problem.
>> Maybe we can work together to get a quick solution.
>> 
>> 
>> 
>> ---SNIP---
>> 
>> # accept hosts in MY.DOMAIN.NAME
>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make sense,
>> because the +^(file|smb): line above already matches, so this line is never
>> reached.
>> ---SNIP ---
>> 
>> ---SNIP ---
>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/' 
>> // Did you quote the URL yourself, or is it displayed like this in the logs?
>> // I don't get this error.
>> ---SNIP ---
>> 
>> Try this in the class org.apache.nutch.crawl.Crawl:
>> 
>>   public static void main(String args[]) throws Exception {
>>     // new: register jcifs as a protocol handler package before any URL is created
>>     System.setProperty("java.protocol.handler.pkgs", "jcifs");
>>     // new: log the value to confirm the property is set
>>     LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
>>     // new: log the permission needed to read/write the property
>>     LOG.info("SMB Info: " + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write").toString());
>>     if (args.length < 1) {
>>       System.out.println
>>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>>       return;
>>     }
>> ---SNIP---
>> 
>> check out this:
>> http://java.sun.com/developer/onlineTraining/protocolhandlers/
>> 
>> 
>> 
>> 
>> 
>> opoole wrote:
>>> 
>>> Hi All, I hope you can help as I am becoming rather depressed with
>>> Nutch on Windows.
>>> 
>>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>>> 
>>> I cannot stop Nutch from crawling parent directories, I have looked at
>>> other threads and none seem to work.
>>> 
>>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>>> for Java syntax corrections.
>>> 
>>> Below I have listed my configurations along with the command I type in
>>> cygwin for jcifs:
>>> 
>>> CRAWL-URLFILTER
>>> # The url filter file used by the crawl command.
>>> 
>>> # Better for intranet crawling.
>>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>>> 
>>> # Each non-comment, non-blank line contains a regular expression
>>> # prefixed by '+' or '-'.  The first matching pattern in the file
>>> # determines whether a URL is included or ignored.  If no pattern
>>> # matches, the URL is ignored.
>>> 
>>> # skip file:, ftp:, & mailto: urls
>>> -^(http|ftp|mailto):
>>> +^(file|smb):
>>> 
>>> # skip image and other suffixes we can't yet parse
>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>>> 
>>> # skip URLs containing certain characters as probable queries, etc.
>>> 
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>> 
>>> # accept hosts in MY.DOMAIN.NAME
>>> # Standart +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>>> +^file:///C:/Policies/ <<-- why did you put this here? It doesn't make sense,
>>> because the +^(file|smb): line already matches!
>>> 
>>> # skip everything else
>>> -.
>>> 
>>> NUTCH-SITE
>>> 
>>> <?xml version="1.0"?>
>>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>>> <!-- Put site-specific property overrides in this file. -->
>>> 
>>> <nutch-conf>
>>> 
>>> <property>
>>>  <name>http.agent.name</name>
>>>  <value>pascall</value>
>>>  <description></description>
>>> </property>
>>> 
>>> <property>
>>>   <name>file.content.limit</name>
>>>   <value>-1</value>
>>>   <description>The length limit for downloaded content, in bytes.
>>>   If this value is nonnegative (>=0), content longer than it will be
>>> truncated;
>>>   otherwise, no truncation at all.
>>>   </description>
>>> </property>
>>> 
>>> <property>
>>>   <name>file.crawl.parent</name>
>>>   <value>false</value>
>>>   <description>The crawler is not restricted to the directories that you
>>> specified in the
>>>     Urls file but it is jumping into the parent directories as well. For
>>> your own crawlings you can
>>>     change this behavior (set to false) the way that only directories
>>> beneath the directories that you specify get
>>>     crawled.</description>
>>> </property>
>>> 
>>> <property>
>>> <name>plugin.includes</name> 
>>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>>> </property> 
>>> 
>>> </nutch-conf>
>>> 
>>> CYGWIN
>>> 
>>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>>> 
>>> java -Djava.protocol.handler.pkgs=jcifs
>>> 
>>> When I press return the cygwin shell displays a list of java commands as
>>> though I am using incorrect syntax.
>>> 
>>> Dump of Crawl from Cygwin:
>>> 
>>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>>> crawl/crawldb
>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir:
>>> urls.txt
>>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>>> injected urls to crawl db entries.
>>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging
>>> injected urls into crawl db.
>>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>>> best-scoring urls due for fetch.
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>>> crawl/segments/20070524140420
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>>> false
>>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>>> 2147483647
>>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
>>> 'local', generating exactly one partition.
>>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/'
>>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>>> selected urls by host, for politeness.
>>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>>> 'smb://sql1/Sales/DATA/'
>>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>>> crawl/segments/20070524140420
>>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>>> smb://sql1/Sales/DATA/
>>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>>> smb://sql1/Sales/DATA/ failed with:
>>> org.apache.nutch.protocol.ProtocolNotFound:
>>> java.net.MalformedURLException: unknown protocol: smb
>>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>>> file:///C:/Policies/
>>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>>> impl: org.apache.nutch.crawl.MD5Signature
>>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>> Plugins:
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
>>> Parse Plug-in (parse-mspowerpoint)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
>>> Filter (query-basic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Indexing
>>> Filter (index-basic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
>>> Plug-in (parse-html)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse
>>> Plug-in (parse-pdf)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query
>>> Filter (query-site)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
>>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
>>> Plug-in (parse-text)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
>>> Plug-in (parse-msword)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
>>> Plug-in (protocol-smb)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
>>> Plug-in (parse-msexcel)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
>>> Plug-in (scoring-opic)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko HTML
>>> Parser (lib-nekohtml)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j
>>> (lib-log4j)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
>>> Plug-in (protocol-file)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query
>>> Filter (query-url)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
>>> Documents Framework (lib-parsems)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>>> Extension-Points:
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch
>>> Summarizer (org.apache.nutch.searcher.Summarizer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>> Normalizer (org.apache.nutch.net.URLNormalizer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>>> Filter (org.apache.nutch.net.URLFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Indexing
>>> Filter (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
>>> Search Results Clustering Plugin
>>> (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse
>>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
>>> Parser (org.apache.nutch.parse.Parser)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
>>> Filter (org.apache.nutch.searcher.QueryFilter)
>>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology Model
>>> Loader (org.apache.nutch.ontology.Ontology)
>>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>>> crawl/crawldb
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>>> [crawl/segments/20070524140420]
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>>> allowed: true
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>> normalizing: true
>>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>>> filtering: true
>>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>>> segment data into db.
>>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>>> top-level element not <configuration>
>>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
>>> in: C:\nutch-0.9\plugins
>>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>>> Auto-activation mode: [true]
>>> 
>>> 
>>> Thank you for reading my post, hope you can help.
>>> 
>>> Regards,
>>> 
>>> Oli
>>> 
>> 
>> 
> 
> 



Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by opoole <op...@pascall.co.uk>.
Sorry Vadim,

I did not realise you had sent me the email [Doh!].


Vadim B wrote:
> 
> Hi,
> 
> I am working on the same issue as you. So far I can crawl file:///C:/*,
> but I am stuck on the smb part. It looks to me as though this plugin isn't
> working properly and needs to be fixed for the newer version of Nutch.
> 
> The error I get differs a bit from yours:
> 
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
> smb://mobidick/test/
> 2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
> smb://mobidick/test/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
> 
> I will dive into the protocol-smb plugin and try to narrow down the
> problem. Maybe we can work together to get a quick solution.
> 
> 
> 
> ---SNIP---
> 
> # accept hosts in MY.DOMAIN.NAME
> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put this here? It has no effect:
> the +^(file|smb): line above already matches every file: URL, so this
> rule is never reached.
> ---SNIP ---
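
An inline note on the rule order discussed above. The header of the filter file says that the first matching pattern decides, so a broad +^(file|smb): placed before +^file:///C:/Policies/ accepts every file: URL, parent directories included. Below is a minimal Java sketch of that first-match rule. It is not Nutch's own filter code; the class name FirstMatchFilter and the method accepts are made up for illustration, and the rule lists are cut down from the posted file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilter {
    // Each rule is '+' (accept) or '-' (reject) followed by a regular expression,
    // in the same style as crawl-urlfilter.txt.
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean accept = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return accept; // the first rule that matches decides
            }
        }
        return false; // no rule matched: the URL is ignored
    }

    public static void main(String[] args) {
        // Rule order as posted: the broad file|smb rule comes first.
        List<String> asPosted = Arrays.asList(
                "-^(http|ftp|mailto):", "+^(file|smb):", "+^file:///C:/Policies/", "-.");
        // Hypothetical narrowed variant: only the Policies prefix is accepted.
        List<String> narrowed = Arrays.asList(
                "-^(http|ftp|mailto):", "+^file:///C:/Policies/", "-.");

        System.out.println(accepts(asPosted, "file:///C:/"));                   // parent directory accepted
        System.out.println(accepts(narrowed, "file:///C:/"));                   // parent directory rejected
        System.out.println(accepts(narrowed, "file:///C:/Policies/rules.doc")); // file under Policies accepted
    }
}
```

With the narrowed list the smb: seed would also be rejected, so an smb prefix rule of the same shape would be needed if that share is still wanted.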
> 
> ---SNIP ---
> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/' 
> // Did you quote the URL yourself, or is it displayed like this in the
> logs? I don't get this error.
> ---SNIP ---
> 
> Try this in the class org.apache.nutch.crawl.Crawl:
> 
>   public static void main(String args[]) throws Exception {
>     // new: make java.net.URL look for protocol handlers in the jcifs package
>     System.setProperty("java.protocol.handler.pkgs", "jcifs");
>     // new: log the value to confirm that the property is set
>     LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
>     // new: log the permission that covers reading and writing this property
>     LOG.info("SMB Info: " + new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write").toString());
>     if (args.length < 1) {
>       System.out.println
>         ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
>       return;
>     }
> ---SNIP---
> 
> check out this:
> http://java.sun.com/developer/onlineTraining/protocolhandlers/
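
An inline note on why the property matters. Both the "Malformed URL" warning and the "unknown protocol: smb" fetch failure in the log are what java.net.URL reports when no handler is registered for a scheme. The sketch below reproduces that on a plain JVM and then parses the same URL with a stand-in handler. StubSmbHandler is a hypothetical placeholder, not the jcifs handler, and it cannot open connections; it only shows that parsing succeeds once some handler exists.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class SmbUrlCheck {
    // Returns null if the URL parses, otherwise the exception message.
    static String parseError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Stand-in handler: enough to parse smb: URLs, but it cannot fetch anything.
    static class StubSmbHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: no SMB client available");
        }
    }

    // Parses the URL with the stub handler supplied explicitly and returns its host part.
    static String hostWithStub(String spec) throws MalformedURLException {
        return new URL(null, spec, new StubSmbHandler()).getHost();
    }

    public static void main(String[] args) throws Exception {
        // Assumes no smb handler is installed in this JVM (no jcifs, property not set).
        System.out.println(parseError("smb://sql1/Sales/DATA/"));
        System.out.println(hostWithStub("smb://sql1/Sales/DATA/"));
    }
}
```

The property has to reach the JVM that actually runs the crawl. Typing java -Djava.protocol.handler.pkgs=jcifs with no class name after it only makes java print its usage text, which matches the behaviour described in the original post. If your bin/nutch script honours a NUTCH_OPTS variable, that may be a way to pass the option without changing Crawl.java, but treat that as an assumption to check against your copy of the script.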
> 
> 
> 
> 
> 
> opoole wrote:
>> 
>> Hi All, I hope you can help as I am becoming rather depressed with Nutch
>> on Windows.
>> 
>> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
>> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
>> 
>> I cannot stop Nutch from crawling parent directories. I have looked at
>> other threads, and none of the suggestions seem to work.
>> 
>> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
>> for Java syntax corrections.
>> 
>> Below I have listed my configurations along with the command I type in
>> cygwin for jcifs:
>> 
>> CRAWL-URLFILTER
>> # The url filter file used by the crawl command.
>> 
>> # Better for intranet crawling.
>> # Be sure to change MY.DOMAIN.NAME to your domain name.
>> 
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'.  The first matching pattern in the file
>> # determines whether a URL is included or ignored.  If no pattern
>> # matches, the URL is ignored.
>> 
>> # skip file:, ftp:, & mailto: urls
>> -^(http|ftp|mailto):
>> +^(file|smb):
>> 
>> # skip image and other suffixes we can't yet parse
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> 
>> # skip URLs containing certain characters as probable queries, etc.
>> 
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> 
>> # accept hosts in MY.DOMAIN.NAME
>> # Standard +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> +^file:///C:/Policies/ <<-- Why did you put this here? It makes no sense,
>> because the +^(file|smb): line above already matches!
>> 
>> # skip everything else
>> -.
>> 
>> NUTCH-SITE
>> 
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
>> <!-- Put site-specific property overrides in this file. -->
>> 
>> <nutch-conf>
>> 
>> <property>
>>  <name>http.agent.name</name>
>>  <value>pascall</value>
>>  <description></description>
>> </property>
>> 
>> <property>
>>   <name>file.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content, in bytes.
>>   If this value is nonnegative (>=0), content longer than it will be
>> truncated;
>>   otherwise, no truncation at all.
>>   </description>
>> </property>
>> 
>> <property>
>>   <name>file.crawl.parent</name>
>>   <value>false</value>
>>   <description>By default the crawler is not restricted to the directories
>>     that you specify in the urls file; it also climbs into their parent
>>     directories. Set this to false so that only directories beneath the
>>     ones you specify get crawled.</description>
>> </property>
>> 
>> <property>
>> <name>plugin.includes</name> 
>> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
>> </property> 
>> 
>> </nutch-conf>
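
An inline note on the root element. The crawl log below repeats "bad conf file: top-level element not <configuration>", and the nutch-site.xml quoted above uses <nutch-conf> as its root. That suggests the file is being rejected, in which case overrides such as file.crawl.parent would not take effect. The Java sketch below only checks the root element name with the standard XML parser. ConfRootCheck and the shortened XML strings are made up for illustration, and this is not how Nutch itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

public class ConfRootCheck {
    // Parses the XML text and returns the name of its top-level element.
    static String rootElement(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement()
                .getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the posted file, keeping its root element.
        String posted = "<?xml version=\"1.0\"?>"
                + "<nutch-conf><property><name>file.crawl.parent</name>"
                + "<value>false</value></property></nutch-conf>";
        // Same content with the root renamed to the element the log asks for.
        String renamed = posted.replace("nutch-conf>", "configuration>");

        System.out.println(rootElement(posted));  // nutch-conf
        System.out.println(rootElement(renamed)); // configuration
    }
}
```

If the diagnosis holds, renaming the opening and closing <nutch-conf> tags to <configuration> in conf/nutch-site.xml (and in any other conf file with the old root) should make the FATAL lines go away.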
>> 
>> CYGWIN
>> 
>> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
>> 
>> java -Djava.protocol.handler.pkgs=jcifs
>> 
>> When I press return the cygwin shell displays a list of java commands as
>> though I am using incorrect syntax.
>> 
>> Dump of Crawl from Cygwin:
>> 
>> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
>> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
>> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
>> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
>> crawl/crawldb
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
>> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
>> injected urls to crawl db entries.
>> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
>> best-scoring urls due for fetch.
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
>> false
>> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
>> 2147483647
>> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
>> 'smb://sql1/Sales/DATA/'
>> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
>> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
>> crawl/segments/20070524140420
>> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
>> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
>> smb://sql1/Sales/DATA/
>> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
>> smb://sql1/Sales/DATA/ failed with:
>> org.apache.nutch.protocol.ProtocolNotFound:
>> java.net.MalformedURLException: unknown protocol: smb
>> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
>> file:///C:/Policies/
>> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
>> impl: org.apache.nutch.crawl.MD5Signature
>> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Plugins:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
>> Parse Plug-in (parse-mspowerpoint)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
>> Filter (query-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Indexing
>> Filter (index-basic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
>> Plug-in (parse-html)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse
>> Plug-in (parse-pdf)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query
>> Filter (query-site)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
>> Java API To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
>> Plug-in (parse-text)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
>> Plug-in (parse-msword)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
>> Plug-in (protocol-smb)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
>> Plug-in (parse-msexcel)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
>> Plug-in (scoring-opic)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko HTML
>> Parser (lib-nekohtml)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j
>> (lib-log4j)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
>> Plug-in (protocol-file)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query Filter
>> (query-url)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
>> Documents Framework (lib-parsems)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
>> Extension-Points:
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
>> Normalizer (org.apache.nutch.net.URLNormalizer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Indexing
>> Filter (org.apache.nutch.indexer.IndexingFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
>> Search Results Clustering Plugin
>> (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse
>> Filter (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
>> Parser (org.apache.nutch.parse.Parser)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
>> Filter (org.apache.nutch.searcher.QueryFilter)
>> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology Model
>> Loader (org.apache.nutch.ontology.Ontology)
>> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
>> crawl/crawldb
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
>> [crawl/segments/20070524140420]
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
>> allowed: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> normalizing: true
>> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
>> filtering: true
>> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
>> segment data into db.
>> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
>> top-level element not <configuration>
>> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
>> in: C:\nutch-0.9\plugins
>> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
>> Auto-activation mode: [true]
>> 
>> 
>> Thank you for reading my post, hope you can help.
>> 
>> Regards,
>> 
>> Oli
>> 
> 
> 



Re: WIN XP PRO -Djava.protocol* file:///c:/folder/ Crawling Parents

Posted by Ever <ev...@gmx.de>.
Hi,

I am working on the same issue as you. So far I can crawl file:///C:/* but
I am stuck on the smb part. It looks to me as though this plugin isn't working
properly, so it needs to be fixed for the newer version of Nutch.

The error I get differs a bit from yours. It is:

2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetching
smb://mobidick/test/
2007-05-25 18:06:29,573 INFO  fetcher.Fetcher - fetch of
smb://mobidick/test/ failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

I will dive into the protocol-smb plugin and try to narrow down the problem.
Maybe we can work together to get a quick solution.



---SNIP---

# accept hosts in MY.DOMAIN.NAME
# Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^file:///C:/Policies/ <<-- Why did you put it here? It doesn't make sense,
because the +^(file|smb): line above already matches, so this line is never
reached.
---SNIP ---
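
To illustrate the point about rule order, here is a minimal sketch of "first matching pattern wins" as described in the comments of the filter file. It is not Nutch's own filter code; the class and method names are made up for this example, and I am assuming the patterns are applied as unanchored "find" matches.

```java
import java.util.regex.Pattern;

public class FirstMatchFilterSketch {

    // Walk the rules top to bottom; the first pattern that matches decides.
    // A rule starts with '+' (accept) or '-' (reject); no match means reject.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rule order as posted: the broad file/smb rule comes first.
        String[] posted = {
            "-^(http|ftp|mailto):",
            "+^(file|smb):",
            "+^file:///C:/Policies/",
            "-."
        };
        // Hypothetical narrowed variant: the broad rule is removed.
        String[] narrowed = {
            "-^(http|ftp|mailto):",
            "+^file:///C:/Policies/",
            "-."
        };
        String parent = "file:///C:/";
        String child = "file:///C:/Policies/a.doc";

        System.out.println(accepts(posted, parent));   // true: parent is accepted
        System.out.println(accepts(posted, child));    // true
        System.out.println(accepts(narrowed, parent)); // false: parent is rejected
        System.out.println(accepts(narrowed, child));  // true
    }
}
```

Note that in the narrowed variant smb: URLs would also fall through to the final "-." rule, so a specific smb rule would be needed as well.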

---SNIP ---
2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
'smb://sql1/Sales/DATA/' 
// Did you quote the URL, or is it displayed like this in the logs? I don't
get this error.
---SNIP ---
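
For what it's worth, the "Malformed URL" warning can be reproduced outside Nutch. The sketch below only uses java.net.URL from the JDK; the class name is made up, and it assumes no smb handler (such as jcifs) is on the class path when it runs.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SmbUrlCheck {

    // Try to parse the string as a java.net.URL and report what happened.
    static String describe(String spec) {
        try {
            URL url = new URL(spec);
            return "parsed, protocol=" + url.getProtocol();
        } catch (MalformedURLException e) {
            // Without a registered handler the JDK rejects the scheme itself.
            return "malformed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The JDK ships a handler for file:, so this one parses.
        System.out.println(describe("file:///C:/Policies/"));
        // With no smb handler installed this reports "unknown protocol: smb".
        System.out.println(describe("smb://sql1/Sales/DATA/"));
    }
}
```

So the URL itself is probably fine; the JVM just does not know the smb scheme at the point where the URL object is created.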

Try this in the class org.apache.nutch.crawl.Crawl:

  public static void main(String args[]) throws Exception {
    // new: name jcifs as a URL protocol handler package before any smb:// URL is parsed
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    // new: log the value to confirm the property is set
    LOG.info("SMB Info: " + System.getProperty("java.protocol.handler.pkgs"));
    // new: log the permission needed to read and write this property
    LOG.info("SMB Info: " + new java.util.PropertyPermission(
        "java.protocol.handler.pkgs", "read,write").toString());
    if (args.length < 1) {
      System.out.println
        ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]");
      return;
    }
---SNIP---

check out this:
http://java.sun.com/developer/onlineTraining/protocolhandlers/
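
One more thing I noticed in your log: the line "FATAL conf.Configuration - bad conf file: top-level element not <configuration>" shows up again and again, and the nutch-site.xml you posted has <nutch-conf> as its root element. If that file is being rejected, your overrides (including file.crawl.parent) may not be applied at all. Here is a small sketch, using only the JDK's XML parser, to check the root element of a conf file. The class name and the inline XML strings are just for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ConfRootCheck {

    // Parse the XML text and return the name of its top-level element.
    static String rootElement(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse conf XML", e);
        }
    }

    // The log message implies the loader expects this root element name.
    static boolean hasExpectedRoot(String xml) {
        return "configuration".equals(rootElement(xml));
    }

    public static void main(String[] args) {
        String posted = "<?xml version=\"1.0\"?><nutch-conf><property>"
            + "<name>file.crawl.parent</name><value>false</value>"
            + "</property></nutch-conf>";
        String renamed = posted.replace("nutch-conf", "configuration");

        System.out.println(rootElement(posted) + " -> " + hasExpectedRoot(posted));
        System.out.println(rootElement(renamed) + " -> " + hasExpectedRoot(renamed));
    }
}
```

If the check fails for your real file, renaming the root element to <configuration> would be the first thing I would try.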





opoole wrote:
> 
> Hi All, I hope you can help as I am becoming rather depressed with Nutch
> on Windows.
> 
> Using: Windows XP Pro SP2 - Nutch-0.9 - Cygwin [current version from
> cygwin site] - Java JDK 1.6.0 - Java Platform Standard Edition 1.6.0
> 
> I cannot stop Nutch from crawling parent directories, I have looked at
> other threads and none seem to work.
> 
> I have tried to include protocol-smb [jcifs] but Cygwin keeps prompting
> for Java syntax corrections.
> 
> Below I have listed my configurations along with the command I type in
> cygwin for jcifs:
> 
> CRAWL-URLFILTER
> # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file:, ftp:, & mailto: urls
> -^(http|ftp|mailto):
> +^(file|smb):
> 
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> # accept hosts in MY.DOMAIN.NAME
> # Standard: +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^file:///C:/Policies/ <<-- Why did you put it here? It doesn't make sense,
> because the +^(file|smb): line above already matches!
> 
> # skip everything else
> -.
> 
> NUTCH-SITE
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> 
> <nutch-conf>
> 
> <property>
>  <name>http.agent.name</name>
>  <value>pascall</value>
>  <description></description>
> </property>
> 
> <property>
>   <name>file.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
> truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
> 
> <property>
>   <name>file.crawl.parent</name>
>   <value>false</value>
>   <description>The crawler is not restricted to the directories that you
> specified in the
>     Urls file but it is jumping into the parent directories as well. For
> your own crawlings you can
>     change this behavior (set to false) the way that only directories
> beneath the directories that you specify get
>     crawled.</description>
> </property>
> 
> <property>
> <name>plugin.includes</name> 
> <value>protocol-file|protocol-smb|scoring-opic|parse-(msexcel|mspowerpoint|msword|xml|text|html|pdf)|index-basic|query-(basic|site|url)</value>
> </property> 
> 
> </nutch-conf>
> 
> CYGWIN
> 
> Using cygwin I enter the command from C:\Sun\Java\jdk160\bin\
> 
> java -Djava.protocol.handler.pkgs=jcifs
> 
> When I press return the cygwin shell displays a list of java commands as
> though I am using incorrect syntax.
> 
> Dump of Crawl from Cygwin:
> 
> 2007-05-24 14:04:16,140 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - crawl started in: crawl
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - rootUrlDir = urls.txt
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - threads = 10
> 2007-05-24 14:04:16,171 INFO  crawl.Crawl - depth = 5
> 2007-05-24 14:04:16,281 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: starting
> 2007-05-24 14:04:16,281 INFO  crawl.Injector - Injector: crawlDb:
> crawl/crawldb
> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: urlDir: urls.txt
> 2007-05-24 14:04:16,296 INFO  crawl.Injector - Injector: Converting
> injected urls to crawl db entries.
> 2007-05-24 14:04:16,328 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,843 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:16,953 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:17,156 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:17,875 INFO  crawl.Injector - Injector: Merging injected
> urls into crawl db.
> 2007-05-24 14:04:17,906 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:18,375 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 2007-05-24 14:04:19,281 INFO  crawl.Injector - Injector: done
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: Selecting
> best-scoring urls due for fetch.
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: starting
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: filtering:
> false
> 2007-05-24 14:04:20,281 INFO  crawl.Generator - Generator: topN:
> 2147483647
> 2007-05-24 14:04:20,312 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,312 INFO  crawl.Generator - Generator: jobtracker is
> 'local', generating exactly one partition.
> 2007-05-24 14:04:20,562 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:20,609 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:20,781 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:20,796 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:20,843 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:21,000 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:21,578 INFO  crawl.Generator - Generator: Partitioning
> selected urls by host, for politeness.
> 2007-05-24 14:04:21,593 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,828 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:21,859 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:22,000 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:22,000 WARN  crawl.PartitionUrlByHost - Malformed URL:
> 'smb://sql1/Sales/DATA/'
> 2007-05-24 14:04:22,843 INFO  crawl.Generator - Generator: done.
> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: starting
> 2007-05-24 14:04:22,843 INFO  fetcher.Fetcher - Fetcher: segment:
> crawl/segments/20070524140420
> 2007-05-24 14:04:22,859 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,156 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:23,187 INFO  fetcher.Fetcher - Fetcher: threads: 10
> 2007-05-24 14:04:23,203 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:23,343 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetching
> smb://sql1/Sales/DATA/
> 2007-05-24 14:04:23,390 INFO  fetcher.Fetcher - fetch of
> smb://sql1/Sales/DATA/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound:
> java.net.MalformedURLException: unknown protocol: smb
> 2007-05-24 14:04:23,500 INFO  fetcher.Fetcher - fetching
> file:///C:/Policies/
> 2007-05-24 14:04:23,718 INFO  crawl.SignatureFactory - Using Signature
> impl: org.apache.nutch.crawl.MD5Signature
> 2007-05-24 14:04:24,671 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
> Plugins:
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	the nutch core
> extension points (nutch-extensionpoints)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSPowerPoint
> Parse Plug-in (parse-mspowerpoint)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Query
> Filter (query-basic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Basic Indexing
> Filter (index-basic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Html Parse
> Plug-in (parse-html)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Pdf Parse Plug-in
> (parse-pdf)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Site Query Filter
> (query-site)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Jakarta POI -
> Java API To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Text Parse
> Plug-in (parse-text)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSWord Parse
> Plug-in (parse-msword)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	SMB Protocol
> Plug-in (protocol-smb)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	MSExcel Parse
> Plug-in (parse-msexcel)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	OPIC Scoring
> Plug-in (scoring-opic)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	CyberNeko HTML
> Parser (lib-nekohtml)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Log4j (lib-log4j)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	File Protocol
> Plug-in (protocol-file)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	URL Query Filter
> (query-url)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Parse MS
> Documents Framework (lib-parsems)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - Registered
> Extension-Points:
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Online
> Search Results Clustering Plugin
> (org.apache.nutch.clustering.OnlineClusterer)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Nutch Query
> Filter (org.apache.nutch.searcher.QueryFilter)
> 2007-05-24 14:04:24,812 INFO  plugin.PluginRepository - 	Ontology Model
> Loader (org.apache.nutch.ontology.Ontology)
> 2007-05-24 14:04:25,171 INFO  fetcher.Fetcher - Fetcher: done
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: starting
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: db:
> crawl/crawldb
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: segments:
> [crawl/segments/20070524140420]
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: additions
> allowed: true
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
> normalizing: true
> 2007-05-24 14:04:25,171 INFO  crawl.CrawlDb - CrawlDb update: URL
> filtering: true
> 2007-05-24 14:04:25,203 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,203 INFO  crawl.CrawlDb - CrawlDb update: Merging
> segment data into db.
> 2007-05-24 14:04:25,421 FATAL conf.Configuration - bad conf file:
> top-level element not <configuration>
> 2007-05-24 14:04:25,468 INFO  plugin.PluginRepository - Plugins: looking
> in: C:\nutch-0.9\plugins
> 2007-05-24 14:04:25,593 INFO  plugin.PluginRepository - Plugin
> Auto-activation mode: [true]
> 
> 
> Thank you for reading my post; I hope you can help.
> 
> Regards,
> 
> Oli
> 
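
The repeated "FATAL conf.Configuration - bad conf file: top-level element not <configuration>" lines in the log above suggest that at least one file under conf/ still uses the older <nutch-conf> root element. Below is a minimal sketch of a site override with the root renamed. It assumes the file in question is nutch-site.xml and carries over only the http.agent.name property from the posted configuration; any other properties would stay as they are, inside the new root element.

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<!-- Root element renamed from <nutch-conf> to <configuration>,
     which is what the FATAL log message asks for. -->
<configuration>

<property>
 <name>http.agent.name</name>
 <value>pascall</value>
 <description></description>
</property>

<!-- Remaining properties from the original file go here unchanged. -->

</configuration>
```

If the FATAL lines persist after this change, it may be worth checking whether another file in conf/ (for example a copy carried over from an older release) still has the old root element. Note that this would not address the separate "unknown protocol: smb" failure shown in the fetcher output.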
