Posted to user@nutch.apache.org by Rochelle Rees <ro...@canterbury.ac.nz> on 2008/05/20 04:57:33 UTC

Help Please! Nutch crawl fails on Dedup

Hi there,

I have a problem with my crawl failing at:

Dedup adding indexes in: crawls/test/indexes
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
	at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I have searched for threads describing a similar problem and found a
number of them; however, the only solution I could find was to apply the
patches from:
https://issues.apache.org/jira/browse/NUTCH-525
However, applying deleteDups.patch and RededupUnitTest.patch made no
difference whatsoever.

Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.

intranet.canterbury.ac.nz requires authentication, so I applied the
NUTCH-559v0.5.patch file; however, the error occurs with or
without this patch, and regardless of what I put in the
conf/httpclient-auth.xml file.

Does anyone have any ideas what I can do to fix this issue?

For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
urls/urls.txt files are pasted below.

Please let me know if you need any further info.

--------------------------------------------
conf/nutch-site.xml
--------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>University of Canterbury Intranet</value>
  <description>
    University of Canterbury Intranet
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Intranet for University of Canterbury</value>
  <description> Intranet for University of Canterbury
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>Web Support Email</value>
  <description>websupport@canterbury.ac.nz
  </description>
</property>

</configuration>
--------------------------------------------
--------------------------------------------

conf/crawl-urlfilter.txt
--------------------------------------------
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/

# skip everything else
-.
--------------------------------------------
--------------------------------------------
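
The first-match rule described in the comments of the filter file above
can be sketched as follows. This is a minimal Python illustration, not
Nutch's own code: the RULES list and the accepts() helper are
hypothetical names, the patterns are copied from the file above, and it
assumes each pattern is searched for anywhere in the URL, which may
differ in detail from how the urlfilter-regex plugin applies Java
regular expressions.

```python
import re

# (sign, pattern) pairs copied from conf/crawl-urlfilter.txt, in file order.
# RULES and accepts() are illustrative names, not part of Nutch.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/"),
    ("-", r"."),
]


def accepts(url: str) -> bool:
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in RULES:
        # Assumption: a rule "matches" if the pattern occurs anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all: the file's comments say the URL is ignored.
    return False
```

Under these assumptions, http://intranet.canterbury.ac.nz/ is accepted,
while the same host without a trailing slash falls through to the final
"-." rule, and image or query URLs on the host are rejected by the
earlier rules.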


urls/urls.txt
--------------------------------------------
http://intranet.canterbury.ac.nz

--------------------------------------------
--------------------------------------------

Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD) 
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@canterbury.ac.nz
http://www.canterbury.ac.nz/ 

For all web enquiries please contact:
websupport@canterbury.ac.nz Ext: 3100
http://www.canterbury.ac.nz/web  


RE: Help Please! Nutch crawl fails on Dedup

Posted by Rochelle Rees <ro...@canterbury.ac.nz>.
Sorry, I didn't realise what I was getting was a standard error.
Logs/hadoop.log is pasted below - hopefully that helps.

-------------------------------------------
2008-05-21 09:43:53,158 INFO  crawl.Crawl - crawl started in: crawls/intranet
2008-05-21 09:43:53,158 INFO  crawl.Crawl - rootUrlDir = urls
2008-05-21 09:43:53,158 INFO  crawl.Crawl - threads = 10
2008-05-21 09:43:53,158 INFO  crawl.Crawl - depth = 3
2008-05-21 09:43:53,158 INFO  crawl.Crawl - topN = 50
2008-05-21 09:43:53,236 INFO  crawl.Injector - Injector: starting
2008-05-21 09:43:53,236 INFO  crawl.Injector - Injector: crawlDb: crawls/intranet/crawldb
2008-05-21 09:43:53,236 INFO  crawl.Injector - Injector: urlDir: urls
2008-05-21 09:43:53,236 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2008-05-21 09:43:53,801 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:53,958 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:54,021 WARN  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2008-05-21 09:43:54,758 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2008-05-21 09:43:55,307 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2008-05-21 09:43:57,126 INFO  crawl.Injector - Injector: done
2008-05-21 09:43:58,129 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-05-21 09:43:58,129 INFO  crawl.Generator - Generator: starting
2008-05-21 09:43:58,129 INFO  crawl.Generator - Generator: segment: crawls/intranet/segments/20080521094358
2008-05-21 09:43:58,129 INFO  crawl.Generator - Generator: filtering: false
2008-05-21 09:43:58,129 INFO  crawl.Generator - Generator: topN: 50
2008-05-21 09:43:58,145 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2008-05-21 09:43:58,584 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:58,710 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:58,741 WARN  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2008-05-21 09:43:58,804 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:58,929 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:59,572 INFO  crawl.Generator - Generator: Partitioning selected urls by host, for politeness.
2008-05-21 09:43:59,949 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:00,043 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:00,058 WARN  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2008-05-21 09:44:00,937 INFO  crawl.Generator - Generator: done.
2008-05-21 09:44:00,937 INFO  fetcher.Fetcher - Fetcher: starting
2008-05-21 09:44:00,937 INFO  fetcher.Fetcher - Fetcher: segment: crawls/intranet/segments/20080521094358
2008-05-21 09:44:01,329 INFO  fetcher.Fetcher - Fetcher: threads: 10
2008-05-21 09:44:01,329 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:01,423 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:01,470 INFO  fetcher.Fetcher - fetching http://intranet.canterbury.ac.nz/
2008-05-21 09:44:01,486 FATAL api.RobotRulesParser - Agent we advertise (University of Canterbury Intranet) not listed first in 'http.robots.agents' property!
2008-05-21 09:44:01,486 INFO  http.Http - http.proxy.host = null
2008-05-21 09:44:01,486 INFO  http.Http - http.proxy.port = 8080
2008-05-21 09:44:01,486 INFO  http.Http - http.timeout = 10000
2008-05-21 09:44:01,486 INFO  http.Http - http.content.limit = 65536
2008-05-21 09:44:01,486 INFO  http.Http - http.agent = University of Canterbury Intranet/Nutch-0.9 (Intranet for University of Canterbury; Web Support Email)
2008-05-21 09:44:01,486 INFO  http.Http - protocol.plugin.check.blocking = true
2008-05-21 09:44:01,486 INFO  http.Http - protocol.plugin.check.robots = true
2008-05-21 09:44:01,486 INFO  http.Http - fetcher.server.delay = 1000
2008-05-21 09:44:01,486 INFO  http.Http - http.max.delays = 1000
2008-05-21 09:44:03,195 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:03,289 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,308 INFO  fetcher.Fetcher - Fetcher: done
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: starting
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: db: crawls/intranet/crawldb
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawls/intranet/segments/20080521094358]
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: URL normalizing: true
2008-05-21 09:44:04,308 INFO  crawl.CrawlDb - CrawlDb update: URL filtering: true
2008-05-21 09:44:04,324 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2008-05-21 09:44:04,700 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:04,795 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,810 WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2008-05-21 09:44:04,857 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:04,936 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,951 WARN  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2008-05-21 09:44:05,030 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:05,140 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:05,202 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:05,296 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:05,783 INFO  crawl.CrawlDb - CrawlDb update: done
2008-05-21 09:44:06,786 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-05-21 09:44:06,786 INFO  crawl.Generator - Generator: starting
2008-05-21 09:44:06,786 INFO  crawl.Generator - Generator: segment: crawls/intranet/segments/20080521094406
2008-05-21 09:44:06,786 INFO  crawl.Generator - Generator: filtering: false
2008-05-21 09:44:06,786 INFO  crawl.Generator - Generator: topN: 50
2008-05-21 09:44:06,802 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2008-05-21 09:44:07,178 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:07,257 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:07,319 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:07,461 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:08,166 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2008-05-21 09:44:08,166 INFO  crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
2008-05-21 09:44:08,166 INFO  crawl.LinkDb - LinkDb: starting
2008-05-21 09:44:08,166 INFO  crawl.LinkDb - LinkDb: linkdb: crawls/intranet/linkdb
2008-05-21 09:44:08,166 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2008-05-21 09:44:08,166 INFO  crawl.LinkDb - LinkDb: URL filter: true
2008-05-21 09:44:08,182 INFO  crawl.LinkDb - LinkDb: adding segment: crawls/intranet/segments/20080521094358
2008-05-21 09:44:10,440 INFO  crawl.LinkDb - LinkDb: done
2008-05-21 09:44:10,440 INFO  indexer.Indexer - Indexer: starting
2008-05-21 09:44:10,440 INFO  indexer.Indexer - Indexer: linkdb: crawls/intranet/linkdb
2008-05-21 09:44:10,456 INFO  indexer.Indexer - Indexer: adding segment: crawls/intranet/segments/20080521094358
2008-05-21 09:44:10,848 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:10,911 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:10,926 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:10,926 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:10,958 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,083 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,083 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,115 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,177 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,193 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,240 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,319 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,319 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,350 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,444 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,444 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,789 INFO  plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Basic
Query Filter (query-basic)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Site
Query Filter (query-site)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	HTTP
Framework (lib-http)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	URL
Query Filter (query-url)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,914 INFO  plugin.PluginRepository - 	Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,914 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,993 INFO  indexer.Indexer - Optimizing index.
2008-05-21 09:44:12,855 INFO  indexer.Indexer - Indexer: done
2008-05-21 09:44:12,855 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-05-21 09:44:12,871 INFO  indexer.DeleteDuplicates - Dedup: adding
indexes in: crawls/intranet/indexes
2008-05-21 09:44:13,310 WARN  mapred.LocalJobRunner - job_m8kjse
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
	at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
------------------------------------------------------------------

Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD) 
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@canterbury.ac.nz
http://www.canterbury.ac.nz/ 

For all web enquiries please contact:
websupport@canterbury.ac.nz Ext: 3100
http://www.canterbury.ac.nz/web  

-----Original Message-----
From: Rochelle Rees 
Sent: Tuesday, 20 May 2008 2:58 p.m.
To: 'nutch-user@lucene.apache.org'
Subject: Help Please! Nutch crawl fails on Dedup

Hi there,

I have a problem with my crawl failing at:

Dedup adding indexes in: crawls/test/indexes
Exception in thread "main" java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
	at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

I have tried searching for threads with a similar problem and found a
number - however the only solution I could find was to install the
patches from:
https://issues.apache.org/jira/browse/NUTCH-525
However running deleteDups.patch and RededupUnitTest.patch made no
difference whatsoever.

Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.

Intranet.canterbury.ac.nz requires authentication, so I ran the
NUTCH-559v0.5.patch file - however the error I have occurs with or
without this patch, and regardless of what I put in the
conf/httpclient-auth.xml file.

Does anyone have any ideas what I can do to fix this issue?

For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
urls/urls.txt files are pasted below.

Please let me know if you need any further info.

--------------------------------------------
conf/nutch-site.xml
--------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

  <name>http.agent.name</name>

  <value>University of Canterbury Intranet</value>

  <description>
    University of Canterbury Intranet
  </description>

</property>

 

<property>

  <name>http.agent.description</name>

  <value>Intranet for University of Canterbury</value>

  <description> Intranet for University of Canterbury

  </description>

</property>

 

<property>

  <name>http.agent.url</name>

  <value></value>

  <description>

  </description>

</property>

 

<property>

  <name>http.agent.email</name>

  <value>Web Support Email</value>

  <description>websupport@canterbury.ac.nz

  </description>

</property>

</configuration>
--------------------------------------------
--------------------------------------------

conf/crawl-urlfilter.txt
--------------------------------------------
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/

# skip everything else
-.
--------------------------------------------
--------------------------------------------
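
The first-match rule described in the comments of crawl-urlfilter.txt above
can be sketched in Python as follows. This is only an illustration, not
Nutch's own code: the RULES list and the accepts() helper are hypothetical
names, the patterns are copied from the file above, and the sketch assumes a
pattern may match anywhere in the URL unless it is anchored. It applies no
URL normalization.

```python
import re

# (sign, pattern) pairs in the same order as the crawl-urlfilter.txt above.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/"),
    ("-", r"."),
]


def accepts(url):
    """Return True if the first rule found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched at all: the URL is ignored.
    return False


if __name__ == "__main__":
    for candidate in (
        "http://intranet.canterbury.ac.nz/",
        "http://intranet.canterbury.ac.nz",
        "http://intranet.canterbury.ac.nz/logo.gif",
    ):
        print(candidate, accepts(candidate))
```

With these rules the sketch accepts http://intranet.canterbury.ac.nz/ but
rejects the same string without the trailing slash, because the accept
pattern ends in "/". Whether that matters in a real crawl depends on any
normalization applied to the seed URL before filtering.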


urls/urls.txt
--------------------------------------------
http://intranet.canterbury.ac.nz

--------------------------------------------
--------------------------------------------


Re: Help Please! Nutch crawl fails on Dedup

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On Tue, May 20, 2008 at 5:57 AM, Rochelle Rees <
rochelle.rees@canterbury.ac.nz> wrote:

> Hi there,
>
> I have a problem with my crawl failing at:
>
> Dedup adding indexes in: crawls/test/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)



This is a generic error that only tells us the job has failed. You should
have a more detailed log somewhere. For example, if you have a distributed
setup, check your tasktracker log files.
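
As a rough illustration of digging the real cause out of such a log, here is
a minimal Python sketch. It assumes the log uses the timestamp/level layout
seen in the log excerpt posted in this thread; the problem_entries() helper
and the sample lines are hypothetical and only for demonstration.

```python
import re

# A log entry starts with a timestamp and level, e.g.
# "2008-05-21 09:44:13,310 WARN  mapred.LocalJobRunner - ..."
ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s")


def problem_entries(lines, levels=("WARN", "ERROR", "FATAL")):
    """Group each WARN/ERROR/FATAL entry with the lines that follow it."""
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            # A new entry begins; keep it only if its level is of interest.
            current = [line] if match.group(1) in levels else None
            if current is not None:
                entries.append(current)
        elif current is not None:
            # No timestamp: a continuation line such as a stack trace frame.
            current.append(line)
    return entries


# Sample lines modelled on the log posted in this thread.
sample = [
    "2008-05-21 09:44:12,855 INFO  indexer.DeleteDuplicates - Dedup: starting",
    "2008-05-21 09:44:13,310 WARN  mapred.LocalJobRunner - job_m8kjse",
    "java.lang.ArrayIndexOutOfBoundsException: -1",
    "\tat org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)",
]

found = problem_entries(sample)
```

To run it against a real file, pass the open file object instead of the
sample list; the exact log location depends on your installation.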


>
>
> I have tried searching for threads with a similar problem and found a
> number - however the only solution I could find was to install the
> patches from:
> https://issues.apache.org/jira/browse/NUTCH-525
> However running deleteDups.patch and RededupUnitTest.patch made no
> difference whatsoever.
>
> Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
> www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.
>
> Intranet.canterbury.ac.nz requires authentication, so I ran the
> NUTCH-559v0.5.patch file - however the error I have occurs with or
> without this patch, and regardless of what I put in the
> conf/httpclient-auth.xml file.
>
> Does anyone have any ideas what I can do to fix this issue?
>
> For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
> urls/urls.txt files are pasted below.
>
> Please let me know if you need any further info.
>
> --------------------------------------------
> conf/nutch-site.xml
> --------------------------------------------
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>
>  <name>http.agent.name</name>
>
>  <value>University of Canterbury Intranet</value>
>
>  <description>
>    University of Canterbury Intranet
>  </description>
>
> </property>
>
>
>
> <property>
>
>  <name>http.agent.description</name>
>
>  <value>Intranet for University of Canterbury</value>
>
>  <description> Intranet for University of Canterbury
>
>  </description>
>
> </property>
>
>
>
> <property>
>
>  <name>http.agent.url</name>
>
>  <value></value>
>
>  <description>
>
>  </description>
>
> </property>
>
>
>
> <property>
>
>  <name>http.agent.email</name>
>
>  <value>Web Support Email</value>
>
>  <description>websupport@canterbury.ac.nz
>
>  </description>
>
> </property>
>
> </configuration>
> --------------------------------------------
> --------------------------------------------
>
> conf/crawl-urlfilter.txt
> --------------------------------------------
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/
>
> # skip everything else
> -.
> --------------------------------------------
> --------------------------------------------
>
>
> urls/urls.txt
> --------------------------------------------
> http://intranet.canterbury.ac.nz
>
> --------------------------------------------
> --------------------------------------------
>
>


-- 
Doğacan Güney