Posted to user@nutch.apache.org by Rochelle Rees <ro...@canterbury.ac.nz> on 2008/05/20 04:57:33 UTC
Help Please! Nutch crawl fails on Dedup
Hi there,
I have a problem with my crawl failing at:
Dedup adding indexes in: crawls/test/indexes
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
I searched for threads describing a similar problem and found several;
however, the only solution I could find was to apply the patches from:
https://issues.apache.org/jira/browse/NUTCH-525
However, applying deleteDups.patch and RededupUnitTest.patch made no
difference whatsoever.
Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.
intranet.canterbury.ac.nz requires authentication, so I applied the
NUTCH-559v0.5.patch file. However, the error occurs with or without
this patch, and regardless of what I put in the
conf/httpclient-auth.xml file.
Does anyone have any ideas what I can do to fix this issue?
For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
urls/urls.txt files are pasted below.
Please let me know if you need any further info.
--------------------------------------------
conf/nutch-site.xml
--------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>University of Canterbury Intranet</value>
<description>
University of Canterbury Intranet
</description>
</property>
<property>
<name>http.agent.description</name>
<value>Intranet for University of Canterbury</value>
<description> Intranet for University of Canterbury
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>
</description>
</property>
<property>
<name>http.agent.email</name>
<value>Web Support Email</value>
<description>websupport@canterbury.ac.nz
</description>
</property>
</configuration>
--------------------------------------------
--------------------------------------------
conf/crawl-urlfilter.txt
--------------------------------------------
# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/
# skip everything else
-.
--------------------------------------------
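For anyone checking the filter by hand, the first-match rule described in the comments of the file above can be approximated with the rough Python sketch below. It is not Nutch's own code: the `RULES` list and the `accepts` function are hypothetical names for illustration, the patterns are copied from the file above, and using `re.search` (match anywhere in the URL) is an assumption about how the regex filter applies each pattern.

```python
import re

# (sign, pattern) pairs in file order; '+' means include, '-' means ignore.
# Patterns are copied from the conf/crawl-urlfilter.txt pasted above.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/"),
    ("-", r"."),
]


def accepts(url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: a rule "matches" when its pattern is found anywhere in
    the URL (re.search). If no rule matches, the URL is ignored.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    for candidate in (
        "http://intranet.canterbury.ac.nz/",
        "http://intranet.canterbury.ac.nz",
        "http://intranet.canterbury.ac.nz/logo.gif",
        "http://www.lovepigs.org.nz/",
    ):
        print(candidate, accepts(candidate))
```

Under these assumptions the accept pattern only matches when the host is followed by a slash, so the bare form http://intranet.canterbury.ac.nz falls through to the final "-." rule. Whether that matters in a real crawl depends on whether the URL is normalized before it is filtered, which this sketch does not model.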
--------------------------------------------
urls/urls.txt
--------------------------------------------
http://intranet.canterbury.ac.nz
--------------------------------------------
--------------------------------------------
Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD)
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@canterbury.ac.nz
http://www.canterbury.ac.nz/
For all web enquiries please contact:
websupport@canterbury.ac.nz Ext: 3100
http://www.canterbury.ac.nz/web
RE: Help Please! Nutch crawl fails on Dedup
Posted by Rochelle Rees <ro...@canterbury.ac.nz>.
Sorry, I didn't realise what I was getting was a standard error.
Logs/hadoop.log is pasted below - hopefully that helps.
-------------------------------------------
2008-05-21 09:43:53,158 INFO crawl.Crawl - crawl started in:
crawls/intranet
2008-05-21 09:43:53,158 INFO crawl.Crawl - rootUrlDir = urls
2008-05-21 09:43:53,158 INFO crawl.Crawl - threads = 10
2008-05-21 09:43:53,158 INFO crawl.Crawl - depth = 3
2008-05-21 09:43:53,158 INFO crawl.Crawl - topN = 50
2008-05-21 09:43:53,236 INFO crawl.Injector - Injector: starting
2008-05-21 09:43:53,236 INFO crawl.Injector - Injector: crawlDb:
crawls/intranet/crawldb
2008-05-21 09:43:53,236 INFO crawl.Injector - Injector: urlDir: urls
2008-05-21 09:43:53,236 INFO crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2008-05-21 09:43:53,801 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:53,958 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:54,021 WARN regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
2008-05-21 09:43:54,758 INFO crawl.Injector - Injector: Merging
injected urls into crawl db.
2008-05-21 09:43:55,307 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2008-05-21 09:43:57,126 INFO crawl.Injector - Injector: done
2008-05-21 09:43:58,129 INFO crawl.Generator - Generator: Selecting
best-scoring urls due for fetch.
2008-05-21 09:43:58,129 INFO crawl.Generator - Generator: starting
2008-05-21 09:43:58,129 INFO crawl.Generator - Generator: segment:
crawls/intranet/segments/20080521094358
2008-05-21 09:43:58,129 INFO crawl.Generator - Generator: filtering:
false
2008-05-21 09:43:58,129 INFO crawl.Generator - Generator: topN: 50
2008-05-21 09:43:58,145 INFO crawl.Generator - Generator: jobtracker is
'local', generating exactly one partition.
2008-05-21 09:43:58,584 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:58,710 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:58,741 WARN regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2008-05-21 09:43:58,804 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:43:58,929 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:43:59,572 INFO crawl.Generator - Generator: Partitioning
selected urls by host, for politeness.
2008-05-21 09:43:59,949 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:00,043 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:00,058 WARN regex.RegexURLNormalizer - can't find
rules for scope 'partition', using default
2008-05-21 09:44:00,937 INFO crawl.Generator - Generator: done.
2008-05-21 09:44:00,937 INFO fetcher.Fetcher - Fetcher: starting
2008-05-21 09:44:00,937 INFO fetcher.Fetcher - Fetcher: segment:
crawls/intranet/segments/20080521094358
2008-05-21 09:44:01,329 INFO fetcher.Fetcher - Fetcher: threads: 10
2008-05-21 09:44:01,329 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:01,423 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:01,470 INFO fetcher.Fetcher - fetching
http://intranet.canterbury.ac.nz/
2008-05-21 09:44:01,486 FATAL api.RobotRulesParser - Agent we advertise (University of Canterbury Intranet) not listed first in 'http.robots.agents' property!
2008-05-21 09:44:01,486 INFO http.Http - http.proxy.host = null
2008-05-21 09:44:01,486 INFO http.Http - http.proxy.port = 8080
2008-05-21 09:44:01,486 INFO http.Http - http.timeout = 10000
2008-05-21 09:44:01,486 INFO http.Http - http.content.limit = 65536
2008-05-21 09:44:01,486 INFO http.Http - http.agent = University of
Canterbury Intranet/Nutch-0.9 (Intranet for University of Canterbury;
Web Support Email)
2008-05-21 09:44:01,486 INFO http.Http - protocol.plugin.check.blocking
= true
2008-05-21 09:44:01,486 INFO http.Http - protocol.plugin.check.robots =
true
2008-05-21 09:44:01,486 INFO http.Http - fetcher.server.delay = 1000
2008-05-21 09:44:01,486 INFO http.Http - http.max.delays = 1000
2008-05-21 09:44:03,195 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:03,289 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,308 INFO fetcher.Fetcher - Fetcher: done
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: starting
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: db:
crawls/intranet/crawldb
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: segments:
[crawls/intranet/segments/20080521094358]
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: additions
allowed: true
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: URL
normalizing: true
2008-05-21 09:44:04,308 INFO crawl.CrawlDb - CrawlDb update: URL
filtering: true
2008-05-21 09:44:04,324 INFO crawl.CrawlDb - CrawlDb update: Merging
segment data into db.
2008-05-21 09:44:04,700 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:04,795 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,810 WARN regex.RegexURLNormalizer - can't find
rules for scope 'crawldb', using default
2008-05-21 09:44:04,857 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:04,936 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:04,951 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2008-05-21 09:44:05,030 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:05,140 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:05,202 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:05,296 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:05,783 INFO crawl.CrawlDb - CrawlDb update: done
2008-05-21 09:44:06,786 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2008-05-21 09:44:06,786 INFO crawl.Generator - Generator: starting
2008-05-21 09:44:06,786 INFO crawl.Generator - Generator: segment: crawls/intranet/segments/20080521094406
2008-05-21 09:44:06,786 INFO crawl.Generator - Generator: filtering: false
2008-05-21 09:44:06,786 INFO crawl.Generator - Generator: topN: 50
2008-05-21 09:44:06,802 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2008-05-21 09:44:07,178 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:07,257 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:07,319 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:07,461 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:08,166 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2008-05-21 09:44:08,166 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
2008-05-21 09:44:08,166 INFO crawl.LinkDb - LinkDb: starting
2008-05-21 09:44:08,166 INFO crawl.LinkDb - LinkDb: linkdb: crawls/intranet/linkdb
2008-05-21 09:44:08,166 INFO crawl.LinkDb - LinkDb: URL normalize: true
2008-05-21 09:44:08,166 INFO crawl.LinkDb - LinkDb: URL filter: true
2008-05-21 09:44:08,182 INFO crawl.LinkDb - LinkDb: adding segment: crawls/intranet/segments/20080521094358
2008-05-21 09:44:10,440 INFO crawl.LinkDb - LinkDb: done
2008-05-21 09:44:10,440 INFO indexer.Indexer - Indexer: starting
2008-05-21 09:44:10,440 INFO indexer.Indexer - Indexer: linkdb: crawls/intranet/linkdb
2008-05-21 09:44:10,456 INFO indexer.Indexer - Indexer: adding segment: crawls/intranet/segments/20080521094358
2008-05-21 09:44:10,848 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:10,911 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:10,926 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:10,926 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:10,958 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,083 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,083 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,115 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,177 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,193 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,240 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,319 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,319 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,350 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,444 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,444 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,789 INFO plugin.PluginRepository - Plugins: looking
in: C:\nutch\plugins
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Registered
Plugins:
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - the
nutch core extension points (nutch-extensionpoints)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Basic
Query Filter (query-basic)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Basic
URL Normalizer (urlnormalizer-basic)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Basic
Indexing Filter (index-basic)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Html
Parse Plug-in (parse-html)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Basic
Summarizer Plug-in (summary-basic)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Site
Query Filter (query-site)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - HTTP
Framework (lib-http)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Text
Parse Plug-in (parse-text)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Regex
URL Filter (urlfilter-regex)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository -
Pass-through URL Normalizer (urlnormalizer-pass)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Http
Protocol Plug-in (protocol-http)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Regex
URL Normalizer (urlnormalizer-regex)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - OPIC
Scoring Plug-in (scoring-opic)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository -
CyberNeko HTML Parser (lib-nekohtml)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository -
JavaScript Parser (parse-js)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - URL
Query Filter (query-url)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Regex
URL Filter Framework (lib-regex-filter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Registered
Extension-Points:
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Summarizer (org.apache.nutch.searcher.Summarizer)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
URL Normalizer (org.apache.nutch.net.URLNormalizer)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Protocol (org.apache.nutch.protocol.Protocol)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
URL Filter (org.apache.nutch.net.URLFilter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Online Search Results Clustering Plugin
(org.apache.nutch.clustering.OnlineClusterer)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - HTML
Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Content Parser (org.apache.nutch.parse.Parser)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Scoring (org.apache.nutch.scoring.ScoringFilter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Nutch
Query Filter (org.apache.nutch.searcher.QueryFilter)
2008-05-21 09:44:11,914 INFO plugin.PluginRepository - Ontology
Model Loader (org.apache.nutch.ontology.Ontology)
2008-05-21 09:44:11,914 INFO indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-05-21 09:44:11,993 INFO indexer.Indexer - Optimizing index.
2008-05-21 09:44:12,855 INFO indexer.Indexer - Indexer: done
2008-05-21 09:44:12,855 INFO indexer.DeleteDuplicates - Dedup: starting
2008-05-21 09:44:12,871 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawls/intranet/indexes
2008-05-21 09:44:13,310 WARN mapred.LocalJobRunner - job_m8kjse
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
------------------------------------------------------------------
Regards
Rochelle Rees
Web Team, Student Recruitment and Development (SRD)
University of Canterbury, Te Whare Wananga o Waitaha
Rm: 419, Law Building
+64-3-364 2987 Ext: 6125
rochelle.rees@canterbury.ac.nz
http://www.canterbury.ac.nz/
For all web enquiries please contact:
websupport@canterbury.ac.nz Ext: 3100
http://www.canterbury.ac.nz/web
Re: Help Please! Nutch crawl fails on Dedup
Posted by Doğacan Güney <do...@gmail.com>.
Hi,
On Tue, May 20, 2008 at 5:57 AM, Rochelle Rees <
rochelle.rees@canterbury.ac.nz> wrote:
> Hi there,
>
> I have a problem with my crawl failing at:
>
> Dedup adding indexes in: crawls/test/indexes
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
This is a generic error that only tells us that the job has failed. There
should be a more detailed log somewhere. For example, if you have a
distributed setup, check your tasktracker log files.
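To pull the underlying exception out of a long log, something like the following may help. It is only a sketch: it assumes log entries start with a timestamp and level in the format seen in the logs posted in this thread, and the function name find_failures is made up for illustration.

```python
import re

# Start of a log entry in the format seen in this thread, e.g.
# "2008-05-21 09:44:13,310 WARN mapred.LocalJobRunner - job_m8kjse"
ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (\w+)\s")


def find_failures(lines, levels=("WARN", "ERROR", "FATAL")):
    """Return (header, continuation_lines) for each entry at the given levels.

    Continuation lines are the non-blank lines that follow an entry and do
    not start a new entry themselves (typically the stack trace).
    """
    failures = []
    current = None  # continuation list of the entry being collected, if any
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            current = None
            if match.group(1) in levels:
                current = []
                failures.append((line, current))
        elif current is not None and line.strip():
            current.append(line.strip())
    return failures
```

Passing an open log file to find_failures (for a local run the file is often logs/hadoop.log, but that path is an assumption about your setup) and printing each header with its continuation lines should surface the exception behind the generic "Job failed!" message.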
>
>
> I have tried searching for threads with a similar problem and found a
> number - however the only solution I could find was to install the
> patches from:
> https://issues.apache.org/jira/browse/NUTCH-525
> However running deleteDups.patch and RededupUnitTest.patch made no
> difference whatsoever.
>
> Now, interestingly, my crawl runs fine on www.lovepigs.org.nz and
> www.tegelchicken.co.nz, but fails when I try intranet.canterbury.ac.nz.
>
> Intranet.canterbury.ac.nz requires authentication, so I ran the
> NUTCH-559v0.5.patch file - however the error I have occurs with or
> without this patch, and regardless of what I put in the
> conf/httpclient-auth.xml file.
>
> Does anyone have any ideas what I can do to fix this issue?
>
> For reference, my conf/nutch-site.xml, conf/crawl-urlfilter.txt and
> urls/urls.txt files are pasted below.
>
> Please let me know if you need any further info.
>
> --------------------------------------------
> conf/nutch-site.xml
> --------------------------------------------
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
> <name>http.agent.name</name>
> <value>University of Canterbury Intranet</value>
> <description>
> University of Canterbury Intranet
> </description>
> </property>
>
> <property>
> <name>http.agent.description</name>
> <value>Intranet for University of Canterbury</value>
> <description> Intranet for University of Canterbury
> </description>
> </property>
>
> <property>
> <name>http.agent.url</name>
> <value></value>
> <description>
> </description>
> </property>
>
> <property>
> <name>http.agent.email</name>
> <value>Web Support Email</value>
> <description>websupport@canterbury.ac.nz
> </description>
> </property>
> </configuration>
> --------------------------------------------
> --------------------------------------------
>
> conf/crawl-urlfilter.txt
> --------------------------------------------
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/
>
> # skip everything else
> -.
> --------------------------------------------
> --------------------------------------------
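The first-match rule described in the filter file's comments can be sketched roughly as below. This is not Nutch's code, just an illustration in Python: load_rules and accepts are made-up names, and it assumes each pattern is searched for anywhere in the URL rather than having to match the whole URL.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first matching pattern decides; if nothing matches, ignore the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


# The rules from the conf/crawl-urlfilter.txt quoted above.
RULES = load_rules(r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
-.*(/.+?)/.*?\1/.*?\1/
+^http://([a-z0-9]*\.)*intranet.canterbury.ac.nz/
# skip everything else
-.
""")

print(accepts(RULES, "http://intranet.canterbury.ac.nz/"))                # True
print(accepts(RULES, "http://intranet.canterbury.ac.nz/search?q=nutch"))  # False
print(accepts(RULES, "http://www.canterbury.ac.nz/"))                     # False
```

In this sketch the seed exactly as written in urls/urls.txt (no trailing slash) is rejected, because the accept rule ends in "/". Whether that matters in a real crawl depends on any URL normalization applied before filtering, which the sketch does not model.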
>
>
> urls/urls.txt
> --------------------------------------------
> http://intranet.canterbury.ac.nz
>
> --------------------------------------------
> --------------------------------------------
>
> Regards
> Rochelle Rees
> Web Team, Student Recruitment and Development (SRD)
> University of Canterbury, Te Whare Wananga o Waitaha
> Rm: 419, Law Building
> +64-3-364 2987 Ext: 6125
> rochelle.rees@canterbury.ac.nz
> http://www.canterbury.ac.nz/
>
> For all web enquiries please contact:
> websupport@canterbury.ac.nz Ext: 3100
> http://www.canterbury.ac.nz/web
>
>
--
Doğacan Güney