Posted to user@nutch.apache.org by Tim Benke <ze...@fusemail.com> on 2007/01/11 15:16:20 UTC
nutch in eclipse, No input directories specified
Hi,
Thanks to these guides, I was able to set up Nutch in Eclipse:
http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
http://wiki.apache.org/nutch/RunNutchInEclipse
However, when I run the crawl I get this exception:
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml , mapred-default.xml ,
/tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
hadoop-site.xml
Arguments in the Eclipse run configuration:
Program arguments:
urls -dir crawl -depth 3 -topN 50
VM arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
The environment variables NUTCH_JAVA_HOME and JAVA_HOME are set.
Contents of the file urls/nutch:
http://lucene.apache.org/nutch/
I really hope someone can help me with this; I need Nutch for my
bachelor thesis.
Regards,
Tim Benke
The complete log is:
2007-01-11 14:03:29,831 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:29,940 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:30,003 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:30,018 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,018 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(89)) - crawl
started in: crawl
2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(90)) -
rootUrlDir = urls
2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(91)) -
threads = 10
2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(92)) - depth = 3
2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(94)) - topN = 50
2007-01-11 14:03:30,097 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:30,112 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:30,128 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(135))
- Injector: starting
2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(136))
- Injector: crawlDb: crawl/crawldb
2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(137))
- Injector: urlDir: urls
2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(147))
- Injector: Converting injected urls to crawl db entries.
2007-01-11 14:03:30,175 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:30,175 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:30,190 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:30,206 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,206 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,425 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:30,425 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:30,440 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:30,440 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,456 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,456 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,472 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:30,487 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,503 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
2007-01-11 14:03:30,518 INFO mapred.JobClient
(JobClient.java:runJob(370)) - Running job: job_qo4f9q
2007-01-11 14:03:30,534 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:30,534 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,534 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
2007-01-11 14:03:30,565 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:30,643 INFO mapred.MapTask (MapTask.java:run(155)) -
opened part-0.out
2007-01-11 14:03:30,675 INFO plugin.PluginRepository
(PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
C:\wkspc\nutch_trunk\tmpBuild\src\plugin
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
mode: [true]
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(310)) - Registered Plugins:
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Creative Commons
Plugins (creativecommons)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Site Query Filter
(query-site)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http / Https Protocol
Plug-in (protocol-httpclient)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
(parse-html)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
(parse-pdf)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
(parse-msexcel)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - JavaScript Parser
(parse-js)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - URL Query Filter
(query-url)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
(parse-swf)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
(protocol-ftp)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
(analysis-fr)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
(parse-mp3)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
(parse-zip)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Online Search Results
Clustering using Carrot2's Lingo component (clustering-carrot2)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Suffix URL Filter
(urlfilter-suffix)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
(parse-rtf)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Language Identification
Parser/Filter (language-identifier)
2007-01-11 14:03:30,987 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
(parse-msword)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
(parse-text)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
(analysis-de)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
(urlnormalizer-regex)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
Parse Plug-in (parse-oo)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Automaton URL Filter
(urlfilter-automaton)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Highlighter
Summary Plug-in (summary-lucene)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Subcollection indexing
and query filter (subcollection)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
Framework (lib-regex-filter)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Analysers
(lib-lucene-analyzers)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
(index-basic)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Summarizer
Plug-in (summary-basic)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
(urlfilter-regex)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - External Parser Plug-in
(parse-ext)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
(protocol-http)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - the nutch core
extension points (nutch-extensionpoints)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Indexing Filter
(index-more)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Query Filter
(query-more)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
(lib-nekohtml)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Prefix URL Filter
(urlfilter-prefix)
2007-01-11 14:03:31,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
Plug-in (parse-mspowerpoint)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
(urlnormalizer-basic)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
Client (lib-commons-httpclient)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
(protocol-file)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
To Access Microsoft Format Files (lib-jakarta-poi)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Query Filter
(query-basic)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Parse MS Documents
Framework (lib-parsems)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
(parse-rss)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
(scoring-opic)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-01-11 14:03:31,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-01-11 14:03:31,065 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
suffix-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
2007-01-11 14:03:31,065 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
automaton-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
2007-01-11 14:03:31,456 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
crawl-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
2007-01-11 14:03:31,472 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
not found
2007-01-11 14:03:31,487 WARN regex.RegexURLNormalizer
(RegexURLNormalizer.java:regexNormalize(159)) - can't find rules for
scope 'inject', using default
2007-01-11 14:03:31,487 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) - C:/wkspc/nutch_trunk/urls/nutch:0+33
2007-01-11 14:03:31,503 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:31,503 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:31,503 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
2007-01-11 14:03:31,518 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:31,534 INFO mapred.JobClient
(JobClient.java:runJob(385)) - map 100% reduce 0%
2007-01-11 14:03:31,753 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) - reduce > reduce
2007-01-11 14:03:32,534 INFO mapred.JobClient
(JobClient.java:runJob(401)) - Job complete: job_qo4f9q
2007-01-11 14:03:32,534 INFO crawl.Injector (Injector.java:inject(163))
- Injector: Merging injected urls into crawl db.
2007-01-11 14:03:32,534 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:32,534 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:32,534 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:32,550 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,550 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,581 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:32,597 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:32,597 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:32,597 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,612 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,612 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,628 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:32,628 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,628 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
2007-01-11 14:03:32,628 INFO mapred.JobClient
(JobClient.java:runJob(370)) - Running job: job_xiod9g
2007-01-11 14:03:32,643 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:32,643 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,643 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
2007-01-11 14:03:32,643 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,675 INFO mapred.MapTask (MapTask.java:run(155)) -
opened part-0.out
2007-01-11 14:03:32,675 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) -
C:/tmp/hadoop-tbenke/mapred/temp/inject-temp-2045807797/part-00000:0+82
2007-01-11 14:03:32,690 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:32,706 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,706 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
2007-01-11 14:03:32,706 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:32,722 INFO plugin.PluginRepository
(PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
C:\wkspc\nutch_trunk\tmpBuild\src\plugin
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
mode: [true]
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(310)) - Registered Plugins:
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Creative Commons
Plugins (creativecommons)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Site Query Filter
(query-site)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http / Https Protocol
Plug-in (protocol-httpclient)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
(parse-html)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
(parse-pdf)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
(parse-msexcel)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - JavaScript Parser
(parse-js)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - URL Query Filter
(query-url)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
(parse-swf)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
(protocol-ftp)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
(analysis-fr)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
(parse-mp3)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
(parse-zip)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Online Search Results
Clustering using Carrot2's Lingo component (clustering-carrot2)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Suffix URL Filter
(urlfilter-suffix)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
(parse-rtf)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Language Identification
Parser/Filter (language-identifier)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
(parse-msword)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
(parse-text)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
(analysis-de)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
(urlnormalizer-regex)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
Parse Plug-in (parse-oo)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Automaton URL Filter
(urlfilter-automaton)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Highlighter
Summary Plug-in (summary-lucene)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Subcollection indexing
and query filter (subcollection)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
Framework (lib-regex-filter)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Analysers
(lib-lucene-analyzers)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
(index-basic)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Summarizer
Plug-in (summary-basic)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
(urlfilter-regex)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - External Parser Plug-in
(parse-ext)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
(protocol-http)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - the nutch core
extension points (nutch-extensionpoints)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Indexing Filter
(index-more)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Query Filter
(query-more)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
(lib-nekohtml)
2007-01-11 14:03:33,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Prefix URL Filter
(urlfilter-prefix)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
Plug-in (parse-mspowerpoint)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
(urlnormalizer-basic)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
Client (lib-commons-httpclient)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
(protocol-file)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
To Access Microsoft Format Files (lib-jakarta-poi)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Query Filter
(query-basic)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Parse MS Documents
Framework (lib-parsems)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
(parse-rss)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
(scoring-opic)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-01-11 14:03:33,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-01-11 14:03:33,143 WARN util.NativeCodeLoader
(NativeCodeLoader.java:<clinit>(50)) - Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
2007-01-11 14:03:33,175 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) - reduce > reduce
2007-01-11 14:03:33,628 INFO mapred.JobClient
(JobClient.java:runJob(401)) - Job complete: job_xiod9g
2007-01-11 14:03:33,659 INFO crawl.Injector (Injector.java:inject(173))
- Injector: done
2007-01-11 14:03:34,659 INFO crawl.Generator
(Generator.java:generate(371)) - Generator: Selecting best-scoring urls
due for fetch.
2007-01-11 14:03:34,659 INFO crawl.Generator
(Generator.java:generate(372)) - Generator: starting
2007-01-11 14:03:34,659 INFO crawl.Generator
(Generator.java:generate(373)) - Generator: segment:
crawl/segments/20070111140334
2007-01-11 14:03:34,659 INFO crawl.Generator
(Generator.java:generate(374)) - Generator: filtering: false
2007-01-11 14:03:34,659 INFO crawl.Generator
(Generator.java:generate(376)) - Generator: topN: 50
2007-01-11 14:03:34,659 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:34,659 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:34,675 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:34,675 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,675 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,675 INFO crawl.Generator
(Generator.java:generate(388)) - Generator: jobtracker is 'local',
generating exactly one partition.
2007-01-11 14:03:34,706 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:34,722 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:34,722 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:34,737 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,737 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,737 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,737 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:34,753 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,753 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
2007-01-11 14:03:34,753 INFO mapred.JobClient
(JobClient.java:runJob(370)) - Running job: job_m7h3ig
2007-01-11 14:03:34,753 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:34,768 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,768 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
2007-01-11 14:03:34,784 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:34,784 INFO mapred.MapTask (MapTask.java:run(155)) -
opened part-0.out
2007-01-11 14:03:34,784 INFO plugin.PluginRepository
(PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
C:\wkspc\nutch_trunk\tmpBuild\src\plugin
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
mode: [true]
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(310)) - Registered Plugins:
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Creative Commons
Plugins (creativecommons)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Site Query Filter
(query-site)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http / Https Protocol
Plug-in (protocol-httpclient)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
(parse-html)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
(parse-pdf)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
(parse-msexcel)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - JavaScript Parser
(parse-js)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - URL Query Filter
(query-url)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
(parse-swf)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
(protocol-ftp)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
(analysis-fr)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
(parse-mp3)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
(parse-zip)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Online Search Results
Clustering using Carrot2's Lingo component (clustering-carrot2)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Suffix URL Filter
(urlfilter-suffix)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
(parse-rtf)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Language Identification
Parser/Filter (language-identifier)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
(parse-msword)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
(parse-text)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
(analysis-de)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
(urlnormalizer-regex)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
Parse Plug-in (parse-oo)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Automaton URL Filter
(urlfilter-automaton)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Highlighter
Summary Plug-in (summary-lucene)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Subcollection indexing
and query filter (subcollection)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
Framework (lib-regex-filter)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Analysers
(lib-lucene-analyzers)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
(index-basic)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Summarizer
Plug-in (summary-basic)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
(urlfilter-regex)
2007-01-11 14:03:35,003 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - External Parser Plug-in
(parse-ext)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
(protocol-http)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - the nutch core
extension points (nutch-extensionpoints)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Indexing Filter
(index-more)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Query Filter
(query-more)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
(lib-nekohtml)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Prefix URL Filter
(urlfilter-prefix)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
Plug-in (parse-mspowerpoint)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
(urlnormalizer-basic)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
Client (lib-commons-httpclient)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
(protocol-file)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
To Access Microsoft Format Files (lib-jakarta-poi)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Query Filter
(query-basic)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Parse MS Documents
Framework (lib-parsems)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
(parse-rss)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
(scoring-opic)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-01-11 14:03:35,018 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-01-11 14:03:35,018 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
suffix-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
2007-01-11 14:03:35,018 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
automaton-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
2007-01-11 14:03:35,128 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
crawl-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
2007-01-11 14:03:35,128 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
not found
2007-01-11 14:03:35,143 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) -
C:/wkspc/nutch_trunk/crawl/crawldb/current/part-00000/data:0+125
2007-01-11 14:03:35,159 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:35,175 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,175 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
2007-01-11 14:03:35,175 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,190 INFO plugin.PluginRepository
(PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
C:\wkspc\nutch_trunk\tmpBuild\src\plugin
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
mode: [true]
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(310)) - Registered Plugins:
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Creative Commons
Plugins (creativecommons)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Site Query Filter
(query-site)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http / Https Protocol
Plug-in (protocol-httpclient)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
(parse-html)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
(parse-pdf)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
(parse-msexcel)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - JavaScript Parser
(parse-js)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - URL Query Filter
(query-url)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
(parse-swf)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
(protocol-ftp)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
(analysis-fr)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
(parse-mp3)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
(parse-zip)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Online Search Results
Clustering using Carrot2's Lingo component (clustering-carrot2)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Suffix URL Filter
(urlfilter-suffix)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
(parse-rtf)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Language Identification
Parser/Filter (language-identifier)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
(parse-msword)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
(parse-text)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
(analysis-de)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
(urlnormalizer-regex)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
Parse Plug-in (parse-oo)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Automaton URL Filter
(urlfilter-automaton)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Highlighter
Summary Plug-in (summary-lucene)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Subcollection indexing
and query filter (subcollection)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
Framework (lib-regex-filter)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Lucene Analysers
(lib-lucene-analyzers)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
(index-basic)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Summarizer
Plug-in (summary-basic)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Regex URL Filter
(urlfilter-regex)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - External Parser Plug-in
(parse-ext)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
(protocol-http)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - the nutch core
extension points (nutch-extensionpoints)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Indexing Filter
(index-more)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - More Query Filter
(query-more)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
(lib-nekohtml)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Prefix URL Filter
(urlfilter-prefix)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
Plug-in (parse-mspowerpoint)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
(urlnormalizer-basic)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Pass-through URL
Normalizer (urlnormalizer-pass)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
Client (lib-commons-httpclient)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
(protocol-file)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
To Access Microsoft Format Files (lib-jakarta-poi)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Basic Query Filter
(query-basic)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - Parse MS Documents
Framework (lib-parsems)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
(parse-rss)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
(scoring-opic)
2007-01-11 14:03:35,394 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2007-01-11 14:03:35,409 INFO plugin.PluginRepository
(PluginRepository.java:displayStatus(325)) - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2007-01-11 14:03:35,409 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
suffix-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
2007-01-11 14:03:35,409 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
automaton-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
2007-01-11 14:03:35,519 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(441)) - found resource
crawl-urlfilter.txt at
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
2007-01-11 14:03:35,519 INFO conf.Configuration
(Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
not found
2007-01-11 14:03:35,706 INFO mapred.LocalJobRunner
(LocalJobRunner.java:progress(169)) - reduce > reduce
2007-01-11 14:03:35,753 INFO mapred.JobClient
(JobClient.java:runJob(401)) - Job complete: job_m7h3ig
2007-01-11 14:03:35,753 WARN crawl.Generator
(Generator.java:generate(419)) - Generator: 0 records selected for
fetching, exiting ...
2007-01-11 14:03:35,753 INFO crawl.Crawl (Crawl.java:main(121)) -
Stopping at depth=0 - no more URLs to fetch.
2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(219)) -
LinkDb: starting
2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(220)) -
LinkDb: linkdb: crawl/linkdb
2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(221)) -
LinkDb: URL normalize: true
2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(222)) -
LinkDb: URL filter: true
2007-01-11 14:03:35,769 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:35,769 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:35,784 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:35,784 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,784 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,800 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:35,800 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
2007-01-11 14:03:35,815 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
2007-01-11 14:03:35,815 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,815 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,815 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,831 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
2007-01-11 14:03:35,831 INFO conf.Configuration
(Configuration.java:loadResource(495)) - parsing
jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
2007-01-11 14:03:35,847 INFO conf.Configuration
(Configuration.java:loadResource(504)) - parsing
/tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xml
2007-01-11 14:03:35,847 INFO mapred.JobClient
(JobClient.java:runJob(370)) - Running job: job_kumfin
2007-01-11 14:03:35,847 WARN mapred.LocalJobRunner
(LocalJobRunner.java:run(147)) - job_kumfin
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml , mapred-default.xml ,
/tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
hadoop-site.xml
at
org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:99)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:39)
at
org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:119)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:93)
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
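A note on reading the log above: the Generator reports "0 records selected for fetching" and the crawl stops at depth=0, so it looks like no segment was ever written under crawl/segments by the time LinkDb.invert runs, which would fit the "No input directories specified" error. This is an inference from the log, not something confirmed in the thread. Below is a minimal Java sketch of a check one could run before the LinkDb step. SegmentCheck and countSegments are hypothetical names, not part of Nutch or Hadoop, and the crawl/segments layout is taken from the log lines above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: counts the segment directories
// under <crawlDir>/segments, the layout suggested by the log above.
class SegmentCheck {

    // Returns the number of subdirectories of <crawlDir>/segments,
    // or 0 if the segments directory does not exist at all.
    static int countSegments(Path crawlDir) throws IOException {
        Path segments = crawlDir.resolve("segments");
        if (!Files.isDirectory(segments)) {
            return 0;
        }
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segments)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "crawl" matches the -dir argument used in the original run.
        Path crawlDir = Paths.get(args.length > 0 ? args[0] : "crawl");
        int n = countSegments(crawlDir);
        if (n == 0) {
            System.out.println("No segments under " + crawlDir.resolve("segments")
                    + "; a link inversion job would have no input directories.");
        } else {
            System.out.println(n + " segment(s) found under " + crawlDir.resolve("segments"));
        }
    }
}
```

If the count is 0, it may be worth checking whether the seed URL in urls/nutch passes the rules in crawl-urlfilter.txt, since a filtered-out seed would leave the crawldb empty and give the Generator nothing to select.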
Re: nutch-0.8 bundle for eclipse
Posted by Renaud Richardet <re...@oslutions.com>.
I will try ;-)
Cheers,
Renaud
jian chen wrote:
> Hi, Renaud,
>
> Thanks for the info; this is very useful stuff, especially for using
> Eclipse to develop Java apps.
>
> Is it possible to keep this going for future releases of Nutch?
>
> Cheers,
>
> Jian
> www.hongandjian.com
>
> On 1/15/07, Renaud Richardet <re...@oslutions.com> wrote:
>
> Hello,
>
> It seems like many people are having questions re running Nutch in
> Eclipse, so here's a bundled version of Nutch-0.8 that can be imported
> into Eclipse. It should get you up to speed very quickly. I tested
> it on
> Ubuntu and WinXP. Please let me know if you find any configuration
> problems.
>
> http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.tar.gz (*nix)
> http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.zip (windows)
>
> Requirements:
> Eclipse 3.2
> Java 1.4 or higher, tested with 1.5
>
> Import project into Eclipse:
> From the "File" menu select "Import..." and select "General",
> "Existing
> Project into Workspace", Click "Next >"
> Click "Browse..." next to "Select Root directory " and navigate to
> the
> root of this document. Click "Open"
> Click "Finish" and the Package Explorer will show the project.
>
> Configure:
> Change the value CHANGE<ME in the file conf\nutch-site.xml
> NUTCH WILL NOT RUN OTHERWISE
>
> Run it:
> Crawl: menu "Run", "Run..." then double click on "Crawl" on the
> left list
> Search: menu "Run", "Run..." then double click on "SearchBean"
> By default, Nutch is set up to crawl http://www.cnn.com and
> http://www.nytimes.com/
>
> More info:
> see README-FIRST.txt
> http://lucene.apache.org/nutch/tutorial.html
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> HTH,
> Renaud
>
Re: nutch-0.8 bundle for eclipse
Posted by jian chen <ch...@gmail.com>.
Hi, Renaud,
Thanks for the info; this is very useful stuff, especially for using Eclipse
to develop Java apps.
Is it possible to keep this going for future releases of Nutch?
Cheers,
Jian
www.hongandjian.com
On 1/15/07, Renaud Richardet <re...@oslutions.com> wrote:
>
> Hello,
>
> It seems like many people are having questions re running Nutch in
> Eclipse, so here's a bundled version of Nutch-0.8 that can be imported
> into Eclipse. It should get you up to speed very quickly. I tested it on
> Ubuntu and WinXP. Please let me know if you find any configuration problems.
>
> http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.tar.gz (*nix)
> http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.zip (windows)
>
> Requirements:
> Eclipse 3.2
> Java 1.4 or higher, tested with 1.5
>
> Import project into Eclipse:
> From the "File" menu select "Import..." and select "General", "Existing
> Project into Workspace", Click "Next >"
> Click "Browse..." next to "Select Root directory " and navigate to the
> root of this document. Click "Open"
> Click "Finish" and the Package Explorer will show the project.
>
> Configure:
> Change the value CHANGE<ME in the file conf\nutch-site.xml
> NUTCH WILL NOT RUN OTHERWISE
>
> Run it:
> Crawl: menu "Run", "Run..." then double click on "Crawl" on the left list
> Search: menu "Run", "Run..." then double click on "SearchBean"
> By default, Nutch is set up to crawl http://www.cnn.com and
> http://www.nytimes.com/
>
> More infos:
> see README-FIRST.txt
> http://lucene.apache.org/nutch/tutorial.html
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> HTH,
> Renaud
>
> --
> renaud richardet +1 617 230 9112
> renaud <at> oslutions.com http://www.oslutions.com
>
>
nutch-0.8 bundle for eclipse
Posted by Renaud Richardet <re...@oslutions.com>.
Hello,
It seems like many people are having questions re running Nutch in
Eclipse, so here’s a bundled version of Nutch-0.8 that can be imported
into Eclipse. It should get you up to speed very quickly. I tested it on
Ubuntu and WinXP. Please let me know if you find any configuration problems.
http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.tar.gz (*nix)
http://www.oslutions.com/ren/nutch/nutch-0.8-eclipse.zip (windows)
Requirements:
Eclipse 3.2
Java 1.4 or higher, tested with 1.5
Import project into Eclipse:
From the "File" menu select "Import..." and select "General", "Existing
Project into Workspace", Click "Next >"
Click "Browse..." next to "Select Root directory " and navigate to the
root of this document. Click "Open"
Click "Finish" and the Package Explorer will show the project.
Configure:
Change the value CHANGE<ME in the file conf\nutch-site.xml
NUTCH WILL NOT RUN OTHERWISE
Run it:
Crawl: menu "Run", "Run..." then double-click on "Crawl" in the left list
Search: menu "Run", "Run..." then double-click on "SearchBean"
By default, Nutch is set up to crawl http://www.cnn.com and
http://www.nytimes.com/
More info:
see README-FIRST.txt
http://lucene.apache.org/nutch/tutorial.html
http://wiki.apache.org/nutch/RunNutchInEclipse
HTH,
Renaud
--
renaud richardet +1 617 230 9112
renaud <at> oslutions.com http://www.oslutions.com
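The "Configure" step above can be double-checked before launching a crawl. The following is only a minimal sketch, not part of the bundle: it assumes the placeholder text is literally the string quoted in the message and that the file lives at conf/nutch-site.xml relative to the working directory. The class name SiteConfigCheck and its helper are hypothetical, and it needs a newer JDK (7 or later) than the one the bundle was tested with.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reports whether the placeholder is still in the site config.
public class SiteConfigCheck {

    // True if the given config text still contains the placeholder string.
    public static boolean containsPlaceholder(String xmlText, String placeholder) {
        return xmlText.contains(placeholder);
    }

    public static void main(String[] args) throws IOException {
        // Both defaults are assumptions taken from the message text; override via arguments.
        Path siteFile = Paths.get(args.length > 0 ? args[0] : "conf/nutch-site.xml");
        String placeholder = args.length > 1 ? args[1] : "CHANGE<ME";

        if (!Files.isRegularFile(siteFile)) {
            System.out.println("Not found: " + siteFile.toAbsolutePath());
            return;
        }
        // Read the file as plain text; no XML parsing is attempted.
        String text = new String(Files.readAllBytes(siteFile), StandardCharsets.UTF_8);
        if (containsPlaceholder(text, placeholder)) {
            System.out.println("Placeholder still present; edit " + siteFile);
        } else {
            System.out.println("No placeholder found in " + siteFile);
        }
    }
}
```

Run it from the project root so that the relative path resolves; pass a different path or placeholder as the first and second argument if your copy differs.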
Re: nutch in eclipse, No input directories specified
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Please post (you can't attach) a copy of your nutch-site.xml file and
your .classpath file.
Dennis Kubes
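A quick way to gather related information from inside Eclipse is to run a small class with the same launch configuration as the crawl. This is only a sketch under assumptions: the class name ConfClasspathCheck is made up, and it assumes the configuration files named in this thread and in the log (nutch-default.xml, nutch-site.xml, crawl-tool.xml) are looked up as classpath resources. It prints the effective classpath, the working directory that a relative path such as "urls" resolves against, and where each file is found, if at all.

```java
import java.net.URL;

// Hypothetical diagnostic: shows which config files are visible on the classpath.
public class ConfClasspathCheck {

    // Returns the URL of the named classpath resource, or null if it is not visible.
    public static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    public static void main(String[] args) {
        // Effective classpath of this launch configuration.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        // Relative paths such as "urls" and "crawl" resolve against this directory.
        System.out.println("user.dir = " + System.getProperty("user.dir"));

        // File names taken from this thread and the log output; adjust as needed.
        String[] names = {"nutch-default.xml", "nutch-site.xml", "crawl-tool.xml"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT FOUND" : url.toString()));
        }
    }
}
```

If nutch-site.xml shows up as NOT FOUND, or is found somewhere other than the expected conf folder, that would point at the classpath setup rather than at the urls file.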
Tim Benke wrote:
> Guys, please help me with this. I have tried to get it running for more than a
> week and I don't have a clue what else to try...
>
> > On Thu, 2007-01-11 at 15:16 +0100, Tim Benke wrote:
> >
> >> Hi,
> >>
> >> thanks to these guides, I was able to get nutch into eclipse;
> >>
> >> http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
> >> http://wiki.apache.org/nutch/RunNutchInEclipse
> >>
> >> I get the exception:
> >> java.io.IOException: No input directories specified in: Configuration:
> >> defaults: hadoop-default.xml , mapred-default.xml ,
> >> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
> >> hadoop-site.xml
> >>
> >>
> >
>
> Thorsten Scherler wrote:
> > Hmm, not sure but above sounds that you have not
> > "add the folder "conf" to the classpath (scroll down the list and
> > right-click on "conf". This step is necessary)"
>
> I tried that, the same exception is thrown, but some of the INFO-Log
> messages are omitted.
> I suspect the problem has to do with reading or evaluating the
> urls-file. Everything works fine with the same url-file on the commandline;
> file urls/nutch:
> http://lucene.apache.org/nutch/
>
> in Eclipse: urls/nutch contains the url
>
> arguments in eclipse:
> to the program:
> urls -dir crawl -depth 3 -topN 50
>
>
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070111170258
> Generator: filtering: false
> Generator: topN: 50
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>
> commandline:
> $ ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 3
> topN = 50
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: starting
> Generator: segment: crawl/segments/20070111165009
> Generator: Selecting best-scoring urls due for fetch.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: crawl/segments/20070111165009
> Fetcher: threads: 10
> fetching http://lucene.apache.org/nutch/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segment: crawl/segments/20070111165009
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: starting
> ...
>
>
>>> arguments in eclipse:
>>> to the program:
>>> urls -dir crawl -depth 3 -topN 50
>>>
>>> to the vm:
>>> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
>>>
>>> environment variables NUTCH_JAVA_HOME, JAVA_HOME are set.
>>> file urls/nutch:
>>> http://lucene.apache.org/nutch/
>>>
>>> I really hope someone can help me with this, I need nutch for my
>>> bachelor thesis.
>>>
>>> regards,
>>>
>>> Tim Benke
>>>
>>> the complete log is:
>>>
>>> 2007-01-11 14:03:29,831 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:29,940 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:30,003 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(89)) - crawl
>>> started in: crawl
>>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(90)) -
>>> rootUrlDir = urls
>>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(91)) -
>>> threads = 10
>>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(92)) -
>>> depth = 3
>>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(94)) -
>>> topN = 50
>>> 2007-01-11 14:03:30,097 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:30,112 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:30,128 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(135))
>>> - Injector: starting
>>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(136))
>>> - Injector: crawlDb: crawl/crawldb
>>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(137))
>>> - Injector: urlDir: urls
>>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(147))
>>> - Injector: Converting injected urls to crawl db entries.
>>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:30,190 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,472 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:30,487 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,503 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>>> 2007-01-11 14:03:30,518 INFO mapred.JobClient
>>> (JobClient.java:runJob(370)) - Running job: job_qo4f9q
>>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>>> 2007-01-11 14:03:30,565 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:30,643 INFO mapred.MapTask (MapTask.java:run(155)) -
>>> opened part-0.out
>>> 2007-01-11 14:03:30,675 INFO plugin.PluginRepository
>>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>>> mode: [true]
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>>> Plugins (creativecommons)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>>> (query-site)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>>> Plug-in (protocol-httpclient)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>>> (parse-html)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>>> (parse-pdf)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>>> (parse-msexcel)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>>> (parse-js)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>>> (query-url)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>>> (parse-swf)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in
>>> (ontology)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>>> (protocol-ftp)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>>> (analysis-fr)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>>> (parse-mp3)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>>> (parse-zip)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>>> (urlfilter-suffix)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>>> Parser/Indexer/Querier (microformats-reltag)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>>> (parse-rtf)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Language Identification
>>> Parser/Filter (language-identifier)
>>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>>> (parse-msword)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>>> (parse-text)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>>> (analysis-de)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>>> (urlnormalizer-regex)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>>> Parse Plug-in (parse-oo)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>>> (urlfilter-automaton)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>>> Summary Plug-in (summary-lucene)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>>> and query filter (subcollection)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> Framework (lib-regex-filter)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>>> (lib-lucene-analyzers)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>>> (index-basic)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>>> Plug-in (summary-basic)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> (urlfilter-regex)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - HTTP Framework
>>> (lib-http)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>>> (parse-ext)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>>> (protocol-http)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>>> (index-more)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>>> (query-more)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>>> (lib-nekohtml)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>>> (urlfilter-prefix)
>>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>>> Plug-in (parse-mspowerpoint)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>>> (urlnormalizer-basic)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>>> Normalizer (urlnormalizer-pass)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>>> Client (lib-commons-httpclient)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>>> (protocol-file)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>>> To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>>> (query-basic)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>>> Framework (lib-parsems)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>>> (parse-rss)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>>> (scoring-opic)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(320)) - Registered
>>> Extension-Points:
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>>> (org.apache.nutch.searcher.Summarizer)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>>> (org.apache.nutch.net.URLNormalizer)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>>> (org.apache.nutch.net.URLFilter)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>>> (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>>> (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>>> (org.apache.nutch.parse.Parser)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>>> (org.apache.nutch.ontology.Ontology)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>>> (org.apache.nutch.searcher.QueryFilter)
>>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> suffix-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> automaton-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>>> 2007-01-11 14:03:31,456 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> crawl-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>>> 2007-01-11 14:03:31,472 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>>> not found
>>> 2007-01-11 14:03:31,487 WARN regex.RegexURLNormalizer
>>> (RegexURLNormalizer.java:regexNormalize(159)) - can't find rules for
>>> scope 'inject', using default
>>> 2007-01-11 14:03:31,487 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) -
>>> C:/wkspc/nutch_trunk/urls/nutch:0+33
>>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>>> 2007-01-11 14:03:31,518 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:31,534 INFO mapred.JobClient
>>> (JobClient.java:runJob(385)) - map 100% reduce 0%
>>> 2007-01-11 14:03:31,753 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>>> 2007-01-11 14:03:32,534 INFO mapred.JobClient
>>> (JobClient.java:runJob(401)) - Job complete: job_qo4f9q
>>> 2007-01-11 14:03:32,534 INFO crawl.Injector (Injector.java:inject(163))
>>> - Injector: Merging injected urls into crawl db.
>>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,581 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>>> 2007-01-11 14:03:32,628 INFO mapred.JobClient
>>> (JobClient.java:runJob(370)) - Running job: job_xiod9g
>>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,675 INFO mapred.MapTask (MapTask.java:run(155)) -
>>> opened part-0.out
>>> 2007-01-11 14:03:32,675 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) -
>>> C:/tmp/hadoop-tbenke/mapred/temp/inject-temp-2045807797/part-00000:0+82
>>> 2007-01-11 14:03:32,690 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:32,722 INFO plugin.PluginRepository
>>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>>> mode: [true]
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>>> Plugins (creativecommons)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>>> (query-site)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>>> Plug-in (protocol-httpclient)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>>> (parse-html)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>>> (parse-pdf)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>>> (parse-msexcel)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>>> (parse-js)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>>> (query-url)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>>> (parse-swf)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in
>>> (ontology)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>>> (protocol-ftp)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>>> (analysis-fr)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>>> (parse-mp3)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>>> (parse-zip)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>>> (urlfilter-suffix)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>>> Parser/Indexer/Querier (microformats-reltag)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>>> (parse-rtf)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Language Identification
>>> Parser/Filter (language-identifier)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>>> (parse-msword)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>>> (parse-text)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>>> (analysis-de)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>>> (urlnormalizer-regex)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>>> Parse Plug-in (parse-oo)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>>> (urlfilter-automaton)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>>> Summary Plug-in (summary-lucene)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>>> and query filter (subcollection)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> Framework (lib-regex-filter)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>>> (lib-lucene-analyzers)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>>> (index-basic)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>>> Plug-in (summary-basic)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> (urlfilter-regex)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - HTTP Framework
>>> (lib-http)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>>> (parse-ext)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>>> (protocol-http)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>>> (index-more)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>>> (query-more)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>>> (lib-nekohtml)
>>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>>> (urlfilter-prefix)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>>> Plug-in (parse-mspowerpoint)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>>> (urlnormalizer-basic)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>>> Normalizer (urlnormalizer-pass)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>>> Client (lib-commons-httpclient)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>>> (protocol-file)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>>> To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>>> (query-basic)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>>> Framework (lib-parsems)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>>> (parse-rss)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>>> (scoring-opic)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(320)) - Registered
>>> Extension-Points:
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>>> (org.apache.nutch.searcher.Summarizer)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>>> (org.apache.nutch.net.URLNormalizer)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>>> (org.apache.nutch.net.URLFilter)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>>> (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>>> (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>>> (org.apache.nutch.parse.Parser)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>>> (org.apache.nutch.ontology.Ontology)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>>> (org.apache.nutch.searcher.QueryFilter)
>>> 2007-01-11 14:03:33,143 WARN util.NativeCodeLoader
>>> (NativeCodeLoader.java:<clinit>(50)) - Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 2007-01-11 14:03:33,175 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>>> 2007-01-11 14:03:33,628 INFO mapred.JobClient
>>> (JobClient.java:runJob(401)) - Job complete: job_xiod9g
>>> 2007-01-11 14:03:33,659 INFO crawl.Injector (Injector.java:inject(173))
>>> - Injector: done
>>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>>> (Generator.java:generate(371)) - Generator: Selecting best-scoring urls
>>> due for fetch.
>>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>>> (Generator.java:generate(372)) - Generator: starting
>>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>>> (Generator.java:generate(373)) - Generator: segment:
>>> crawl/segments/20070111140334
>>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>>> (Generator.java:generate(374)) - Generator: filtering: false
>>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>>> (Generator.java:generate(376)) - Generator: topN: 50
>>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,675 INFO crawl.Generator
>>> (Generator.java:generate(388)) - Generator: jobtracker is 'local',
>>> generating exactly one partition.
>>> 2007-01-11 14:03:34,706 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>>> 2007-01-11 14:03:34,753 INFO mapred.JobClient
>>> (JobClient.java:runJob(370)) - Running job: job_m7h3ig
>>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>>> 2007-01-11 14:03:34,784 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:34,784 INFO mapred.MapTask (MapTask.java:run(155)) -
>>> opened part-0.out
>>> 2007-01-11 14:03:34,784 INFO plugin.PluginRepository
>>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>>> mode: [true]
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>>> Plugins (creativecommons)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>>> (query-site)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>>> Plug-in (protocol-httpclient)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>>> (parse-html)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>>> (parse-pdf)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>>> (parse-msexcel)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>>> (parse-js)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>>> (query-url)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>>> (parse-swf)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in
>>> (ontology)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>>> (protocol-ftp)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>>> (analysis-fr)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>>> (parse-mp3)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>>> (parse-zip)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>>> (urlfilter-suffix)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>>> Parser/Indexer/Querier (microformats-reltag)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>>> (parse-rtf)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Language Identification
>>> Parser/Filter (language-identifier)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>>> (parse-msword)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>>> (parse-text)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>>> (analysis-de)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>>> (urlnormalizer-regex)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>>> Parse Plug-in (parse-oo)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>>> (urlfilter-automaton)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>>> Summary Plug-in (summary-lucene)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>>> and query filter (subcollection)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> Framework (lib-regex-filter)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>>> (lib-lucene-analyzers)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>>> (index-basic)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>>> Plug-in (summary-basic)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> (urlfilter-regex)
>>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - HTTP Framework
>>> (lib-http)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>>> (parse-ext)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>>> (protocol-http)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>>> (index-more)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>>> (query-more)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>>> (lib-nekohtml)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>>> (urlfilter-prefix)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>>> Plug-in (parse-mspowerpoint)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>>> (urlnormalizer-basic)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>>> Normalizer (urlnormalizer-pass)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>>> Client (lib-commons-httpclient)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>>> (protocol-file)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>>> To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>>> (query-basic)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>>> Framework (lib-parsems)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>>> (parse-rss)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>>> (scoring-opic)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(320)) - Registered
>>> Extension-Points:
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>>> (org.apache.nutch.searcher.Summarizer)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>>> (org.apache.nutch.net.URLNormalizer)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>>> (org.apache.nutch.net.URLFilter)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>>> (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>>> (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>>> (org.apache.nutch.parse.Parser)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>>> (org.apache.nutch.ontology.Ontology)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>>> (org.apache.nutch.searcher.QueryFilter)
>>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> suffix-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> automaton-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> crawl-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>>> not found
>>> 2007-01-11 14:03:35,143 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) -
>>> C:/wkspc/nutch_trunk/crawl/crawldb/current/part-00000/data:0+125
>>> 2007-01-11 14:03:35,159 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,190 INFO plugin.PluginRepository
>>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>>> mode: [true]
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>>> Plugins (creativecommons)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>>> (query-site)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>>> Plug-in (protocol-httpclient)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>>> (parse-html)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>>> (parse-pdf)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>>> (parse-msexcel)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>>> (parse-js)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>>> (query-url)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>>> (parse-swf)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in
>>> (ontology)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>>> (protocol-ftp)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>>> (analysis-fr)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>>> (parse-mp3)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>>> (parse-zip)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>>> (urlfilter-suffix)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>>> Parser/Indexer/Querier (microformats-reltag)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>>> (parse-rtf)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Language Identification
>>> Parser/Filter (language-identifier)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>>> (parse-msword)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>>> (parse-text)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>>> (analysis-de)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>>> (urlnormalizer-regex)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>>> Parse Plug-in (parse-oo)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>>> (urlfilter-automaton)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>>> Summary Plug-in (summary-lucene)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>>> and query filter (subcollection)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> Framework (lib-regex-filter)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>>> (lib-lucene-analyzers)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>>> (index-basic)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>>> Plug-in (summary-basic)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>>> (urlfilter-regex)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - HTTP Framework
>>> (lib-http)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>>> (parse-ext)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>>> (protocol-http)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - the nutch core
>>> extension points (nutch-extensionpoints)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>>> (index-more)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>>> (query-more)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>>> (lib-nekohtml)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>>> (urlfilter-prefix)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>>> Plug-in (parse-mspowerpoint)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>>> (urlnormalizer-basic)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>>> Normalizer (urlnormalizer-pass)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>>> Client (lib-commons-httpclient)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>>> (protocol-file)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>>> To Access Microsoft Format Files (lib-jakarta-poi)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>>> (query-basic)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>>> Framework (lib-parsems)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>>> (parse-rss)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>>> (scoring-opic)
>>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(320)) - Registered
>>> Extension-Points:
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>>> (org.apache.nutch.searcher.Summarizer)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>>> (org.apache.nutch.scoring.ScoringFilter)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>>> (org.apache.nutch.protocol.Protocol)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>>> (org.apache.nutch.net.URLNormalizer)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>>> (org.apache.nutch.net.URLFilter)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>>> (org.apache.nutch.parse.HtmlParseFilter)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>>> (org.apache.nutch.indexer.IndexingFilter)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>>> (org.apache.nutch.parse.Parser)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>>> (org.apache.nutch.ontology.Ontology)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>>> (org.apache.nutch.analysis.NutchAnalyzer)
>>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>>> (org.apache.nutch.searcher.QueryFilter)
>>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> suffix-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> automaton-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>>> crawl-urlfilter.txt at
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>>> not found
>>> 2007-01-11 14:03:35,706 INFO mapred.LocalJobRunner
>>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>>> 2007-01-11 14:03:35,753 INFO mapred.JobClient
>>> (JobClient.java:runJob(401)) - Job complete: job_m7h3ig
>>> 2007-01-11 14:03:35,753 WARN crawl.Generator
>>> (Generator.java:generate(419)) - Generator: 0 records selected for
>>> fetching, exiting ...
>>> 2007-01-11 14:03:35,753 INFO crawl.Crawl (Crawl.java:main(121)) -
>>> Stopping at depth=0 - no more URLs to fetch.
>>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(219)) -
>>> LinkDb: starting
>>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(220)) -
>>> LinkDb: linkdb: crawl/linkdb
>>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(221)) -
>>> LinkDb: URL normalize: true
>>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(222)) -
>>> LinkDb: URL filter: true
>>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>>> (Configuration.java:loadResource(495)) - parsing
>>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>>> 2007-01-11 14:03:35,847 INFO conf.Configuration
>>> (Configuration.java:loadResource(504)) - parsing
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xml
>>> 2007-01-11 14:03:35,847 INFO mapred.JobClient
>>> (JobClient.java:runJob(370)) - Running job: job_kumfin
>>> 2007-01-11 14:03:35,847 WARN mapred.LocalJobRunner
>>> (LocalJobRunner.java:run(147)) - job_kumfin
>>> java.io.IOException: No input directories specified in: Configuration:
>>> defaults: hadoop-default.xml , mapred-default.xml ,
>>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
>>> hadoop-site.xml
>>> at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:99)
>>> at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:39)
>>> at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:119)
>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:93)
>>> Exception in thread "main" java.io.IOException: Job failed!
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
Re: nutch in eclipse, No input directories specified
Posted by Tim Benke <ze...@fusemail.com>.
Guys, please help me with this. I have been trying to get it running for
more than a week and I don't have a clue what else to try...
> On Thu, 2007-01-11 at 15:16 +0100, Tim Benke wrote:
>
>> Hi,
>>
>> thanks to these guides, I was able to get nutch into eclipse;
>> http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
>> http://wiki.apache.org/nutch/RunNutchInEclipse
>>
>> I get the exception:
>> java.io.IOException: No input directories specified in: Configuration:
>> defaults: hadoop-default.xml , mapred-default.xml ,
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
>> hadoop-site.xml
>>
>>
>
Thorsten Scherler wrote:
> Hmm, not sure, but the above sounds as if you have not done this step:
> "add the folder "conf" to the classpath (scroll down the list and
> right-click on "conf". This step is necessary)"
I tried that. The same exception is thrown, but some of the INFO log
messages no longer appear.
I suspect the problem has to do with reading or evaluating the urls
file, because everything works fine with the same file on the command line.
file urls/nutch:
http://lucene.apache.org/nutch/
In Eclipse, urls/nutch contains the same URL.
arguments in Eclipse, to the program:
urls -dir crawl -depth 3 -topN 50
output in Eclipse:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070111170258
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
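One possible reading of the Eclipse output above, stated as an assumption rather than a confirmed diagnosis: the Generator selected 0 records, so no segment was written under crawl/segments, and the LinkDb step then had no input directories to read. The sketch below uses only the Java standard library (no Nutch or Hadoop calls) to check whether a segments directory contains any segment. The class name SegmentCheck and the default path crawl/segments are illustrative choices, not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the segment directories a crawl has produced. */
final class SegmentCheck {

    /** Returns the sub-directories of segmentsDir, or an empty list if it does not exist. */
    static List<Path> listSegments(Path segmentsDir) {
        List<Path> segments = new ArrayList<>();
        if (!Files.isDirectory(segmentsDir)) {
            return segments; // nothing was ever generated
        }
        try (DirectoryStream<Path> children = Files.newDirectoryStream(segmentsDir)) {
            for (Path child : children) {
                if (Files.isDirectory(child)) {
                    segments.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Assumed default: the "crawl" directory passed via -dir, resolved
        // against the working directory of the launch configuration.
        Path segmentsDir = Paths.get(args.length > 0 ? args[0] : "crawl/segments");
        List<Path> segments = listSegments(segmentsDir);
        if (segments.isEmpty()) {
            System.out.println("No segments under " + segmentsDir.toAbsolutePath()
                    + "; a later step that reads segments would have no input.");
        } else {
            System.out.println("Found segments: " + segments);
        }
    }
}
```

Running this from the same working directory as the Eclipse launch configuration also shows which absolute path the relative name crawl resolves to there.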
command line:
$ ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl/segments/20070111165009
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070111165009
Fetcher: threads: 10
fetching http://lucene.apache.org/nutch/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segment: crawl/segments/20070111165009
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
...
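Since the Generator selects 0 records only in Eclipse, one thing worth ruling out is that the Eclipse run reads a different crawl-urlfilter.txt than the command line does (the log shows it being loaded from tmpBuild). The sketch below is a simplified stand-in for a regex URL filter, written with java.util.regex only. It assumes that rules are lines starting with '+' (accept) or '-' (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The rule lines are modeled on a template with an unedited MY.DOMAIN.NAME placeholder; they are assumptions, not a copy of the file used in this setup, and UrlFilterCheck is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified stand-in for a '+'/'-' regex URL filter; not the Nutch implementation. */
final class UrlFilterCheck {

    /** One rule: accept or reject URLs in which the pattern is found. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses rule lines, skipping blank lines and '#' comments. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    /** First matching rule decides; a URL that matches no rule is rejected. */
    static boolean accepts(List<String> ruleLines, String url) {
        for (Rule rule : parse(ruleLines)) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String seed = "http://lucene.apache.org/nutch/";
        // Assumed rules with the placeholder domain left unchanged.
        List<String> placeholderRules = Arrays.asList(
                "-^(file|ftp|mailto):",
                "-[?*!@=]",
                "+^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/",
                "-.");
        // The same assumed rules with the placeholder replaced by apache.org.
        List<String> editedRules = Arrays.asList(
                "-^(file|ftp|mailto):",
                "-[?*!@=]",
                "+^http://([a-z0-9]*\\.)*apache.org/",
                "-.");
        System.out.println("placeholder rules accept seed: " + accepts(placeholderRules, seed));
        System.out.println("edited rules accept seed: " + accepts(editedRules, seed));
    }
}
```

With these assumed rules, the placeholder set rejects the seed URL and the edited set accepts it. If the file that Eclipse loads still has the placeholder, that could explain why the injected URL never reaches the Generator there, but this remains a guess until the two files are compared.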
>> arguments in eclipse:
>> to the program:
>> urls -dir crawl -depth 3 -topN 50
>>
>> to the vm:
>> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
>>
>> environment variables NUTCH_JAVA_HOME, JAVA_HOME are set.
>> file urls/nutch:
>> http://lucene.apache.org/nutch/
>>
>> I really hope someone can help me with this, I need nutch for my
>> bachelor thesis.
>>
>> regards,
>>
>> Tim Benke
>>
>> the complete log is:
>>
>> 2007-01-11 14:03:29,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:29,940 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,003 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(89)) - crawl
>> started in: crawl
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(90)) -
>> rootUrlDir = urls
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(91)) -
>> threads = 10
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(92)) - depth = 3
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(94)) - topN = 50
>> 2007-01-11 14:03:30,097 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,112 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,128 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(135))
>> - Injector: starting
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(136))
>> - Injector: crawlDb: crawl/crawldb
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(137))
>> - Injector: urlDir: urls
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(147))
>> - Injector: Converting injected urls to crawl db entries.
>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,190 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,472 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,487 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,503 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:30,518 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_qo4f9q
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:30,565 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,643 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:30,675 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:31,456 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:31,472 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:31,487 WARN regex.RegexURLNormalizer
>> (RegexURLNormalizer.java:regexNormalize(159)) - can't find rules for
>> scope 'inject', using default
>> 2007-01-11 14:03:31,487 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - C:/wkspc/nutch_trunk/urls/nutch:0+33
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:31,518 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:31,534 INFO mapred.JobClient
>> (JobClient.java:runJob(385)) - map 100% reduce 0%
>> 2007-01-11 14:03:31,753 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:32,534 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_qo4f9q
>> 2007-01-11 14:03:32,534 INFO crawl.Injector (Injector.java:inject(163))
>> - Injector: Merging injected urls into crawl db.
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,581 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,628 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_xiod9g
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,675 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:32,675 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) -
>> C:/tmp/hadoop-tbenke/mapred/temp/inject-temp-2045807797/part-00000:0+82
>> 2007-01-11 14:03:32,690 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,722 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:33,143 WARN util.NativeCodeLoader
>> (NativeCodeLoader.java:<clinit>(50)) - Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 2007-01-11 14:03:33,175 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:33,628 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_xiod9g
>> 2007-01-11 14:03:33,659 INFO crawl.Injector (Injector.java:inject(173))
>> - Injector: done
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(371)) - Generator: Selecting best-scoring urls
>> due for fetch.
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(372)) - Generator: starting
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(373)) - Generator: segment:
>> crawl/segments/20070111140334
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(374)) - Generator: filtering: false
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(376)) - Generator: topN: 50
>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,675 INFO crawl.Generator
>> (Generator.java:generate(388)) - Generator: jobtracker is 'local',
>> generating exactly one partition.
>> 2007-01-11 14:03:34,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:34,753 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_m7h3ig
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:34,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,784 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:34,784 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:35,143 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) -
>> C:/wkspc/nutch_trunk/crawl/crawldb/current/part-00000/data:0+125
>> 2007-01-11 14:03:35,159 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,190 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:35,706 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:35,753 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_m7h3ig
>> 2007-01-11 14:03:35,753 WARN crawl.Generator
>> (Generator.java:generate(419)) - Generator: 0 records selected for
>> fetching, exiting ...
>> 2007-01-11 14:03:35,753 INFO crawl.Crawl (Crawl.java:main(121)) -
>> Stopping at depth=0 - no more URLs to fetch.
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(219)) -
>> LinkDb: starting
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(220)) -
>> LinkDb: linkdb: crawl/linkdb
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(221)) -
>> LinkDb: URL normalize: true
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(222)) -
>> LinkDb: URL filter: true
>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,847 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xml
>> 2007-01-11 14:03:35,847 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_kumfin
>> 2007-01-11 14:03:35,847 WARN mapred.LocalJobRunner
>> (LocalJobRunner.java:run(147)) - job_kumfin
>> java.io.IOException: No input directories specified in: Configuration:
>> defaults: hadoop-default.xml , mapred-default.xml ,
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
>> hadoop-site.xml
>> at
>> org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:99)
>> at
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:39)
>> at
>> org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:119)
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:93)
>> Exception in thread "main" java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>>
>>
>
>
Re: nutch in eclipse, No input directories specified
Posted by Tim Benke <ze...@fusemail.com>.
Thorsten Scherler wrote:
> On Thu, 2007-01-11 at 15:16 +0100, Tim Benke wrote:
>
>> Hi,
>>
>> thanks to these guides, I was able to get nutch into eclipse;
>> http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
>> http://wiki.apache.org/nutch/RunNutchInEclipse
>>
>> I get the exception:
>> java.io.IOException: No input directories specified in: Configuration:
>> defaults: hadoop-default.xml , mapred-default.xml ,
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
>> hadoop-site.xml
>>
>>
>
> Hmm, not sure, but the above sounds as if you have not done this step:
> "add the folder "conf" to the classpath (scroll down the list and
> right-click on "conf". This step is necessary)"
>
> HTH
> salu2
>
>
I tried that; the same exception is thrown, but some of the INFO log
messages are omitted.
I suspect the problem has to do with the urls file, because everything
works fine with the same urls file on the command line.
in Eclipse (urls/nutch contains the URL):
arguments to the program:
urls -dir crawl -depth 3 -topN 50
output:
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070111170258
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
command line:
$ ./bin/nutch crawl urls -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: crawl/segments/20070111165009
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070111165009
Fetcher: threads: 10
fetching http://lucene.apache.org/nutch/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segment: crawl/segments/20070111165009
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
...
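Comparing the two runs above: the Eclipse run stops at "Generator: 0 records selected for fetching", so presumably no segment is created and the LinkDb job has no input to read. One possible cause (an assumption, not confirmed in this thread) is that the seed URL is rejected by the URL filter rules found on the Eclipse classpath, for example a crawl-urlfilter.txt that still contains a placeholder domain. The Java sketch below shows how first-match-wins "+"/"-" regex rules of that kind would treat the seed. The class UrlFilterSketch, its methods, and both rule sets are hypothetical and written only for illustration; this is not Nutch's own filter code and it does not read Nutch's configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    /** One rule: '+' accepts, '-' rejects any URL in which the regex is found. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses lines such as "+^http://..." or "-."; blank lines and '#' comments are skipped. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue;
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("bad rule: " + t);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
        return rules;
    }

    /** The first matching rule decides; a URL that matches no rule is rejected. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String seed = "http://lucene.apache.org/nutch/";
        // Hypothetical rule set with an unedited placeholder domain.
        List<Rule> placeholder = parse(Arrays.asList(
                "+^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/", "-."));
        // Hypothetical rule set edited to allow apache.org hosts.
        List<Rule> edited = parse(Arrays.asList(
                "+^http://([a-z0-9]*\\.)*apache.org/", "-."));
        System.out.println("placeholder rules accept seed: " + accepts(placeholder, seed)); // false
        System.out.println("edited rules accept seed:      " + accepts(edited, seed));      // true
    }
}
```

If rules like the placeholder set are what the Eclipse run picks up, the seed would be dropped before it reaches the crawl db, so it may be worth comparing the crawl-urlfilter.txt on the Eclipse classpath with the one the command-line run uses.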
>> arguments in eclipse:
>> to the program:
>> urls -dir crawl -depth 3 -topN 50
>>
>> to the vm:
>> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
>>
>> environment variables NUTCH_JAVA_HOME, JAVA_HOME are set.
>> file urls/nutch:
>> http://lucene.apache.org/nutch/
>>
>> I really hope someone can help me with this, I need nutch for my
>> bachelor thesis.
>>
>> regards,
>>
>> Tim Benke
>>
>> the complete log is:
>>
>> 2007-01-11 14:03:29,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:29,940 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,003 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,018 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(89)) - crawl
>> started in: crawl
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(90)) -
>> rootUrlDir = urls
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(91)) -
>> threads = 10
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(92)) - depth = 3
>> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(94)) - topN = 50
>> 2007-01-11 14:03:30,097 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,112 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,128 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(135))
>> - Injector: starting
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(136))
>> - Injector: crawlDb: crawl/crawldb
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(137))
>> - Injector: urlDir: urls
>> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(147))
>> - Injector: Converting injected urls to crawl db entries.
>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,190 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,206 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,425 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:30,440 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,456 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,472 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,487 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,503 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:30,518 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_qo4f9q
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,534 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:30,565 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:30,643 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:30,675 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:31,065 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:31,456 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:31,472 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:31,487 WARN regex.RegexURLNormalizer
>> (RegexURLNormalizer.java:regexNormalize(159)) - can't find rules for
>> scope 'inject', using default
>> 2007-01-11 14:03:31,487 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - C:/wkspc/nutch_trunk/urls/nutch:0+33
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:31,503 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
>> 2007-01-11 14:03:31,518 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:31,534 INFO mapred.JobClient
>> (JobClient.java:runJob(385)) - map 100% reduce 0%
>> 2007-01-11 14:03:31,753 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:32,534 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_qo4f9q
>> 2007-01-11 14:03:32,534 INFO crawl.Injector (Injector.java:inject(163))
>> - Injector: Merging injected urls into crawl db.
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:32,534 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,550 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,581 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:32,597 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,612 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,628 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,628 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_xiod9g
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,643 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,675 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:32,675 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) -
>> C:/tmp/hadoop-tbenke/mapred/temp/inject-temp-2045807797/part-00000:0+82
>> 2007-01-11 14:03:32,690 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
>> 2007-01-11 14:03:32,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:32,722 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:33,143 WARN util.NativeCodeLoader
>> (NativeCodeLoader.java:<clinit>(50)) - Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 2007-01-11 14:03:33,175 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:33,628 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_xiod9g
>> 2007-01-11 14:03:33,659 INFO crawl.Injector (Injector.java:inject(173))
>> - Injector: done
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(371)) - Generator: Selecting best-scoring urls
>> due for fetch.
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(372)) - Generator: starting
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(373)) - Generator: segment:
>> crawl/segments/20070111140334
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(374)) - Generator: filtering: false
>> 2007-01-11 14:03:34,659 INFO crawl.Generator
>> (Generator.java:generate(376)) - Generator: topN: 50
>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,659 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,675 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,675 INFO crawl.Generator
>> (Generator.java:generate(388)) - Generator: jobtracker is 'local',
>> generating exactly one partition.
>> 2007-01-11 14:03:34,706 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:34,722 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,737 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:34,753 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_m7h3ig
>> 2007-01-11 14:03:34,753 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,768 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:34,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:34,784 INFO mapred.MapTask (MapTask.java:run(155)) -
>> opened part-0.out
>> 2007-01-11 14:03:34,784 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:35,018 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:35,128 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:35,143 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) -
>> C:/wkspc/nutch_trunk/crawl/crawldb/current/part-00000/data:0+125
>> 2007-01-11 14:03:35,159 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
>> 2007-01-11 14:03:35,175 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,190 INFO plugin.PluginRepository
>> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
>> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
>> mode: [true]
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Creative Commons
>> Plugins (creativecommons)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Site Query Filter
>> (query-site)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
>> Plug-in (protocol-httpclient)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
>> (parse-html)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
>> (parse-pdf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
>> (parse-msexcel)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
>> (parse-js)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - URL Query Filter
>> (query-url)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
>> (parse-swf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
>> (protocol-ftp)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
>> (analysis-fr)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
>> (parse-mp3)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
>> (parse-zip)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Online Search Results
>> Clustering using Carrot2's Lingo component (clustering-carrot2)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
>> (urlfilter-suffix)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
>> Parser/Indexer/Querier (microformats-reltag)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
>> (parse-rtf)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Language Identification
>> Parser/Filter (language-identifier)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
>> (parse-msword)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
>> (parse-text)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
>> (analysis-de)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
>> (urlnormalizer-regex)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
>> Parse Plug-in (parse-oo)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
>> (urlfilter-automaton)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
>> Summary Plug-in (summary-lucene)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
>> and query filter (subcollection)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> Framework (lib-regex-filter)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
>> (lib-lucene-analyzers)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
>> (index-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
>> Plug-in (summary-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
>> (urlfilter-regex)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
>> (parse-ext)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
>> (protocol-http)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - the nutch core
>> extension points (nutch-extensionpoints)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
>> (index-more)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - More Query Filter
>> (query-more)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
>> (lib-nekohtml)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
>> (urlfilter-prefix)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
>> Plug-in (parse-mspowerpoint)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
>> (urlnormalizer-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Pass-through URL
>> Normalizer (urlnormalizer-pass)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
>> Client (lib-commons-httpclient)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
>> (protocol-file)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
>> To Access Microsoft Format Files (lib-jakarta-poi)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
>> (query-basic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
>> Framework (lib-parsems)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
>> (parse-rss)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
>> (scoring-opic)
>> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
>> (org.apache.nutch.searcher.Summarizer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
>> (org.apache.nutch.scoring.ScoringFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
>> (org.apache.nutch.protocol.Protocol)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
>> (org.apache.nutch.net.URLNormalizer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
>> (org.apache.nutch.net.URLFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
>> (org.apache.nutch.parse.HtmlParseFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
>> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
>> (org.apache.nutch.indexer.IndexingFilter)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
>> (org.apache.nutch.parse.Parser)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
>> (org.apache.nutch.ontology.Ontology)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
>> (org.apache.nutch.analysis.NutchAnalyzer)
>> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
>> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
>> (org.apache.nutch.searcher.QueryFilter)
>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> suffix-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
>> 2007-01-11 14:03:35,409 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> automaton-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(441)) - found resource
>> crawl-urlfilter.txt at
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
>> 2007-01-11 14:03:35,519 INFO conf.Configuration
>> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
>> not found
>> 2007-01-11 14:03:35,706 INFO mapred.LocalJobRunner
>> (LocalJobRunner.java:progress(169)) - reduce > reduce
>> 2007-01-11 14:03:35,753 INFO mapred.JobClient
>> (JobClient.java:runJob(401)) - Job complete: job_m7h3ig
>> 2007-01-11 14:03:35,753 WARN crawl.Generator
>> (Generator.java:generate(419)) - Generator: 0 records selected for
>> fetching, exiting ...
>> 2007-01-11 14:03:35,753 INFO crawl.Crawl (Crawl.java:main(121)) -
>> Stopping at depth=0 - no more URLs to fetch.
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(219)) -
>> LinkDb: starting
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(220)) -
>> LinkDb: linkdb: crawl/linkdb
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(221)) -
>> LinkDb: URL normalize: true
>> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(222)) -
>> LinkDb: URL filter: true
>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,769 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,784 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,800 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,815 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
>> 2007-01-11 14:03:35,831 INFO conf.Configuration
>> (Configuration.java:loadResource(495)) - parsing
>> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
>> 2007-01-11 14:03:35,847 INFO conf.Configuration
>> (Configuration.java:loadResource(504)) - parsing
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xml
>> 2007-01-11 14:03:35,847 INFO mapred.JobClient
>> (JobClient.java:runJob(370)) - Running job: job_kumfin
>> 2007-01-11 14:03:35,847 WARN mapred.LocalJobRunner
>> (LocalJobRunner.java:run(147)) - job_kumfin
>> java.io.IOException: No input directories specified in: Configuration:
>> defaults: hadoop-default.xml , mapred-default.xml ,
>> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
>> hadoop-site.xml
>> at
>> org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:99)
>> at
>> org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:39)
>> at
>> org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:119)
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:93)
>> Exception in thread "main" java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
>> at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>>
>>
>
>
Re: nutch in eclipse, No input directories specified
Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2007-01-11 at 15:16 +0100, Tim Benke wrote:
> Hi,
>
> thanks to these guides, I was able to get nutch into eclipse;
> http://wiki.media-style.com/display/nutchDocu/use+eclipse+to+debug+nutch
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> I get the exception:
> java.io.IOException: No input directories specified in: Configuration:
> defaults: hadoop-default.xml , mapred-default.xml ,
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
> hadoop-site.xml
>
Hmm, I'm not sure, but the above sounds like you have not done this step
from the guide:
"add the folder "conf" to the classpath (scroll down the list and
right-click on "conf". This step is necessary)"
HTH
salu2
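
One way to test the classpath suggestion above is to ask the class loader directly whether it can see the configuration files. The following is only a minimal sketch, not part of Nutch or Hadoop: the class name ClasspathCheck and its methods are made up for illustration, and the resource names are simply the ones that appear in the quoted log. Run it with the same Eclipse launch configuration (same classpath) as the crawl.

```java
import java.net.URL;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Nutch or Hadoop class): reports which
// configuration resources are visible on the current classpath.
class ClasspathCheck {

    /** Returns the URL of the named resource, or null if the class loader cannot see it. */
    static URL locate(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the system class loader if no context loader is set.
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(name);
    }

    /** Looks up each name and keeps the results in the order given. */
    static Map<String, URL> report(List<String> names) {
        Map<String, URL> found = new LinkedHashMap<>();
        for (String name : names) {
            found.put(name, locate(name));
        }
        return found;
    }

    public static void main(String[] args) {
        // Resource names taken from the log in this thread.
        List<String> names = Arrays.asList(
                "nutch-default.xml",
                "crawl-tool.xml",
                "crawl-urlfilter.txt",
                "hadoop-site.xml");
        for (Map.Entry<String, URL> e : report(names).entrySet()) {
            String where = (e.getValue() == null)
                    ? "NOT FOUND on classpath"
                    : e.getValue().toString();
            System.out.println(e.getKey() + " -> " + where);
        }
    }
}
```

If a file is reported as not found, or is found in a directory other than the one you edited, the "conf" folder is probably not on the classpath the way the guide intends.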
> arguments in eclipse:
> to the program:
> urls -dir crawl -depth 3 -topN 50
>
> to the vm:
> -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
>
> environment variables NUTCH_JAVA_HOME, JAVA_HOME are set.
> file urls/nutch:
> http://lucene.apache.org/nutch/
>
> I really hope someone can help me with this, I need nutch for my
> bachelor thesis.
>
> regards,
>
> Tim Benke
>
> the complete log is:
>
> 2007-01-11 14:03:29,831 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:29,940 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:30,003 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:30,018 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,018 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(89)) - crawl
> started in: crawl
> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(90)) -
> rootUrlDir = urls
> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(91)) -
> threads = 10
> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(92)) - depth = 3
> 2007-01-11 14:03:30,034 INFO crawl.Crawl (Crawl.java:main(94)) - topN = 50
> 2007-01-11 14:03:30,097 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:30,112 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:30,128 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(135))
> - Injector: starting
> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(136))
> - Injector: crawlDb: crawl/crawldb
> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(137))
> - Injector: urlDir: urls
> 2007-01-11 14:03:30,159 INFO crawl.Injector (Injector.java:inject(147))
> - Injector: Converting injected urls to crawl db entries.
> 2007-01-11 14:03:30,175 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:30,175 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:30,190 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:30,206 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,206 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,425 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:30,425 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:30,440 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:30,440 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,456 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,456 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,472 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:30,487 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,503 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
> 2007-01-11 14:03:30,518 INFO mapred.JobClient
> (JobClient.java:runJob(370)) - Running job: job_qo4f9q
> 2007-01-11 14:03:30,534 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:30,534 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,534 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
> 2007-01-11 14:03:30,565 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:30,643 INFO mapred.MapTask (MapTask.java:run(155)) -
> opened part-0.out
> 2007-01-11 14:03:30,675 INFO plugin.PluginRepository
> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
> mode: [true]
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Creative Commons
> Plugins (creativecommons)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Site Query Filter
> (query-site)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
> Plug-in (protocol-httpclient)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
> (parse-html)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
> (parse-pdf)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
> (parse-msexcel)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
> (parse-js)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - URL Query Filter
> (query-url)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
> (parse-swf)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
> (protocol-ftp)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
> (analysis-fr)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
> (parse-mp3)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
> (parse-zip)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Online Search Results
> Clustering using Carrot2's Lingo component (clustering-carrot2)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
> (urlfilter-suffix)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
> (parse-rtf)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Language Identification
> Parser/Filter (language-identifier)
> 2007-01-11 14:03:30,987 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
> (parse-msword)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
> (parse-text)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
> (analysis-de)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
> (urlnormalizer-regex)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
> Parse Plug-in (parse-oo)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
> (urlfilter-automaton)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
> Summary Plug-in (summary-lucene)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
> and query filter (subcollection)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> Framework (lib-regex-filter)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
> (lib-lucene-analyzers)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
> (index-basic)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
> Plug-in (summary-basic)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> (urlfilter-regex)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
> (parse-ext)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
> (protocol-http)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - the nutch core
> extension points (nutch-extensionpoints)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
> (index-more)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Query Filter
> (query-more)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
> (lib-nekohtml)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
> (urlfilter-prefix)
> 2007-01-11 14:03:31,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
> Plug-in (parse-mspowerpoint)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
> (urlnormalizer-basic)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pass-through URL
> Normalizer (urlnormalizer-pass)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
> Client (lib-commons-httpclient)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
> (protocol-file)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
> To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
> (query-basic)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
> Framework (lib-parsems)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
> (parse-rss)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
> (scoring-opic)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
> (org.apache.nutch.ontology.Ontology)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-01-11 14:03:31,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
> (org.apache.nutch.searcher.QueryFilter)
> 2007-01-11 14:03:31,065 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> suffix-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
> 2007-01-11 14:03:31,065 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> automaton-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
> 2007-01-11 14:03:31,456 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> crawl-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
> 2007-01-11 14:03:31,472 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
> not found
> 2007-01-11 14:03:31,487 WARN regex.RegexURLNormalizer
> (RegexURLNormalizer.java:regexNormalize(159)) - can't find rules for
> scope 'inject', using default
> 2007-01-11 14:03:31,487 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) - C:/wkspc/nutch_trunk/urls/nutch:0+33
> 2007-01-11 14:03:31,503 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:31,503 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:31,503 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_qo4f9q.xml
> 2007-01-11 14:03:31,518 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:31,534 INFO mapred.JobClient
> (JobClient.java:runJob(385)) - map 100% reduce 0%
> 2007-01-11 14:03:31,753 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) - reduce > reduce
> 2007-01-11 14:03:32,534 INFO mapred.JobClient
> (JobClient.java:runJob(401)) - Job complete: job_qo4f9q
> 2007-01-11 14:03:32,534 INFO crawl.Injector (Injector.java:inject(163))
> - Injector: Merging injected urls into crawl db.
> 2007-01-11 14:03:32,534 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:32,534 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:32,534 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:32,550 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,550 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,581 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:32,597 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:32,597 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:32,597 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,612 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,612 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,628 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:32,628 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,628 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
> 2007-01-11 14:03:32,628 INFO mapred.JobClient
> (JobClient.java:runJob(370)) - Running job: job_xiod9g
> 2007-01-11 14:03:32,643 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:32,643 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,643 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
> 2007-01-11 14:03:32,643 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,675 INFO mapred.MapTask (MapTask.java:run(155)) -
> opened part-0.out
> 2007-01-11 14:03:32,675 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) -
> C:/tmp/hadoop-tbenke/mapred/temp/inject-temp-2045807797/part-00000:0+82
> 2007-01-11 14:03:32,690 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:32,706 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,706 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_xiod9g.xml
> 2007-01-11 14:03:32,706 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:32,722 INFO plugin.PluginRepository
> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
> mode: [true]
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Creative Commons
> Plugins (creativecommons)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Site Query Filter
> (query-site)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
> Plug-in (protocol-httpclient)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
> (parse-html)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
> (parse-pdf)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
> (parse-msexcel)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
> (parse-js)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - URL Query Filter
> (query-url)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
> (parse-swf)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
> (protocol-ftp)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
> (analysis-fr)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
> (parse-mp3)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
> (parse-zip)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Online Search Results
> Clustering using Carrot2's Lingo component (clustering-carrot2)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
> (urlfilter-suffix)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
> (parse-rtf)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Language Identification
> Parser/Filter (language-identifier)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
> (parse-msword)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
> (parse-text)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
> (analysis-de)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
> (urlnormalizer-regex)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
> Parse Plug-in (parse-oo)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
> (urlfilter-automaton)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
> Summary Plug-in (summary-lucene)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
> and query filter (subcollection)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> Framework (lib-regex-filter)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
> (lib-lucene-analyzers)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
> (index-basic)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
> Plug-in (summary-basic)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> (urlfilter-regex)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
> (parse-ext)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
> (protocol-http)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - the nutch core
> extension points (nutch-extensionpoints)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
> (index-more)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Query Filter
> (query-more)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
> (lib-nekohtml)
> 2007-01-11 14:03:33,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
> (urlfilter-prefix)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
> Plug-in (parse-mspowerpoint)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
> (urlnormalizer-basic)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pass-through URL
> Normalizer (urlnormalizer-pass)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
> Client (lib-commons-httpclient)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
> (protocol-file)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
> To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
> (query-basic)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
> Framework (lib-parsems)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
> (parse-rss)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
> (scoring-opic)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
> (org.apache.nutch.ontology.Ontology)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-01-11 14:03:33,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
> (org.apache.nutch.searcher.QueryFilter)
> 2007-01-11 14:03:33,143 WARN util.NativeCodeLoader
> (NativeCodeLoader.java:<clinit>(50)) - Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 2007-01-11 14:03:33,175 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) - reduce > reduce
> 2007-01-11 14:03:33,628 INFO mapred.JobClient
> (JobClient.java:runJob(401)) - Job complete: job_xiod9g
> 2007-01-11 14:03:33,659 INFO crawl.Injector (Injector.java:inject(173))
> - Injector: done
> 2007-01-11 14:03:34,659 INFO crawl.Generator
> (Generator.java:generate(371)) - Generator: Selecting best-scoring urls
> due for fetch.
> 2007-01-11 14:03:34,659 INFO crawl.Generator
> (Generator.java:generate(372)) - Generator: starting
> 2007-01-11 14:03:34,659 INFO crawl.Generator
> (Generator.java:generate(373)) - Generator: segment:
> crawl/segments/20070111140334
> 2007-01-11 14:03:34,659 INFO crawl.Generator
> (Generator.java:generate(374)) - Generator: filtering: false
> 2007-01-11 14:03:34,659 INFO crawl.Generator
> (Generator.java:generate(376)) - Generator: topN: 50
> 2007-01-11 14:03:34,659 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:34,659 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:34,675 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:34,675 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,675 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,675 INFO crawl.Generator
> (Generator.java:generate(388)) - Generator: jobtracker is 'local',
> generating exactly one partition.
> 2007-01-11 14:03:34,706 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:34,722 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:34,722 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:34,737 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,737 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,737 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,737 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:34,753 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,753 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
> 2007-01-11 14:03:34,753 INFO mapred.JobClient
> (JobClient.java:runJob(370)) - Running job: job_m7h3ig
> 2007-01-11 14:03:34,753 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:34,768 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,768 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
> 2007-01-11 14:03:34,784 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:34,784 INFO mapred.MapTask (MapTask.java:run(155)) -
> opened part-0.out
> 2007-01-11 14:03:34,784 INFO plugin.PluginRepository
> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
> mode: [true]
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Creative Commons
> Plugins (creativecommons)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Site Query Filter
> (query-site)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
> Plug-in (protocol-httpclient)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
> (parse-html)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
> (parse-pdf)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
> (parse-msexcel)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
> (parse-js)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - URL Query Filter
> (query-url)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
> (parse-swf)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
> (protocol-ftp)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
> (analysis-fr)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
> (parse-mp3)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
> (parse-zip)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Online Search Results
> Clustering using Carrot2's Lingo component (clustering-carrot2)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
> (urlfilter-suffix)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
> (parse-rtf)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Language Identification
> Parser/Filter (language-identifier)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
> (parse-msword)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
> (parse-text)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
> (analysis-de)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
> (urlnormalizer-regex)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
> Parse Plug-in (parse-oo)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
> (urlfilter-automaton)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
> Summary Plug-in (summary-lucene)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
> and query filter (subcollection)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> Framework (lib-regex-filter)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
> (lib-lucene-analyzers)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
> (index-basic)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
> Plug-in (summary-basic)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> (urlfilter-regex)
> 2007-01-11 14:03:35,003 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
> (parse-ext)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
> (protocol-http)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - the nutch core
> extension points (nutch-extensionpoints)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
> (index-more)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Query Filter
> (query-more)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
> (lib-nekohtml)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
> (urlfilter-prefix)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
> Plug-in (parse-mspowerpoint)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
> (urlnormalizer-basic)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pass-through URL
> Normalizer (urlnormalizer-pass)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
> Client (lib-commons-httpclient)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
> (protocol-file)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
> To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
> (query-basic)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
> Framework (lib-parsems)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
> (parse-rss)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
> (scoring-opic)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
> (org.apache.nutch.ontology.Ontology)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-01-11 14:03:35,018 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
> (org.apache.nutch.searcher.QueryFilter)
> 2007-01-11 14:03:35,018 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> suffix-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
> 2007-01-11 14:03:35,018 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> automaton-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
> 2007-01-11 14:03:35,128 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> crawl-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
> 2007-01-11 14:03:35,128 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
> not found
> 2007-01-11 14:03:35,143 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) -
> C:/wkspc/nutch_trunk/crawl/crawldb/current/part-00000/data:0+125
> 2007-01-11 14:03:35,159 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:35,175 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,175 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_m7h3ig.xml
> 2007-01-11 14:03:35,175 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,190 INFO plugin.PluginRepository
> (PluginManifestParser.java:parsePluginFolder(86)) - Plugins: looking in:
> C:\wkspc\nutch_trunk\tmpBuild\src\plugin
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(309)) - Plugin Auto-activation
> mode: [true]
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(310)) - Registered Plugins:
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Creative Commons
> Plugins (creativecommons)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Site Query Filter
> (query-site)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http / Https Protocol
> Plug-in (protocol-httpclient)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Html Parse Plug-in
> (parse-html)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pdf Parse Plug-in
> (parse-pdf)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSExcel Parse Plug-in
> (parse-msexcel)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - JavaScript Parser
> (parse-js)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - URL Query Filter
> (query-url)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - SWF Parse Plug-in
> (parse-swf)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Log4j (lib-log4j)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ontology Plug-in (ontology)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Ftp Protocol Plug-in
> (protocol-ftp)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - French Analysis Plug-in
> (analysis-fr)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MP3 Parse Plug-in
> (parse-mp3)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Zip Parse Plug-in
> (parse-zip)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Online Search Results
> Clustering using Carrot2's Lingo component (clustering-carrot2)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Suffix URL Filter
> (urlfilter-suffix)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Rel-Tag microformat
> Parser/Indexer/Querier (microformats-reltag)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RTF Parse Plug-in
> (parse-rtf)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Language Identification
> Parser/Filter (language-identifier)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSWord Parse Plug-in
> (parse-msword)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Text Parse Plug-in
> (parse-text)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - German Analysis Plug-in
> (analysis-de)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Normalizer
> (urlnormalizer-regex)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OpenOffice/OpenDocument
> Parse Plug-in (parse-oo)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Automaton URL Filter
> (urlfilter-automaton)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Highlighter
> Summary Plug-in (summary-lucene)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Subcollection indexing
> and query filter (subcollection)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> Framework (lib-regex-filter)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Lucene Analysers
> (lib-lucene-analyzers)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Indexing Filter
> (index-basic)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Summarizer
> Plug-in (summary-basic)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Regex URL Filter
> (urlfilter-regex)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - HTTP Framework (lib-http)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - External Parser Plug-in
> (parse-ext)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Http Protocol Plug-in
> (protocol-http)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - the nutch core
> extension points (nutch-extensionpoints)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Indexing Filter
> (index-more)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - More Query Filter
> (query-more)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - CyberNeko HTML Parser
> (lib-nekohtml)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Prefix URL Filter
> (urlfilter-prefix)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - MSPowerPoint Parse
> Plug-in (parse-mspowerpoint)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic URL Normalizer
> (urlnormalizer-basic)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Pass-through URL
> Normalizer (urlnormalizer-pass)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta Commons HTTP
> Client (lib-commons-httpclient)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - File Protocol Plug-in
> (protocol-file)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Jakarta POI - Java API
> To Access Microsoft Format Files (lib-jakarta-poi)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Basic Query Filter
> (query-basic)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - XML Libraries (lib-xml)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - Parse MS Documents
> Framework (lib-parsems)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - RSS Parse Plug-in
> (parse-rss)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(316)) - OPIC Scoring Plug-in
> (scoring-opic)
> 2007-01-11 14:03:35,394 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(320)) - Registered Extension-Points:
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Summarizer
> (org.apache.nutch.searcher.Summarizer)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Online Search
> Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Ontology Model Loader
> (org.apache.nutch.ontology.Ontology)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Analysis
> (org.apache.nutch.analysis.NutchAnalyzer)
> 2007-01-11 14:03:35,409 INFO plugin.PluginRepository
> (PluginRepository.java:displayStatus(325)) - Nutch Query Filter
> (org.apache.nutch.searcher.QueryFilter)
> 2007-01-11 14:03:35,409 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> suffix-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/suffix-urlfilter.txt
> 2007-01-11 14:03:35,409 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> automaton-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/automaton-urlfilter.txt
> 2007-01-11 14:03:35,519 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(441)) - found resource
> crawl-urlfilter.txt at
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-urlfilter.txt
> 2007-01-11 14:03:35,519 INFO conf.Configuration
> (Configuration.java:getConfResourceAsReader(438)) - prefix-urlfilter.txt
> not found
> 2007-01-11 14:03:35,706 INFO mapred.LocalJobRunner
> (LocalJobRunner.java:progress(169)) - reduce > reduce
> 2007-01-11 14:03:35,753 INFO mapred.JobClient
> (JobClient.java:runJob(401)) - Job complete: job_m7h3ig
> 2007-01-11 14:03:35,753 WARN crawl.Generator
> (Generator.java:generate(419)) - Generator: 0 records selected for
> fetching, exiting ...
> 2007-01-11 14:03:35,753 INFO crawl.Crawl (Crawl.java:main(121)) -
> Stopping at depth=0 - no more URLs to fetch.
> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(219)) -
> LinkDb: starting
> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(220)) -
> LinkDb: linkdb: crawl/linkdb
> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(221)) -
> LinkDb: URL normalize: true
> 2007-01-11 14:03:35,769 INFO crawl.LinkDb (LinkDb.java:invert(222)) -
> LinkDb: URL filter: true
> 2007-01-11 14:03:35,769 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:35,769 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:35,784 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:35,784 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,784 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,800 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:35,800 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/nutch-default.xml
> 2007-01-11 14:03:35,815 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> file:/C:/wkspc/nutch_trunk/tmpBuild/crawl-tool.xml
> 2007-01-11 14:03:35,815 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,815 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,815 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,831 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/hadoop-default.xml
> 2007-01-11 14:03:35,831 INFO conf.Configuration
> (Configuration.java:loadResource(495)) - parsing
> jar:file:/C:/wkspc/nutch_trunk/lib/hadoop-0.9.1.jar!/mapred-default.xml
> 2007-01-11 14:03:35,847 INFO conf.Configuration
> (Configuration.java:loadResource(504)) - parsing
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xml
> 2007-01-11 14:03:35,847 INFO mapred.JobClient
> (JobClient.java:runJob(370)) - Running job: job_kumfin
> 2007-01-11 14:03:35,847 WARN mapred.LocalJobRunner
> (LocalJobRunner.java:run(147)) - job_kumfin
> java.io.IOException: No input directories specified in: Configuration:
> defaults: hadoop-default.xml , mapred-default.xml ,
> /tmp/hadoop-tbenke/mapred/local/localRunner/job_kumfin.xmlfinal:
> hadoop-site.xml
> at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:99)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:39)
> at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:119)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:93)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:232)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:209)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>