Posted to common-user@hadoop.apache.org by jibjoice <su...@hotmail.com> on 2008/01/02 11:08:09 UTC
Re: Nutch crawl problem
I am crawling "http://lucene.apache.org", and in conf/crawl-urlfilter.txt I set
"+^http://([a-z0-9]*\.)*apache.org/". When I run the command
"bin/nutch crawl urls -dir crawled -depth 3" I get this error:
- crawl started in: crawled
- rootUrlDir = urls
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /user/nutch/urls
at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
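[The exception above says the Injector looked for the seed directory /user/nutch/urls in HDFS and it was not there. A minimal sketch of preparing one, assuming a seed file name of seed.txt and that the crawl is run as user nutch; the dfs -put step needs a running cluster, so it is shown commented out:]

```shell
# Create a local seed directory with one URL per line
# (the file name seed.txt is an assumption; any name works):
mkdir -p urls
echo "http://lucene.apache.org/" > urls/seed.txt
# With a running cluster, copy it into HDFS so "bin/nutch crawl urls ..." can find it:
# bin/hadoop dfs -put urls urls
cat urls/seed.txt
```

[In the retry below, the poster used a directory named "inputs" that did exist, so the inject jobs ran.]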
-bash-3.1$ bin/nutch crawl inputs -dir crawled -depth 3
- crawl started in: crawled
- rootUrlDir = inputs
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: inputs
- Injector: Converting injected urls to crawl db entries.
- Total input paths to process : 1
- Running job: job_0001
- map 0% reduce 0%
- map 100% reduce 0%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0001
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=25
- Map output bytes=55
- Reduce input records=1
- Reduce output records=1
- Injector: Merging injected urls into crawl db.
- Total input paths to process : 2
- Running job: job_0002
- map 0% reduce 0%
- Task Id : task_0002_m_000000_0, Status : FAILED
task_0002_m_000000_0: - Plugins: looking in: /nutch/search/build/plugins
task_0002_m_000000_0: - Plugin Auto-activation mode: [true]
task_0002_m_000000_0: - Registered Plugins:
task_0002_m_000000_0: - the nutch core extension points (nutch-extensionpoints)
task_0002_m_000000_0: - Basic Query Filter (query-basic)
task_0002_m_000000_0: - Basic URL Normalizer (urlnormalizer-basic)
task_0002_m_000000_0: - Basic Indexing Filter (index-basic)
task_0002_m_000000_0: - Html Parse Plug-in (parse-html)
task_0002_m_000000_0: - Basic Summarizer Plug-in (summary-basic)
task_0002_m_000000_0: - Site Query Filter (query-site)
task_0002_m_000000_0: - HTTP Framework (lib-http)
task_0002_m_000000_0: - Text Parse Plug-in (parse-text)
task_0002_m_000000_0: - Regex URL Filter (urlfilter-regex)
task_0002_m_000000_0: - Pass-through URL Normalizer (urlnormalizer-pass)
task_0002_m_000000_0: - Http Protocol Plug-in (protocol-http)
task_0002_m_000000_0: - Regex URL Normalizer (urlnormalizer-regex)
task_0002_m_000000_0: - OPIC Scoring Plug-in (scoring-opic)
task_0002_m_000000_0: - CyberNeko HTML Parser (lib-nekohtml)
task_0002_m_000000_0: - JavaScript Parser (parse-js)
task_0002_m_000000_0: - URL Query Filter (query-url)
task_0002_m_000000_0: - Regex URL Filter Framework (lib-regex-filter)
task_0002_m_000000_0: - Registered Extension-Points:
task_0002_m_000000_0: - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
task_0002_m_000000_0: - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
task_0002_m_000000_0: - Nutch Protocol (org.apache.nutch.protocol.Protocol)
task_0002_m_000000_0: - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
task_0002_m_000000_0: - Nutch URL Filter (org.apache.nutch.net.URLFilter)
task_0002_m_000000_0: - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
task_0002_m_000000_0: - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
task_0002_m_000000_0: - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
task_0002_m_000000_0: - Nutch Content Parser (org.apache.nutch.parse.Parser)
task_0002_m_000000_0: - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
task_0002_m_000000_0: - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
task_0002_m_000000_0: - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
task_0002_m_000000_0: - found resource crawl-urlfilter.txt at file:/nutch/search/conf/crawl-urlfilter.txt
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 25%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0002
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=63
- Map output bytes=55
- Reduce input records=1
- Reduce output records=1
- Injector: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawled/segments/25510102165746
- Generator: filtering: false
- Generator: topN: 2147483647
- Total input paths to process : 2
- Running job: job_0003
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- map 100% reduce 8%
- map 100% reduce 16%
- map 100% reduce 58%
- map 100% reduce 100%
- Job complete: job_0003
- Counters: 6
- Map-Reduce Framework
- Map input records=3
- Map output records=1
- Map input bytes=62
- Map output bytes=80
- Reduce input records=1
- Reduce output records=1
- Generator: Partitioning selected urls by host, for politeness.
- Total input paths to process : 2
- Running job: job_0004
- map 0% reduce 0%
- map 50% reduce 0%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_0, Status : FAILED
- Task Id : task_0004_r_000001_0, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_1, Status : FAILED
- Task Id : task_0004_r_000001_1, Status : FAILED
- map 100% reduce 8%
- map 100% reduce 0%
- Task Id : task_0004_r_000000_2, Status : FAILED
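[The client output above reports the reduce attempts as FAILED without any reason; the actual error is normally in the per-task logs on the tasktracker node. A sketch for locating them, assuming the default layout of logs/userlogs/<attempt-id>/ under the Hadoop install directory (the attempt ids are taken from the output above):]

```shell
# Failed reduce attempt ids copied from the job output above; their stderr
# lives under the tasktracker's log directory (layout assumed):
for attempt in task_0004_r_000000_0 task_0004_r_000001_0 task_0004_r_000000_1; do
  log="logs/userlogs/$attempt/stderr"
  # print the log if it is on this node, otherwise note where to look
  [ -f "$log" ] && cat "$log" || echo "check $log on the tasktracker node"
done
```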
I am using hadoop-0.12.2, nutch-0.9, and Java JDK 1.6.0. Why does this happen? I have
not been able to solve it for a month.
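[Independently of the reduce failures, it is worth confirming that the filter line actually accepts the seed URL. A quick local sanity check, approximating the Java regex from crawl-urlfilter.txt (minus the leading '+') with grep -E:]

```shell
# The pattern from conf/crawl-urlfilter.txt, without the '+' accept marker:
pattern='^http://([a-z0-9]*\.)*apache.org/'
# A matching URL is printed back; a non-matching one is reported as rejected:
echo "http://lucene.apache.org/" | grep -E "$pattern"
echo "http://example.com/" | grep -E "$pattern" || echo "rejected"
```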
--
View this message in context: http://www.nabble.com/Nutch-crawl-problem-tp14327978p14575918.html
Sent from the Hadoop Users mailing list archive at Nabble.com.