Posted to common-user@hadoop.apache.org by jibjoice <su...@hotmail.com> on 2008/01/02 11:08:09 UTC

Re: Nutch crawl problem

I crawl "http://lucene.apache.org", and in conf/crawl-urlfilter.txt I set
"+^http://([a-z0-9]*\.)*apache.org/". When I run the command "bin/nutch crawl
urls -dir crawled -depth 3" I get this error:
- crawl started in: crawled
- rootUrlDir = urls
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: urls
- Injector: Converting injected urls to crawl db entries.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path doesnt exist : /user/nutch/urls
        at
org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

Then I run the crawl again with the "inputs" directory:

-bash-3.1$ bin/nutch crawl inputs -dir crawled -depth 3
- crawl started in: crawled
- rootUrlDir = inputs
- threads = 10
- depth = 3
- Injector: starting
- Injector: crawlDb: crawled/crawldb
- Injector: urlDir: inputs
- Injector: Converting injected urls to crawl db entries.
- Total input paths to process : 1
- Running job: job_0001
-  map 0% reduce 0%
-  map 100% reduce 0%
-  map 100% reduce 16%
-  map 100% reduce 58%
-  map 100% reduce 100%
- Job complete: job_0001
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=25
-     Map output bytes=55
-     Reduce input records=1
-     Reduce output records=1
- Injector: Merging injected urls into crawl db.
- Total input paths to process : 2
- Running job: job_0002
-  map 0% reduce 0%
- Task Id : task_0002_m_000000_0, Status : FAILED
task_0002_m_000000_0: - Plugins: looking in: /nutch/search/build/plugins
task_0002_m_000000_0: - Plugin Auto-activation mode: [true]
task_0002_m_000000_0: - Registered Plugins:
task_0002_m_000000_0: -         the nutch core extension points
(nutch-extensionpoints)
task_0002_m_000000_0: -         Basic Query Filter (query-basic)
task_0002_m_000000_0: -         Basic URL Normalizer (urlnormalizer-basic)
task_0002_m_000000_0: -         Basic Indexing Filter (index-basic)
task_0002_m_000000_0: -         Html Parse Plug-in (parse-html)
task_0002_m_000000_0: -         Basic Summarizer Plug-in (summary-basic)
task_0002_m_000000_0: -         Site Query Filter (query-site)
task_0002_m_000000_0: -         HTTP Framework (lib-http)
task_0002_m_000000_0: -         Text Parse Plug-in (parse-text)
task_0002_m_000000_0: -         Regex URL Filter (urlfilter-regex)
task_0002_m_000000_0: -         Pass-through URL Normalizer
(urlnormalizer-pass)
task_0002_m_000000_0: -         Http Protocol Plug-in (protocol-http)
task_0002_m_000000_0: -         Regex URL Normalizer (urlnormalizer-regex)
task_0002_m_000000_0: -         OPIC Scoring Plug-in (scoring-opic)
task_0002_m_000000_0: -         CyberNeko HTML Parser (lib-nekohtml)
task_0002_m_000000_0: -         JavaScript Parser (parse-js)
task_0002_m_000000_0: -         URL Query Filter (query-url)
task_0002_m_000000_0: -         Regex URL Filter Framework
(lib-regex-filter)
task_0002_m_000000_0: - Registered Extension-Points:
task_0002_m_000000_0: -         Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
task_0002_m_000000_0: -         Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
task_0002_m_000000_0: -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
task_0002_m_000000_0: -         Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
task_0002_m_000000_0: -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
task_0002_m_000000_0: -         Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
task_0002_m_000000_0: -         Nutch Online Search Results Clustering
Plugin (org.apache.nutch.clustering.OnlineClusterer)
task_0002_m_000000_0: -         HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
task_0002_m_000000_0: -         Nutch Content Parser
(org.apache.nutch.parse.Parser)
task_0002_m_000000_0: -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
task_0002_m_000000_0: -         Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
task_0002_m_000000_0: -         Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
task_0002_m_000000_0: - found resource crawl-urlfilter.txt at
file:/nutch/search/conf/crawl-urlfilter.txt
-  map 50% reduce 0%
-  map 100% reduce 0%
-  map 100% reduce 8%
-  map 100% reduce 25%
-  map 100% reduce 58%
-  map 100% reduce 100%
- Job complete: job_0002
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=63
-     Map output bytes=55
-     Reduce input records=1
-     Reduce output records=1
- Injector: done
- Generator: Selecting best-scoring urls due for fetch.
- Generator: starting
- Generator: segment: crawled/segments/25510102165746
- Generator: filtering: false
- Generator: topN: 2147483647
- Total input paths to process : 2
- Running job: job_0003
-  map 0% reduce 0%
-  map 50% reduce 0%
-  map 100% reduce 0%
-  map 100% reduce 8%
-  map 100% reduce 16%
-  map 100% reduce 58%
-  map 100% reduce 100%
- Job complete: job_0003
- Counters: 6
-   Map-Reduce Framework
-     Map input records=3
-     Map output records=1
-     Map input bytes=62
-     Map output bytes=80
-     Reduce input records=1
-     Reduce output records=1
- Generator: Partitioning selected urls by host, for politeness.
- Total input paths to process : 2
- Running job: job_0004
-  map 0% reduce 0%
-  map 50% reduce 0%
-  map 100% reduce 0%
- Task Id : task_0004_r_000000_0, Status : FAILED
- Task Id : task_0004_r_000001_0, Status : FAILED
-  map 100% reduce 8%
-  map 100% reduce 0%
- Task Id : task_0004_r_000000_1, Status : FAILED
- Task Id : task_0004_r_000001_1, Status : FAILED
-  map 100% reduce 8%
-  map 100% reduce 0%
- Task Id : task_0004_r_000000_2, Status : FAILED

I am using hadoop-0.12.2, nutch-0.9 and java jdk1.6.0. Why do the job_0004 reduce
tasks keep failing? I have not been able to solve this for a month.
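
I cannot find any more detail about why the reduce tasks fail in the output above. I guess the real error should be in the per-task logs on the tasktracker that ran them (assuming the default log layout under the Hadoop logs/userlogs directory, or through the jobtracker web UI on port 50030), for example:

-bash-3.1$ ls logs/userlogs/task_0004_r_000000_0/     # assuming default HADOOP_LOG_DIR
-bash-3.1$ cat logs/userlogs/task_0004_r_000000_0/*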