Posted to user@nutch.apache.org by Manikandan Saravanan <ma...@thesocialpeople.net> on 2014/05/28 13:22:36 UTC

Nutch not generating any URLs

Hi, I’m running Nutch 2 on a 2-node Hadoop cluster to do whole-web crawling. I’m seeding about 700 URLs from the DMOZ directory, and about the same number is injected. The problem is that nothing is generated after the inject phase, so nothing is fetched or indexed either.

The trace of the entire crawl job is here:

14/05/28 06:54:23 INFO crawl.InjectorJob: InjectorJob: starting at 2014-05-28 06:54:23
14/05/28 06:54:23 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/05/28 06:54:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
14/05/28 06:54:25 INFO input.FileInputFormat: Total input paths to process : 1
14/05/28 06:54:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/28 06:54:25 WARN snappy.LoadSnappy: Snappy native library not loaded
14/05/28 06:54:25 INFO mapred.JobClient: Running job: job_201405280024_0015
14/05/28 06:54:26 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 06:54:36 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 06:54:40 INFO mapred.JobClient: Job complete: job_201405280024_0015
14/05/28 06:54:40 INFO mapred.JobClient: Counters: 20
14/05/28 06:54:40 INFO mapred.JobClient:   Job Counters 
14/05/28 06:54:40 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=10927
14/05/28 06:54:40 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 06:54:40 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 06:54:40 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 06:54:40 INFO mapred.JobClient:     Data-local map tasks=1
14/05/28 06:54:40 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/05/28 06:54:40 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 06:54:40 INFO mapred.JobClient:     Bytes Written=0
14/05/28 06:54:40 INFO mapred.JobClient:   injector
14/05/28 06:54:40 INFO mapred.JobClient:     urls_injected=765
14/05/28 06:54:40 INFO mapred.JobClient:     urls_filtered=14
14/05/28 06:54:40 INFO mapred.JobClient:   FileSystemCounters
14/05/28 06:54:40 INFO mapred.JobClient:     HDFS_BYTES_READ=26006
14/05/28 06:54:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=77762
14/05/28 06:54:40 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 06:54:40 INFO mapred.JobClient:     Bytes Read=25896
14/05/28 06:54:40 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 06:54:40 INFO mapred.JobClient:     Map input records=779
14/05/28 06:54:40 INFO mapred.JobClient:     Physical memory (bytes) snapshot=113258496
14/05/28 06:54:40 INFO mapred.JobClient:     Spilled Records=0
14/05/28 06:54:40 INFO mapred.JobClient:     CPU time spent (ms)=2530
14/05/28 06:54:40 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
14/05/28 06:54:40 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1118162944
14/05/28 06:54:40 INFO mapred.JobClient:     Map output records=765
14/05/28 06:54:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
14/05/28 06:54:40 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 14
14/05/28 06:54:40 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 765
14/05/28 06:54:40 INFO crawl.InjectorJob: Injector: finished at 2014-05-28 06:54:40, elapsed: 00:00:16
Wed May 28 06:54:40 EDT 2014 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
Warning: $HADOOP_HOME is deprecated.

14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: starting at 2014-05-28 06:54:42
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: starting
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: filtering: false
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false
14/05/28 06:54:42 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000
14/05/28 06:54:42 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
14/05/28 06:54:42 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
14/05/28 06:54:42 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
14/05/28 06:54:44 INFO mapred.JobClient: Running job: job_201405280024_0016
14/05/28 06:54:45 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 06:54:55 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 06:55:03 INFO mapred.JobClient:  map 100% reduce 16%
14/05/28 06:55:04 INFO mapred.JobClient:  map 100% reduce 50%
14/05/28 07:02:29 INFO mapred.JobClient: Task Id : attempt_201405280024_0016_r_000001_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
14/05/28 07:02:39 INFO mapred.JobClient:  map 100% reduce 66%
14/05/28 07:02:40 INFO mapred.JobClient:  map 100% reduce 100%
14/05/28 07:02:43 INFO mapred.JobClient: Job complete: job_201405280024_0016
14/05/28 07:02:43 INFO mapred.JobClient: Counters: 27
14/05/28 07:02:43 INFO mapred.JobClient:   Job Counters 
14/05/28 07:02:43 INFO mapred.JobClient:     Launched reduce tasks=3
14/05/28 07:02:43 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11387
14/05/28 07:02:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 07:02:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 07:02:43 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 07:02:43 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=23048
14/05/28 07:02:43 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 07:02:43 INFO mapred.JobClient:     Bytes Written=0
14/05/28 07:02:43 INFO mapred.JobClient:   FileSystemCounters
14/05/28 07:02:43 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/05/28 07:02:43 INFO mapred.JobClient:     HDFS_BYTES_READ=833
14/05/28 07:02:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=239555
14/05/28 07:02:43 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 07:02:43 INFO mapred.JobClient:     Bytes Read=0
14/05/28 07:02:43 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 07:02:43 INFO mapred.JobClient:     Map output materialized bytes=28
14/05/28 07:02:43 INFO mapred.JobClient:     Map input records=0
14/05/28 07:02:43 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/05/28 07:02:43 INFO mapred.JobClient:     Spilled Records=0
14/05/28 07:02:43 INFO mapred.JobClient:     Map output bytes=0
14/05/28 07:02:43 INFO mapred.JobClient:     Total committed heap usage (bytes)=277872640
14/05/28 07:02:43 INFO mapred.JobClient:     CPU time spent (ms)=4130
14/05/28 07:02:43 INFO mapred.JobClient:     Combine input records=0
14/05/28 07:02:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=833
14/05/28 07:02:43 INFO mapred.JobClient:     Reduce input records=0
14/05/28 07:02:43 INFO mapred.JobClient:     Reduce input groups=0
14/05/28 07:02:43 INFO mapred.JobClient:     Combine output records=0
14/05/28 07:02:43 INFO mapred.JobClient:     Physical memory (bytes) snapshot=422510592
14/05/28 07:02:43 INFO mapred.JobClient:     Reduce output records=0
14/05/28 07:02:43 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5982715904
14/05/28 07:02:43 INFO mapred.JobClient:     Map output records=0
14/05/28 07:02:43 INFO crawl.GeneratorJob: GeneratorJob: finished at 2014-05-28 07:02:43, time elapsed: 00:08:00
14/05/28 07:02:43 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1401274480-22738
Fetching : 
Warning: $HADOOP_HOME is deprecated.

14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: starting
14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: batchId: 1401274480-22738
14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: threads: 50
14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: parsing: false
14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob: resuming: false
14/05/28 07:02:45 INFO fetcher.FetcherJob: FetcherJob : timelimit set for : 1401285765716
14/05/28 07:02:46 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar110933996696870181/classes/plugins
14/05/28 07:02:46 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/05/28 07:02:46 INFO plugin.PluginRepository: Registered Plugins:
14/05/28 07:02:46 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Http / Https Protocol Plug-in (protocol-httpclient)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Creative Commons Plugins (creativecommons)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	More Indexing Filter (index-more)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	JavaScript Parser (parse-js)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/05/28 07:02:46 INFO plugin.PluginRepository: Registered Extension-Points:
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/05/28 07:02:46 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/05/28 07:02:46 INFO httpclient.Http: http.proxy.host = null
14/05/28 07:02:46 INFO httpclient.Http: http.proxy.port = 8080
14/05/28 07:02:46 INFO httpclient.Http: http.timeout = 10000
14/05/28 07:02:46 INFO httpclient.Http: http.content.limit = 65536
14/05/28 07:02:46 INFO httpclient.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net)
14/05/28 07:02:46 INFO httpclient.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/05/28 07:02:46 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/05/28 07:02:46 INFO conf.Configuration: found resource httpclient-auth.xml at file:/app/hadoop/tmp/hadoop-unjar110933996696870181/httpclient-auth.xml
14/05/28 07:02:46 INFO httpclient.Http: http.proxy.host = null
14/05/28 07:02:46 INFO httpclient.Http: http.proxy.port = 8080
14/05/28 07:02:46 INFO httpclient.Http: http.timeout = 10000
14/05/28 07:02:46 INFO httpclient.Http: http.content.limit = 65536
14/05/28 07:02:46 INFO httpclient.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net)
14/05/28 07:02:46 INFO httpclient.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/05/28 07:02:46 INFO httpclient.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/05/28 07:02:49 INFO mapred.JobClient: Running job: job_201405280024_0017
14/05/28 07:02:50 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 07:03:01 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 07:03:10 INFO mapred.JobClient:  map 100% reduce 16%
14/05/28 07:03:13 INFO mapred.JobClient:  map 100% reduce 50%
14/05/28 07:10:34 INFO mapred.JobClient: Task Id : attempt_201405280024_0017_r_000001_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
14/05/28 07:10:44 INFO mapred.JobClient:  map 100% reduce 66%
14/05/28 07:10:47 INFO mapred.JobClient:  map 100% reduce 100%
14/05/28 07:10:54 INFO mapred.JobClient: Job complete: job_201405280024_0017
14/05/28 07:10:54 INFO mapred.JobClient: Counters: 28
14/05/28 07:10:54 INFO mapred.JobClient:   Job Counters 
14/05/28 07:10:54 INFO mapred.JobClient:     Launched reduce tasks=3
14/05/28 07:10:54 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11752
14/05/28 07:10:54 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 07:10:54 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 07:10:54 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 07:10:54 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=33613
14/05/28 07:10:54 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 07:10:54 INFO mapred.JobClient:     Bytes Written=0
14/05/28 07:10:54 INFO mapred.JobClient:   FileSystemCounters
14/05/28 07:10:54 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/05/28 07:10:54 INFO mapred.JobClient:     HDFS_BYTES_READ=817
14/05/28 07:10:54 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=238025
14/05/28 07:10:54 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 07:10:54 INFO mapred.JobClient:     Bytes Read=0
14/05/28 07:10:54 INFO mapred.JobClient:   FetcherStatus
14/05/28 07:10:54 INFO mapred.JobClient:     HitByTimeLimit-QueueFeeder=0
14/05/28 07:10:54 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 07:10:54 INFO mapred.JobClient:     Map output materialized bytes=28
14/05/28 07:10:54 INFO mapred.JobClient:     Map input records=0
14/05/28 07:10:54 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/05/28 07:10:54 INFO mapred.JobClient:     Spilled Records=0
14/05/28 07:10:54 INFO mapred.JobClient:     Map output bytes=0
14/05/28 07:10:54 INFO mapred.JobClient:     Total committed heap usage (bytes)=317194240
14/05/28 07:10:54 INFO mapred.JobClient:     CPU time spent (ms)=6460
14/05/28 07:10:54 INFO mapred.JobClient:     Combine input records=0
14/05/28 07:10:54 INFO mapred.JobClient:     SPLIT_RAW_BYTES=817
14/05/28 07:10:54 INFO mapred.JobClient:     Reduce input records=0
14/05/28 07:10:54 INFO mapred.JobClient:     Reduce input groups=0
14/05/28 07:10:54 INFO mapred.JobClient:     Combine output records=0
14/05/28 07:10:54 INFO mapred.JobClient:     Physical memory (bytes) snapshot=444006400
14/05/28 07:10:54 INFO mapred.JobClient:     Reduce output records=0
14/05/28 07:10:54 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6052544512
14/05/28 07:10:54 INFO mapred.JobClient:     Map output records=0
14/05/28 07:10:54 INFO fetcher.FetcherJob: FetcherJob: done
Parsing : 
Warning: $HADOOP_HOME is deprecated.

14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: starting
14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: resuming:	false
14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: forced reparse:	false
14/05/28 07:10:56 INFO parse.ParserJob: ParserJob: batchId:	1401274480-22738
14/05/28 07:10:57 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar1161270060222812225/classes/plugins
14/05/28 07:10:57 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/05/28 07:10:57 INFO plugin.PluginRepository: Registered Plugins:
14/05/28 07:10:57 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Http / Https Protocol Plug-in (protocol-httpclient)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Creative Commons Plugins (creativecommons)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	More Indexing Filter (index-more)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	JavaScript Parser (parse-js)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/05/28 07:10:57 INFO plugin.PluginRepository: Registered Extension-Points:
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/05/28 07:10:57 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/05/28 07:10:57 INFO conf.Configuration: found resource parse-plugins.xml at file:/app/hadoop/tmp/hadoop-unjar1161270060222812225/parse-plugins.xml
14/05/28 07:10:57 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/05/28 07:10:59 INFO mapred.JobClient: Running job: job_201405280024_0018
14/05/28 07:11:00 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 07:11:07 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 07:11:09 INFO mapred.JobClient: Job complete: job_201405280024_0018
14/05/28 07:11:09 INFO mapred.JobClient: Counters: 17
14/05/28 07:11:09 INFO mapred.JobClient:   Job Counters 
14/05/28 07:11:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=7869
14/05/28 07:11:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 07:11:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 07:11:09 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 07:11:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/05/28 07:11:09 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 07:11:09 INFO mapred.JobClient:     Bytes Written=0
14/05/28 07:11:09 INFO mapred.JobClient:   FileSystemCounters
14/05/28 07:11:09 INFO mapred.JobClient:     HDFS_BYTES_READ=861
14/05/28 07:11:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78891
14/05/28 07:11:09 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 07:11:09 INFO mapred.JobClient:     Bytes Read=0
14/05/28 07:11:09 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 07:11:09 INFO mapred.JobClient:     Map input records=0
14/05/28 07:11:09 INFO mapred.JobClient:     Physical memory (bytes) snapshot=114253824
14/05/28 07:11:09 INFO mapred.JobClient:     Spilled Records=0
14/05/28 07:11:09 INFO mapred.JobClient:     CPU time spent (ms)=1070
14/05/28 07:11:09 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
14/05/28 07:11:09 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1987776512
14/05/28 07:11:09 INFO mapred.JobClient:     Map output records=0
14/05/28 07:11:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=861
14/05/28 07:11:09 INFO parse.ParserJob: ParserJob: success
CrawlDB update for TestCrawl
Warning: $HADOOP_HOME is deprecated.

14/05/28 07:11:12 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
14/05/28 07:11:13 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar5400634919722418143/classes/plugins
14/05/28 07:11:13 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/05/28 07:11:13 INFO plugin.PluginRepository: Registered Plugins:
14/05/28 07:11:13 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Http / Https Protocol Plug-in (protocol-httpclient)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Creative Commons Plugins (creativecommons)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	More Indexing Filter (index-more)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	JavaScript Parser (parse-js)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/05/28 07:11:13 INFO plugin.PluginRepository: Registered Extension-Points:
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/05/28 07:11:13 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/05/28 07:11:16 INFO mapred.JobClient: Running job: job_201405280024_0019
14/05/28 07:11:17 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 07:11:28 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 07:11:38 INFO mapred.JobClient:  map 100% reduce 16%
14/05/28 07:11:39 INFO mapred.JobClient:  map 100% reduce 50%
14/05/28 07:19:00 INFO mapred.JobClient: Task Id : attempt_201405280024_0019_r_000001_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
14/05/28 07:19:00 WARN mapred.JobClient: Error reading task outputnutch-two-qontifi
14/05/28 07:19:00 WARN mapred.JobClient: Error reading task outputnutch-two-qontifi
14/05/28 07:19:11 INFO mapred.JobClient:  map 100% reduce 66%
14/05/28 07:19:12 INFO mapred.JobClient:  map 100% reduce 100%
14/05/28 07:19:13 INFO mapred.JobClient: Job complete: job_201405280024_0019
14/05/28 07:19:13 INFO mapred.JobClient: Counters: 27
14/05/28 07:19:13 INFO mapred.JobClient:   Job Counters 
14/05/28 07:19:13 INFO mapred.JobClient:     Launched reduce tasks=3
14/05/28 07:19:13 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=10614
14/05/28 07:19:13 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 07:19:13 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 07:19:13 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 07:19:13 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=23263
14/05/28 07:19:13 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 07:19:13 INFO mapred.JobClient:     Bytes Written=0
14/05/28 07:19:13 INFO mapred.JobClient:   FileSystemCounters
14/05/28 07:19:13 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/05/28 07:19:13 INFO mapred.JobClient:     HDFS_BYTES_READ=910
14/05/28 07:19:13 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=238016
14/05/28 07:19:13 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 07:19:13 INFO mapred.JobClient:     Bytes Read=0
14/05/28 07:19:13 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 07:19:13 INFO mapred.JobClient:     Map output materialized bytes=28
14/05/28 07:19:13 INFO mapred.JobClient:     Map input records=0
14/05/28 07:19:13 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/05/28 07:19:13 INFO mapred.JobClient:     Spilled Records=0
14/05/28 07:19:13 INFO mapred.JobClient:     Map output bytes=0
14/05/28 07:19:13 INFO mapred.JobClient:     Total committed heap usage (bytes)=293601280
14/05/28 07:19:13 INFO mapred.JobClient:     CPU time spent (ms)=6540
14/05/28 07:19:13 INFO mapred.JobClient:     Combine input records=0
14/05/28 07:19:13 INFO mapred.JobClient:     SPLIT_RAW_BYTES=910
14/05/28 07:19:13 INFO mapred.JobClient:     Reduce input records=0
14/05/28 07:19:13 INFO mapred.JobClient:     Reduce input groups=0
14/05/28 07:19:13 INFO mapred.JobClient:     Combine output records=0
14/05/28 07:19:13 INFO mapred.JobClient:     Physical memory (bytes) snapshot=470159360
14/05/28 07:19:13 INFO mapred.JobClient:     Reduce output records=0
14/05/28 07:19:13 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5987823616
14/05/28 07:19:13 INFO mapred.JobClient:     Map output records=0
14/05/28 07:19:13 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
Indexing TestCrawl on SOLR index -> http://128.199.207.54:8983/solr/nutch
Warning: $HADOOP_HOME is deprecated.

14/05/28 07:19:16 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
14/05/28 07:19:16 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar5241938989393377870/classes/plugins
14/05/28 07:19:16 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/05/28 07:19:16 INFO plugin.PluginRepository: Registered Plugins:
14/05/28 07:19:16 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Http / Https Protocol Plug-in (protocol-httpclient)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Creative Commons Plugins (creativecommons)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	More Indexing Filter (index-more)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	JavaScript Parser (parse-js)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/05/28 07:19:16 INFO plugin.PluginRepository: Registered Extension-Points:
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/05/28 07:19:16 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/05/28 07:19:16 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
14/05/28 07:19:16 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
14/05/28 07:19:16 INFO indexer.IndexingFilters: Adding org.creativecommons.nutch.CCIndexingFilter
14/05/28 07:19:17 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.more.MoreIndexingFilter
14/05/28 07:19:21 INFO mapred.JobClient: Running job: job_201405280024_0020
14/05/28 07:19:22 INFO mapred.JobClient:  map 0% reduce 0%
14/05/28 07:19:31 INFO mapred.JobClient:  map 100% reduce 0%
14/05/28 07:19:33 INFO mapred.JobClient: Job complete: job_201405280024_0020
14/05/28 07:19:33 INFO mapred.JobClient: Counters: 17
14/05/28 07:19:33 INFO mapred.JobClient:   Job Counters 
14/05/28 07:19:33 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=9290
14/05/28 07:19:33 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/28 07:19:33 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/05/28 07:19:33 INFO mapred.JobClient:     Launched map tasks=1
14/05/28 07:19:33 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/05/28 07:19:33 INFO mapred.JobClient:   File Output Format Counters 
14/05/28 07:19:33 INFO mapred.JobClient:     Bytes Written=0
14/05/28 07:19:33 INFO mapred.JobClient:   FileSystemCounters
14/05/28 07:19:33 INFO mapred.JobClient:     HDFS_BYTES_READ=877
14/05/28 07:19:33 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=79006
14/05/28 07:19:33 INFO mapred.JobClient:   File Input Format Counters 
14/05/28 07:19:33 INFO mapred.JobClient:     Bytes Read=0
14/05/28 07:19:33 INFO mapred.JobClient:   Map-Reduce Framework
14/05/28 07:19:33 INFO mapred.JobClient:     Map input records=0
14/05/28 07:19:33 INFO mapred.JobClient:     Physical memory (bytes) snapshot=117587968
14/05/28 07:19:33 INFO mapred.JobClient:     Spilled Records=0
14/05/28 07:19:33 INFO mapred.JobClient:     CPU time spent (ms)=1040
14/05/28 07:19:33 INFO mapred.JobClient:     Total committed heap usage (bytes)=59768832
14/05/28 07:19:33 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1992785920
14/05/28 07:19:33 INFO mapred.JobClient:     Map output records=0
14/05/28 07:19:33 INFO mapred.JobClient:     SPLIT_RAW_BYTES=877
14/05/28 07:19:33 INFO solr.SolrIndexerJob: SolrIndexerJob: done.
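
The counters in the trace above already show where the records disappear: the inject job reports 765 map output records, while the generate job that follows reports 0 map input records. As a minimal sketch (the function name and regular expressions are made up for this illustration and only assume the Hadoop 1.x JobClient log layout seen in this trace), the per-job record counters can be pulled out of a saved log like this:

```python
import re

# Patterns assume the JobClient log layout shown in the trace above.
JOB_RE = re.compile(r"Running job: (job_\d+_\d+)")
COUNTER_RE = re.compile(
    r"mapred\.JobClient:\s+(Map input records|Map output records)=(\d+)"
)


def records_per_job(log_text):
    """Map each job id to its map input/output record counters."""
    jobs = {}
    current = None
    for line in log_text.splitlines():
        job_match = JOB_RE.search(line)
        if job_match:
            # A new job starts; later counters belong to it.
            current = job_match.group(1)
            jobs[current] = {}
            continue
        counter_match = COUNTER_RE.search(line)
        if counter_match and current is not None:
            jobs[current][counter_match.group(1)] = int(counter_match.group(2))
    return jobs
```

Calling `records_per_job(open("crawl.log").read())` (the file name is a placeholder) gives one entry per job, which makes it easy to see the first job whose map input drops to zero.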

 Am I missing anything?

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

Re: Nutch not generating any URLs

Posted by Talat Uyarer <ta...@uyarer.com>.
Hi,

14/05/28 07:02:29 INFO mapred.JobClient: Task Id : attempt_201405280024_0016_r_000001_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

This error suggests that you have a firewall or DNS problem between your
nodes. Have a look at this link: http://stackoverflow.com/questions/10729543/shuffle-errorexceeded-max-failed-unique-matche-bailing-out
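
A rough way to check for this kind of DNS or firewall problem is to run a small script on each node and confirm that every other node's hostname resolves and that the port the reducers fetch map output from accepts connections. The sketch below is only an illustration: the function names are made up for this example, and the port to test is an assumption (50060 is the Hadoop 1.x default for the TaskTracker HTTP server, so use whatever your cluster is configured with).

```python
import socket


def check_node(host, port, timeout=3.0):
    """Return (resolved_ip, reachable) for one host/port pair."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        # The name does not resolve: DNS or /etc/hosts problem.
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Refused or timed out: firewall, or nothing listening on the port.
        return ip, False


def report(hosts, port):
    """Build one human-readable status line per host."""
    lines = []
    for host in hosts:
        ip, reachable = check_node(host, port)
        if ip is None:
            lines.append(f"{host}: does not resolve")
        elif not reachable:
            lines.append(f"{host} ({ip}): port {port} unreachable")
        else:
            lines.append(f"{host} ({ip}): port {port} reachable")
    return lines
```

For example, `print("\n".join(report(["nutch-two-qontifi"], 50060)))` run from the other node would test the host name that appears in the trace; replace the list with your own node names. A name that does not resolve points to DNS or `/etc/hosts`, while a name that resolves but is unreachable points to a firewall or a service that is not listening.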

Talat


-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304