Posted to user@nutch.apache.org by Manikandan Saravanan <ma...@thesocialpeople.net> on 2014/06/05 21:14:47 UTC

Injector works. But generator and fetcher don't work.

Dear Lewis,

I’m running Nutch 2 on a two-node Hadoop 1.2.1 cluster, with Cassandra as my backend datastore. For now I’m trying to crawl a single link. The inject command works properly: I can see one row added to the “webpage” keyspace in Cassandra. But the generator does nothing (its job reports Map input records=0), and neither does the fetcher. In the end, nothing is indexed in Solr.

Please help me out. The full console output of the crawl is:

hduser@nutch-one-qontifi:/usr/local/nutch$ bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: starting at 2014-06-05 15:00:34
14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/06/05 15:00:36 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:00:40 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:00:41 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/06/05 15:00:44 INFO input.FileInputFormat: Total input paths to process : 1
14/06/05 15:00:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/06/05 15:00:44 WARN snappy.LoadSnappy: Snappy native library not loaded
14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011
14/06/05 15:00:45 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:01:00 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:01:02 INFO mapred.JobClient: Job complete: job_201406051410_0011
14/06/05 15:01:02 INFO mapred.JobClient: Counters: 19
14/06/05 15:01:02 INFO mapred.JobClient:   Job Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14861
14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:01:02 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:01:02 INFO mapred.JobClient:     Data-local map tasks=1
14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:01:02 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:01:02 INFO mapred.JobClient:   injector
14/06/05 15:01:02 INFO mapred.JobClient:     urls_injected=1
14/06/05 15:01:02 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:01:02 INFO mapred.JobClient:     HDFS_BYTES_READ=135
14/06/05 15:01:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=77648
14/06/05 15:01:02 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Read=25
14/06/05 15:01:02 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
14/06/05 15:01:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=122052608
14/06/05 15:01:02 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:01:02 INFO mapred.JobClient:     CPU time spent (ms)=1490
14/06/05 15:01:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
14/06/05 15:01:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1119281152
14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
14/06/05 15:01:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05 15:01:02, elapsed: 00:00:28
Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting at 2014-06-05 15:01:06
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: false
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000
14/06/05 15:01:06 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
14/06/05 15:01:07 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:01:11 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
14/06/05 15:01:16 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:01:55 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:02:05 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:02:08 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:02:10 INFO mapred.JobClient:  map 100% reduce 83%
14/06/05 15:02:11 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:02:14 INFO mapred.JobClient: Job complete: job_201406051410_0012
14/06/05 15:02:14 INFO mapred.JobClient: Counters: 27
14/06/05 15:02:14 INFO mapred.JobClient:   Job Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39990
14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:02:14 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=29119
14/06/05 15:02:14 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:02:14 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:02:14 INFO mapred.JobClient:     HDFS_BYTES_READ=951
14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=239453
14/06/05 15:02:14 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:02:14 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:02:14 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:02:14 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:02:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=333971456
14/06/05 15:02:14 INFO mapred.JobClient:     CPU time spent (ms)=9330
14/06/05 15:02:14 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=951
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:02:14 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=486813696
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6016212992
14/06/05 15:02:14 INFO mapred.JobClient:     Map output records=0
14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: finished at 2014-06-05 15:02:14, time elapsed: 00:01:08
14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1401994862-29963
Fetching : 
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: starting
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: batchId: 1401994862-29963
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: threads: 50
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: parsing: false
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: resuming: false
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob : timelimit set for : 1402005738902
14/06/05 15:02:19 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar813633856909664022/classes/plugins
14/06/05 15:02:20 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:02:20 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http)
14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:02:20 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:02:20 INFO http.Http: http.proxy.host = null
14/06/05 15:02:20 INFO http.Http: http.proxy.port = 8080
14/06/05 15:02:20 INFO http.Http: http.timeout = 10000
14/06/05 15:02:20 INFO http.Http: http.content.limit = 65536
14/06/05 15:02:20 INFO http.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net)
14/06/05 15:02:20 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/06/05 15:02:20 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/06/05 15:02:20 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:02:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:02:29 INFO mapred.JobClient: Running job: job_201406051410_0013
14/06/05 15:02:30 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:03:05 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:03:14 INFO mapred.JobClient:  map 100% reduce 16%
14/06/05 15:03:16 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:03:17 INFO mapred.JobClient:  map 100% reduce 50%
14/06/05 15:03:19 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:03:23 INFO mapred.JobClient:  map 100% reduce 83%
14/06/05 15:03:28 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:03:31 INFO mapred.JobClient: Job complete: job_201406051410_0013
14/06/05 15:03:31 INFO mapred.JobClient: Counters: 28
14/06/05 15:03:31 INFO mapred.JobClient:   Job Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=37163
14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:03:31 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=39755
14/06/05 15:03:31 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:03:31 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:03:31 INFO mapred.JobClient:     HDFS_BYTES_READ=935
14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237923
14/06/05 15:03:31 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:03:31 INFO mapred.JobClient:   FetcherStatus
14/06/05 15:03:31 INFO mapred.JobClient:     HitByTimeLimit-QueueFeeder=0
14/06/05 15:03:31 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:03:31 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:03:31 INFO mapred.JobClient:     Map input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:03:31 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:03:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=375914496
14/06/05 15:03:31 INFO mapred.JobClient:     CPU time spent (ms)=9820
14/06/05 15:03:31 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=935
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:03:31 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=510382080
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6060650496
14/06/05 15:03:31 INFO mapred.JobClient:     Map output records=0
14/06/05 15:03:31 INFO fetcher.FetcherJob: FetcherJob: done
Parsing : 
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: starting
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: resuming:	false
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: forced reparse:	false
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: batchId:	1401994862-29963
14/06/05 15:03:35 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar8143815380567453850/classes/plugins
14/06/05 15:03:36 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:03:36 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http)
14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:03:36 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:03:36 INFO conf.Configuration: found resource parse-plugins.xml at file:/app/hadoop/tmp/hadoop-unjar8143815380567453850/parse-plugins.xml
14/06/05 15:03:36 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/06/05 15:03:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:03:41 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:03:45 INFO mapred.JobClient: Running job: job_201406051410_0014
14/06/05 15:03:46 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:04:22 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:04:24 INFO mapred.JobClient: Job complete: job_201406051410_0014
14/06/05 15:04:25 INFO mapred.JobClient: Counters: 17
14/06/05 15:04:25 INFO mapred.JobClient:   Job Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36653
14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:04:25 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:04:25 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:04:25 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:04:25 INFO mapred.JobClient:     HDFS_BYTES_READ=979
14/06/05 15:04:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78853
14/06/05 15:04:25 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:04:25 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:04:25 INFO mapred.JobClient:     Map input records=0
14/06/05 15:04:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=129826816
14/06/05 15:04:25 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:04:25 INFO mapred.JobClient:     CPU time spent (ms)=2330
14/06/05 15:04:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=60817408
14/06/05 15:04:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2000629760
14/06/05 15:04:25 INFO mapred.JobClient:     Map output records=0
14/06/05 15:04:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=979
14/06/05 15:04:25 INFO parse.ParserJob: ParserJob: success
CrawlDB update for TestCrawl
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:04:28 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
14/06/05 15:04:29 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar4238316120015868426/classes/plugins
14/06/05 15:04:29 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:04:29 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http)
14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:04:29 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:04:30 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:04:34 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:04:38 INFO mapred.JobClient: Running job: job_201406051410_0015
14/06/05 15:04:39 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:05:21 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:05:31 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:05:34 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:05:37 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:05:39 INFO mapred.JobClient: Job complete: job_201406051410_0015
14/06/05 15:05:39 INFO mapred.JobClient: Counters: 27
14/06/05 15:05:39 INFO mapred.JobClient:   Job Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39898
14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:05:39 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=30439
14/06/05 15:05:39 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:05:39 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:05:39 INFO mapred.JobClient:     HDFS_BYTES_READ=1028
14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237914
14/06/05 15:05:39 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:05:39 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:05:39 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:05:39 INFO mapred.JobClient:     Map input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:05:39 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:05:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=375914496
14/06/05 15:05:39 INFO mapred.JobClient:     CPU time spent (ms)=8880
14/06/05 15:05:39 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1028
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:05:39 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=490651648
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6002880512
14/06/05 15:05:39 INFO mapred.JobClient:     Map output records=0
14/06/05 15:05:39 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
Indexing TestCrawl on SOLR index -> http://10.130.231.16:8983/solr/nutch
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:05:43 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
14/06/05 15:05:44 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar7543842044056940295/classes/plugins
14/06/05 15:05:44 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:05:44 INFO plugin.PluginRepository: 	the nutch core extension points (nutch-extensionpoints)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Tika Parser Plug-in (parse-tika)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Basic Indexing Filter (index-basic)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Html Parse Plug-in (parse-html)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Anchor Indexing Filter (index-anchor)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	HTTP Framework (lib-http)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Regex URL Filter (urlfilter-regex)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Http Protocol Plug-in (protocol-http)
14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:05:44 INFO plugin.PluginRepository: 	Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:05:44 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
14/06/05 15:05:44 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
14/06/05 15:05:45 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:05:49 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:05:52 INFO mapred.JobClient: Running job: job_201406051410_0016
14/06/05 15:05:53 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:06:29 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:06:32 INFO mapred.JobClient: Job complete: job_201406051410_0016
14/06/05 15:06:32 INFO mapred.JobClient: Counters: 17
14/06/05 15:06:32 INFO mapred.JobClient:   Job Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36879
14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:06:32 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:06:32 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:06:32 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:06:32 INFO mapred.JobClient:     HDFS_BYTES_READ=962
14/06/05 15:06:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78923
14/06/05 15:06:32 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:06:32 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:06:32 INFO mapred.JobClient:     Map input records=0
14/06/05 15:06:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=114335744
14/06/05 15:06:32 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:06:32 INFO mapred.JobClient:     CPU time spent (ms)=2670
14/06/05 15:06:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=60293120
14/06/05 15:06:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1990189056
14/06/05 15:06:32 INFO mapred.JobClient:     Map output records=0
14/06/05 15:06:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=962
14/06/05 15:06:32 INFO solr.SolrIndexerJob: SolrIndexerJob: done.
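
To make the pattern in the output above easier to see, here is a minimal Python sketch that pulls the "Map input records" and "Map output records" counters out of each job. It is not part of Nutch or Hadoop; the function name summarize and the regular expressions are assumptions based only on the log format shown in this message, and the sample contains lines copied from the injector and generator runs.

```python
import re

# Patterns written against the Hadoop 1.x JobClient lines shown above.
# They are assumptions about this log format, not an official parser.
JOB_RE = re.compile(r"Running job: (job_\d+_\d+)")
COUNTER_RE = re.compile(
    r"mapred\.JobClient:\s+(Map input records|Map output records)=(\d+)"
)


def summarize(log_text):
    """Return {job_id: {counter_name: value}} for the map record counters."""
    summary = {}
    current = None
    for line in log_text.splitlines():
        job = JOB_RE.search(line)
        if job:
            # A new job starts; later counters belong to it.
            current = job.group(1)
            summary[current] = {}
            continue
        counter = COUNTER_RE.search(line)
        if counter and current is not None:
            summary[current][counter.group(1)] = int(counter.group(2))
    return summary


# Lines copied from the injector (0011) and generator (0012) jobs above.
sample = """\
14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011
14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Map output records=0
"""

for job_id, counters in summarize(sample).items():
    print(job_id, counters)
```

Run over the whole log, this would show that only the injector job reads a record; the generator, fetcher, parser, updater and indexer jobs all report zero map input records.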

When I run readdb -stats, I get:

hduser@nutch-one-qontifi:/usr/local/nutch$ bin/nutch readdb TestCrawl -stats
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:13:19 INFO crawl.WebTableReader: WebTable statistics start
14/06/05 15:13:21 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:13:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:13:29 INFO mapred.JobClient: Running job: job_201406051410_0019
14/06/05 15:13:30 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:14:06 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:14:15 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:14:17 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:14:19 INFO mapred.JobClient: Job complete: job_201406051410_0019
14/06/05 15:14:19 INFO mapred.JobClient: Counters: 28
14/06/05 15:14:19 INFO mapred.JobClient:   Job Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Launched reduce tasks=1
14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36697
14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:14:19 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10302
14/06/05 15:14:19 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Written=86
14/06/05 15:14:19 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_READ=6
14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_READ=1135
14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=157112
14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=86
14/06/05 15:14:19 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:14:19 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:14:19 INFO mapred.JobClient:     Map output materialized bytes=6
14/06/05 15:14:19 INFO mapred.JobClient:     Map input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce shuffle bytes=6
14/06/05 15:14:19 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:14:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=216530944
14/06/05 15:14:19 INFO mapred.JobClient:     CPU time spent (ms)=2450
14/06/05 15:14:19 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1135
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:14:19 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Physical memory (bytes) snapshot=320630784
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2254024704
14/06/05 15:14:19 INFO mapred.JobClient:     Map output records=0
14/06/05 15:14:19 INFO crawl.WebTableReader: Statistics for WebTable: 
14/06/05 15:14:19 INFO crawl.WebTableReader: jobs:	{db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls:	0
14/06/05 15:14:19 INFO crawl.WebTableReader: WebTable statistics: done
14/06/05 15:14:19 INFO crawl.WebTableReader: jobs:	{db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls:	0

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

Re: Injector works. But generator and fetcher don't work.

Posted by Manikandan Saravanan <ma...@thesocialpeople.net>.
If you look at my first email in this thread, it says filtering: false and normalising: false. Even then, it didn’t generate anything.

Here’s my regex-urlfilter.txt file:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jp$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
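
As a rough illustration of the first-match rule described in the comments above, here is a minimal Python sketch. It assumes the filter tests each pattern with a "find"-style search rather than a full match. The suffix rule is left out because its text is truncated in the post, and the example.com URLs are hypothetical.

```python
import re

# Rules copied from the file above, in file order. The suffix rule is
# omitted because it is truncated in the post.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r".*(/[^/]+)/[^/]+\1/[^/]+\1/"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No pattern matched at all: the URL is ignored.
    return False


if __name__ == "__main__":
    # Hypothetical URLs, only to exercise each rule.
    for url in (
        "http://example.com/",
        "ftp://example.com/pub",
        "http://example.com/a/b/a/b/a/",
    ):
        print(url, accepts(url))
```

Under these assumptions an ordinary http URL falls through to the final "+." rule and is accepted, which would be consistent with the injector reporting 0 URLs rejected by filters.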

And here’s my regex-normalize.xml:

<?xml version="1.0"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
     This is intended so that users can specify substitutions to be
     done on URLs. The regex engine that is used is Perl5 compatible.
     The rules are applied to URLs in the order they occur in this file.  -->

<!-- WATCH OUT: an xml parser reads this file and ampersands must be
     expanded to &amp; -->

<!-- The following rules show how to strip out session IDs, default pages, 
     interpage anchors, etc. Order does matter!  -->
<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
</regex-normalize>
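
To see what the active rules above do to a URL, here is a minimal Python sketch that applies them in file order. It is an approximation, not the plugin's code: the XML entity "&amp;" is written as "&", the Java-style "$N" substitutions are rewritten as Python "\g<N>", the commented-out default-page rule is skipped, and the sample URL is hypothetical.

```python
import re

# (pattern, replacement) pairs transcribed from the active rules above.
RULES = [
    # remove session ids such as jsessionid and PHPSESSID
    (r"(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)", r"\g<4>"),
    # remove interpage anchors such as #location
    (r"#.*?(\?|&|$)", r"\g<1>"),
    # clean ?&var=value into ?var=value
    (r"\?&", "?"),
    # collapse repeated ampersands
    (r"&{2,}", "&"),
    # remove a trailing ?, & or .
    (r"[\?&\.]$", ""),
    # remove duplicate slashes, except the pair right after the scheme colon
    (r"(?<!:)/{2,}", "/"),
]


def normalize(url):
    """Apply every rule once, in order, and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = re.sub(pattern, replacement, url)
    return url


if __name__ == "__main__":
    sample = "http://example.com//a/index.html;jsessionid=ABC123?x=1&&y=2#top"
    print(normalize(sample))
```

None of these rules removes a URL outright; they only rewrite it, so on their own they would not be expected to empty a fetch list.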
-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 6 June 2014 at 1:54:02 am, Lewis John Mcgibbney (lewis.mcgibbney@gmail.com) wrote:

I suspect that your generator normalization/filtering prevents this URL from getting through

On Thu, Jun 5, 2014 at 1:09 PM, Manikandan Saravanan <ma...@thesocialpeople.net> wrote:

14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: true
14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: true

Map input records in Generator phase is 0... this is incorrect.

Lewis
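
For anyone comparing phases in a long log like this one, the "Map input records" counter can be pulled out per job with a few lines of Python. This is only a sketch that assumes the Hadoop 1.x JobClient line format seen in this thread; the function name is made up for illustration, and the sample lines are taken from the posted log.

```python
import re

# A counter line looks like:
#   14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
COUNTER = re.compile(r"INFO mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$")
# A job start line looks like:
#   14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
JOB = re.compile(r"Running job: (?P<job>job_\d+_\d+)")


def map_input_records(log_lines):
    """Map each job id to the 'Map input records' value logged for it."""
    result = {}
    current = None
    for line in log_lines:
        job = JOB.search(line)
        if job:
            current = job.group("job")
            continue
        counter = COUNTER.search(line)
        if counter and current and counter.group("name").strip() == "Map input records":
            result[current] = int(counter.group("value"))
    return result


SAMPLE = [
    "14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011",
    "14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1",
    "14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012",
    "14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0",
]

if __name__ == "__main__":
    for job_id, count in map_input_records(SAMPLE).items():
        print(job_id, count)
```

In the posted run, job 0011 is the injector and job 0012 is the generator, so the drop from 1 to 0 input records happens between those two phases.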

Re: Injector works. But generator and fetcher don't work.

Posted by Lewis John Mcgibbney <le...@gmail.com>.
It looks like the InjectorJob phase successfully injects your 1 URL into the
Cassandra keyspace.

On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <
manikandan@thesocialpeople.net> wrote:

>
> 14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
>
> ...

> 14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls rejected by filters: 0
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls injected after normalization and filtering: 1
> 14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05
> 15:01:02, elapsed: 00:00:28
>

So that looks fine. What I would advise you to do is read the dump after
injecting.


> Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2
>

What does this mean? Did you manually edit this? I have never seen this
logging before.


>
>
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
>
If the URL has already been fetched then a fetchmark will not exist for it
to be re-fetched. Can this perhaps be the case?

It seems that you have been tinkering with crawl cycles without
understanding and/or recognizing the crawl cycle itself. If you are just
starting out, I really advise you to use the nutch script with individual
commands. Reading the database dump is an essential step in a young crawl
cycle.
Lewis
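
One way to follow the advice above is to run the steps one at a time and read the database between them. The Python sketch below only builds and prints the commands unless dry_run is turned off. The sub-commands and flags (inject, generate, readdb with -crawlId, -topN, -dump, -stats) are assumptions based on the Nutch 2.x usage messages and should be checked against the usage text your own build prints. The helper names and the dump directory are hypothetical.

```python
import subprocess


def crawl_steps(seed_dir, crawl_id, dump_dir, top_n=50000):
    """Return the individual commands for the start of one crawl cycle."""
    nutch = "bin/nutch"
    return [
        # 1. inject the seed list
        [nutch, "inject", seed_dir, "-crawlId", crawl_id],
        # 2. dump the web table to confirm the row is visible to Nutch
        [nutch, "readdb", "-dump", dump_dir, "-crawlId", crawl_id],
        # 3. generate a fetch list
        [nutch, "generate", "-topN", str(top_n), "-crawlId", crawl_id],
        # 4. check the statistics again
        [nutch, "readdb", "-stats", "-crawlId", crawl_id],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_steps(crawl_steps("urls/seed.txt", "TestCrawl", "dump_after_inject"))
```

Stopping after each step makes it easier to see which phase first reports zero records.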

Re: Injector works. But generator and fetcher don't work.

Posted by Manikandan Saravanan <ma...@thesocialpeople.net>.
I built it from Nutch 2.2.1 (src-tar.gz).
-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 6 June 2014 at 1:03:18 am, Lewis John Mcgibbney (lewis.mcgibbney@gmail.com) wrote:

Which version of Nutch are you using?
Nutch 2 what?


On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <ma...@thesocialpeople.net> wrote:
Dear Lewis,

I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes). I’m using Cassandra as my backend datastore . I’m trying to crawl one link as of now. The inject command works properly: I’m able to find one row added to the “webpage” keyspace in Cassandra. But the generator doesn’t do a thing. So does the fetcher. In the end, nothing’s indexed in Solr.

Please help me out. My stack trace is:

hduser@nutch-one-qontifi:/usr/local/nutch$ bin/crawl urls/seed.txt TestCrawl http://10.130.231.16:8983/solr/nutch 2
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: starting at 2014-06-05 15:00:34
14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/06/05 15:00:36 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:00:40 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:00:41 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/06/05 15:00:44 INFO input.FileInputFormat: Total input paths to process : 1
14/06/05 15:00:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/06/05 15:00:44 WARN snappy.LoadSnappy: Snappy native library not loaded
14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011
14/06/05 15:00:45 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:01:00 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:01:02 INFO mapred.JobClient: Job complete: job_201406051410_0011
14/06/05 15:01:02 INFO mapred.JobClient: Counters: 19
14/06/05 15:01:02 INFO mapred.JobClient:   Job Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14861
14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:01:02 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:01:02 INFO mapred.JobClient:     Data-local map tasks=1
14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:01:02 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:01:02 INFO mapred.JobClient:   injector
14/06/05 15:01:02 INFO mapred.JobClient:     urls_injected=1
14/06/05 15:01:02 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:01:02 INFO mapred.JobClient:     HDFS_BYTES_READ=135
14/06/05 15:01:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=77648
14/06/05 15:01:02 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Read=25
14/06/05 15:01:02 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
14/06/05 15:01:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=122052608
14/06/05 15:01:02 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:01:02 INFO mapred.JobClient:     CPU time spent (ms)=1490
14/06/05 15:01:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=58195968
14/06/05 15:01:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1119281152
14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
14/06/05 15:01:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 1
14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05 15:01:02, elapsed: 00:00:28
Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2
Generating batchId
Generating a new fetchlist
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting at 2014-06-05 15:01:06
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: Selecting best-scoring urls due for fetch.
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: false
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false
14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000
14/06/05 15:01:06 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
14/06/05 15:01:07 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:01:11 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
14/06/05 15:01:16 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:01:55 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:02:05 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:02:08 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:02:10 INFO mapred.JobClient:  map 100% reduce 83%
14/06/05 15:02:11 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:02:14 INFO mapred.JobClient: Job complete: job_201406051410_0012
14/06/05 15:02:14 INFO mapred.JobClient: Counters: 27
14/06/05 15:02:14 INFO mapred.JobClient:   Job Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39990
14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:02:14 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=29119
14/06/05 15:02:14 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:02:14 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:02:14 INFO mapred.JobClient:     HDFS_BYTES_READ=951
14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=239453
14/06/05 15:02:14 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:02:14 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:02:14 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:02:14 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:02:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=333971456
14/06/05 15:02:14 INFO mapred.JobClient:     CPU time spent (ms)=9330
14/06/05 15:02:14 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=951
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:02:14 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Physical memory (bytes) snapshot=486813696
14/06/05 15:02:14 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:02:14 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6016212992
14/06/05 15:02:14 INFO mapred.JobClient:     Map output records=0
14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: finished at 2014-06-05 15:02:14, time elapsed: 00:01:08
14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: generated batch id: 1401994862-29963
Fetching : 
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: starting
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: batchId: 1401994862-29963
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: threads: 50
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: parsing: false
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: resuming: false
14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob : timelimit set for : 1402005738902
14/06/05 15:02:19 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar813633856909664022/classes/plugins
14/06/05 15:02:20 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:02:20 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:02:20 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:02:20 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:02:20 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:02:20 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
14/06/05 15:02:20 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
14/06/05 15:02:20 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
14/06/05 15:02:20 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
14/06/05 15:02:20 INFO plugin.PluginRepository: HTTP Framework (lib-http)
14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:02:20 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:02:20 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:02:20 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:02:20 INFO http.Http: http.proxy.host = null
14/06/05 15:02:20 INFO http.Http: http.proxy.port = 8080
14/06/05 15:02:20 INFO http.Http: http.timeout = 10000
14/06/05 15:02:20 INFO http.Http: http.content.limit = 65536
14/06/05 15:02:20 INFO http.Http: http.agent = Qontifi/Nutch-2.2.1 (A big data analytics and social media intelligence platform; http://qontifi.com; manikandan at thesocialpeople dot net)
14/06/05 15:02:20 INFO http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
14/06/05 15:02:20 INFO http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
14/06/05 15:02:20 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:02:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:02:29 INFO mapred.JobClient: Running job: job_201406051410_0013
14/06/05 15:02:30 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:03:05 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:03:14 INFO mapred.JobClient:  map 100% reduce 16%
14/06/05 15:03:16 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:03:17 INFO mapred.JobClient:  map 100% reduce 50%
14/06/05 15:03:19 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:03:23 INFO mapred.JobClient:  map 100% reduce 83%
14/06/05 15:03:28 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:03:31 INFO mapred.JobClient: Job complete: job_201406051410_0013
14/06/05 15:03:31 INFO mapred.JobClient: Counters: 28
14/06/05 15:03:31 INFO mapred.JobClient:   Job Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=37163
14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:03:31 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=39755
14/06/05 15:03:31 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:03:31 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:03:31 INFO mapred.JobClient:     HDFS_BYTES_READ=935
14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237923
14/06/05 15:03:31 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:03:31 INFO mapred.JobClient:   FetcherStatus
14/06/05 15:03:31 INFO mapred.JobClient:     HitByTimeLimit-QueueFeeder=0
14/06/05 15:03:31 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:03:31 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:03:31 INFO mapred.JobClient:     Map input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:03:31 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:03:31 INFO mapred.JobClient:     Total committed heap usage (bytes)=375914496
14/06/05 15:03:31 INFO mapred.JobClient:     CPU time spent (ms)=9820
14/06/05 15:03:31 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=935
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:03:31 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Physical memory (bytes) snapshot=510382080
14/06/05 15:03:31 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:03:31 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6060650496
14/06/05 15:03:31 INFO mapred.JobClient:     Map output records=0
14/06/05 15:03:31 INFO fetcher.FetcherJob: FetcherJob: done
Parsing : 
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: starting
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: resuming: false
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: forced reparse: false
14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: batchId: 1401994862-29963
14/06/05 15:03:35 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar8143815380567453850/classes/plugins
14/06/05 15:03:36 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:03:36 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:03:36 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:03:36 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:03:36 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:03:36 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
14/06/05 15:03:36 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
14/06/05 15:03:36 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
14/06/05 15:03:36 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
14/06/05 15:03:36 INFO plugin.PluginRepository: HTTP Framework (lib-http)
14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:03:36 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:03:36 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:03:36 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:03:36 INFO conf.Configuration: found resource parse-plugins.xml at file:/app/hadoop/tmp/hadoop-unjar8143815380567453850/parse-plugins.xml
14/06/05 15:03:36 INFO crawl.SignatureFactory: Using Signature impl: org.apache.nutch.crawl.MD5Signature
14/06/05 15:03:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:03:41 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:03:45 INFO mapred.JobClient: Running job: job_201406051410_0014
14/06/05 15:03:46 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:04:22 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:04:24 INFO mapred.JobClient: Job complete: job_201406051410_0014
14/06/05 15:04:25 INFO mapred.JobClient: Counters: 17
14/06/05 15:04:25 INFO mapred.JobClient:   Job Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36653
14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:04:25 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:04:25 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:04:25 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:04:25 INFO mapred.JobClient:     HDFS_BYTES_READ=979
14/06/05 15:04:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78853
14/06/05 15:04:25 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:04:25 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:04:25 INFO mapred.JobClient:     Map input records=0
14/06/05 15:04:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=129826816
14/06/05 15:04:25 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:04:25 INFO mapred.JobClient:     CPU time spent (ms)=2330
14/06/05 15:04:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=60817408
14/06/05 15:04:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2000629760
14/06/05 15:04:25 INFO mapred.JobClient:     Map output records=0
14/06/05 15:04:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=979
14/06/05 15:04:25 INFO parse.ParserJob: ParserJob: success
CrawlDB update for TestCrawl
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:04:28 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
14/06/05 15:04:29 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar4238316120015868426/classes/plugins
14/06/05 15:04:29 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:04:29 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:04:29 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:04:29 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:04:29 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:04:29 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
14/06/05 15:04:29 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
14/06/05 15:04:29 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
14/06/05 15:04:29 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
14/06/05 15:04:29 INFO plugin.PluginRepository: HTTP Framework (lib-http)
14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:04:29 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:04:29 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:04:29 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:04:30 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:04:34 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:04:38 INFO mapred.JobClient: Running job: job_201406051410_0015
14/06/05 15:04:39 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:05:21 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:05:31 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:05:34 INFO mapred.JobClient:  map 100% reduce 66%
14/06/05 15:05:37 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:05:39 INFO mapred.JobClient: Job complete: job_201406051410_0015
14/06/05 15:05:39 INFO mapred.JobClient: Counters: 27
14/06/05 15:05:39 INFO mapred.JobClient:   Job Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Launched reduce tasks=2
14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39898
14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:05:39 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=30439
14/06/05 15:05:39 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:05:39 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_READ=44
14/06/05 15:05:39 INFO mapred.JobClient:     HDFS_BYTES_READ=1028
14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237914
14/06/05 15:05:39 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:05:39 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:05:39 INFO mapred.JobClient:     Map output materialized bytes=28
14/06/05 15:05:39 INFO mapred.JobClient:     Map input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce shuffle bytes=28
14/06/05 15:05:39 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:05:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=375914496
14/06/05 15:05:39 INFO mapred.JobClient:     CPU time spent (ms)=8880
14/06/05 15:05:39 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1028
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:05:39 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=490651648
14/06/05 15:05:39 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:05:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6002880512
14/06/05 15:05:39 INFO mapred.JobClient:     Map output records=0
14/06/05 15:05:39 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
Indexing TestCrawl on SOLR index -> http://10.130.231.16:8983/solr/nutch
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:05:43 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
14/06/05 15:05:44 INFO plugin.PluginRepository: Plugins: looking in: /app/hadoop/tmp/hadoop-unjar7543842044056940295/classes/plugins
14/06/05 15:05:44 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Plugins:
14/06/05 15:05:44 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
14/06/05 15:05:44 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
14/06/05 15:05:44 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
14/06/05 15:05:44 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
14/06/05 15:05:44 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
14/06/05 15:05:44 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
14/06/05 15:05:44 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
14/06/05 15:05:44 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
14/06/05 15:05:44 INFO plugin.PluginRepository: HTTP Framework (lib-http)
14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
14/06/05 15:05:44 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
14/06/05 15:05:44 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Extension-Points:
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
14/06/05 15:05:44 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
14/06/05 15:05:44 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
14/06/05 15:05:44 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
14/06/05 15:05:45 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:05:49 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:05:52 INFO mapred.JobClient: Running job: job_201406051410_0016
14/06/05 15:05:53 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:06:29 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:06:32 INFO mapred.JobClient: Job complete: job_201406051410_0016
14/06/05 15:06:32 INFO mapred.JobClient: Counters: 17
14/06/05 15:06:32 INFO mapred.JobClient:   Job Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36879
14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:06:32 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/06/05 15:06:32 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Written=0
14/06/05 15:06:32 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:06:32 INFO mapred.JobClient:     HDFS_BYTES_READ=962
14/06/05 15:06:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78923
14/06/05 15:06:32 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:06:32 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:06:32 INFO mapred.JobClient:     Map input records=0
14/06/05 15:06:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=114335744
14/06/05 15:06:32 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:06:32 INFO mapred.JobClient:     CPU time spent (ms)=2670
14/06/05 15:06:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=60293120
14/06/05 15:06:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1990189056
14/06/05 15:06:32 INFO mapred.JobClient:     Map output records=0
14/06/05 15:06:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=962
14/06/05 15:06:32 INFO solr.SolrIndexerJob: SolrIndexerJob: done.

When I run readdb -stats, I get:

hduser@nutch-one-qontifi:/usr/local/nutch$ bin/nutch readdb TestCrawl -stats
Warning: $HADOOP_HOME is deprecated.

14/06/05 15:13:19 INFO crawl.WebTableReader: WebTable statistics start
14/06/05 15:13:21 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/06/05 15:13:25 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
14/06/05 15:13:29 INFO mapred.JobClient: Running job: job_201406051410_0019
14/06/05 15:13:30 INFO mapred.JobClient:  map 0% reduce 0%
14/06/05 15:14:06 INFO mapred.JobClient:  map 100% reduce 0%
14/06/05 15:14:15 INFO mapred.JobClient:  map 100% reduce 33%
14/06/05 15:14:17 INFO mapred.JobClient:  map 100% reduce 100%
14/06/05 15:14:19 INFO mapred.JobClient: Job complete: job_201406051410_0019
14/06/05 15:14:19 INFO mapred.JobClient: Counters: 28
14/06/05 15:14:19 INFO mapred.JobClient:   Job Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Launched reduce tasks=1
14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36697
14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/06/05 15:14:19 INFO mapred.JobClient:     Launched map tasks=1
14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10302
14/06/05 15:14:19 INFO mapred.JobClient:   File Output Format Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Written=86
14/06/05 15:14:19 INFO mapred.JobClient:   FileSystemCounters
14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_READ=6
14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_READ=1135
14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=157112
14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=86
14/06/05 15:14:19 INFO mapred.JobClient:   File Input Format Counters 
14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Read=0
14/06/05 15:14:19 INFO mapred.JobClient:   Map-Reduce Framework
14/06/05 15:14:19 INFO mapred.JobClient:     Map output materialized bytes=6
14/06/05 15:14:19 INFO mapred.JobClient:     Map input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce shuffle bytes=6
14/06/05 15:14:19 INFO mapred.JobClient:     Spilled Records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Map output bytes=0
14/06/05 15:14:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=216530944
14/06/05 15:14:19 INFO mapred.JobClient:     CPU time spent (ms)=2450
14/06/05 15:14:19 INFO mapred.JobClient:     Combine input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1135
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input groups=0
14/06/05 15:14:19 INFO mapred.JobClient:     Combine output records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Physical memory (bytes) snapshot=320630784
14/06/05 15:14:19 INFO mapred.JobClient:     Reduce output records=0
14/06/05 15:14:19 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2254024704
14/06/05 15:14:19 INFO mapred.JobClient:     Map output records=0
14/06/05 15:14:19 INFO crawl.WebTableReader: Statistics for WebTable: 
14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
14/06/05 15:14:19 INFO crawl.WebTableReader: WebTable statistics: done
14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697, FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0, TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450, SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135, FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format Counters ={BYTES_WRITTEN=86}}}}
14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
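
The pattern in the log above is easier to see if the "Map input records" counter is pulled out per job: the inject job reads 1 record, while every later job reads 0. A minimal sketch in Python, assuming the console output has been saved as text; the function name `map_input_by_job` and the `sample` list are illustrative and not part of Nutch or Hadoop, and the regular expressions only assume the Hadoop 1.x JobClient line format shown in this log:

```python
import re

# Patterns for the Hadoop 1.x JobClient lines as they appear in this log.
JOB_RE = re.compile(r"Running job: (job_\d+_\d+)")
MAP_IN_RE = re.compile(r"Map input records=(\d+)")


def map_input_by_job(lines):
    """Return {job_id: map_input_records}, in order of appearance."""
    result = {}
    current = None  # job id of the most recent "Running job" line
    for line in lines:
        m = JOB_RE.search(line)
        if m:
            current = m.group(1)
            continue
        m = MAP_IN_RE.search(line)
        if m and current is not None:
            result[current] = int(m.group(1))
    return result


# Four lines copied from the log in this message (inject job, then generate job).
sample = [
    "14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011",
    "14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1",
    "14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012",
    "14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0",
]

counts = map_input_by_job(sample)
for job_id, records in counts.items():
    print(job_id, records)
```

To run it over a whole saved log, pass an open file object instead of `sample`. A job that reports 0 map input records read nothing from the datastore, which is where the inject step and the later steps differ in this run.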

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople



--
Lewis

Re: Injector works. But generator and fetcher don't work.

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Which version of Nutch are you using?
Which Nutch 2.x release, exactly?


On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <manikandan@thesocialpeople.net> wrote:

> Dear Lewis,
>
> I’m running Nutch 2 on a Hadoop 1.2.1 cluster (2 nodes). I’m using
> Cassandra as my backend datastore . I’m trying to crawl one link as of now.
> The inject command works properly: I’m able to find one row added to the
> “webpage” keyspace in Cassandra. But the generator doesn’t do a thing. So
> does the fetcher. In the end, nothing’s indexed in Solr.
>
> Please help me out. My stack trace is:
>
> hduser@nutch-one-qontifi:/usr/local/nutch$ bin/crawl urls/seed.txt
> TestCrawl http://10.130.231.16:8983/solr/nutch 2
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: starting at
> 2014-06-05 15:00:34
> 14/06/05 15:00:34 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir:
> urls/seed.txt
> 14/06/05 15:00:36 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:00:40 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:00:41 INFO crawl.InjectorJob: InjectorJob: Using class
> org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
> 14/06/05 15:00:44 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/06/05 15:00:44 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 14/06/05 15:00:44 WARN snappy.LoadSnappy: Snappy native library not loaded
> 14/06/05 15:00:44 INFO mapred.JobClient: Running job: job_201406051410_0011
> 14/06/05 15:00:45 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:01:00 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:01:02 INFO mapred.JobClient: Job complete:
> job_201406051410_0011
> 14/06/05 15:01:02 INFO mapred.JobClient: Counters: 19
> 14/06/05 15:01:02 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14861
> 14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:01:02 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:01:02 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     Data-local map tasks=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/06/05 15:01:02 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:01:02 INFO mapred.JobClient:   injector
> 14/06/05 15:01:02 INFO mapred.JobClient:     urls_injected=1
> 14/06/05 15:01:02 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:01:02 INFO mapred.JobClient:     HDFS_BYTES_READ=135
> 14/06/05 15:01:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=77648
> 14/06/05 15:01:02 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:01:02 INFO mapred.JobClient:     Bytes Read=25
> 14/06/05 15:01:02 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=122052608
> 14/06/05 15:01:02 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:01:02 INFO mapred.JobClient:     CPU time spent (ms)=1490
> 14/06/05 15:01:02 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=58195968
> 14/06/05 15:01:02 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=1119281152
> 14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls rejected by filters: 0
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls injected after normalization and filtering: 1
> 14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05
> 15:01:02, elapsed: 00:00:28
> Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2
> Generating batchId
> Generating a new fetchlist
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting at
> 2014-06-05 15:01:06
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: Selecting
> best-scoring urls due for fetch.
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: starting
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: false
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: false
> 14/06/05 15:01:06 INFO crawl.GeneratorJob: GeneratorJob: topN: 50000
> 14/06/05 15:01:06 INFO crawl.FetchScheduleFactory: Using FetchSchedule
> impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 14/06/05 15:01:06 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 14/06/05 15:01:07 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:01:11 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:01:15 INFO mapred.JobClient: Running job: job_201406051410_0012
> 14/06/05 15:01:16 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:01:55 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:02:05 INFO mapred.JobClient:  map 100% reduce 33%
> 14/06/05 15:02:08 INFO mapred.JobClient:  map 100% reduce 66%
> 14/06/05 15:02:10 INFO mapred.JobClient:  map 100% reduce 83%
> 14/06/05 15:02:11 INFO mapred.JobClient:  map 100% reduce 100%
> 14/06/05 15:02:14 INFO mapred.JobClient: Job complete:
> job_201406051410_0012
> 14/06/05 15:02:14 INFO mapred.JobClient: Counters: 27
> 14/06/05 15:02:14 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:02:14 INFO mapred.JobClient:     Launched reduce tasks=2
> 14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39990
> 14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:02:14 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=29119
> 14/06/05 15:02:14 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:02:14 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_READ=44
> 14/06/05 15:02:14 INFO mapred.JobClient:     HDFS_BYTES_READ=951
> 14/06/05 15:02:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=239453
> 14/06/05 15:02:14 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:02:14 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:02:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map output materialized
> bytes=28
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Reduce shuffle bytes=28
> 14/06/05 15:02:14 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map output bytes=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=333971456
> 14/06/05 15:02:14 INFO mapred.JobClient:     CPU time spent (ms)=9330
> 14/06/05 15:02:14 INFO mapred.JobClient:     Combine input records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=951
> 14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Combine output records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=486813696
> 14/06/05 15:02:14 INFO mapred.JobClient:     Reduce output records=0
> 14/06/05 15:02:14 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=6016212992
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: finished at
> 2014-06-05 15:02:14, time elapsed: 00:01:08
> 14/06/05 15:02:14 INFO crawl.GeneratorJob: GeneratorJob: generated batch
> id: 1401994862-29963
> Fetching :
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: starting
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: batchId:
> 1401994862-29963
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: threads: 50
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: parsing: false
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob: resuming: false
> 14/06/05 15:02:18 INFO fetcher.FetcherJob: FetcherJob : timelimit set for
> : 1402005738902
> 14/06/05 15:02:19 INFO plugin.PluginRepository: Plugins: looking in:
> /app/hadoop/tmp/hadoop-unjar813633856909664022/classes/plugins
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered Plugins:
> 14/06/05 15:02:20 INFO plugin.PluginRepository: the nutch core extension
> points (nutch-extensionpoints)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: CyberNeko HTML Parser
> (lib-nekohtml)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Basic Indexing Filter
> (index-basic)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Anchor Indexing Filter
> (index-anchor)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter
> (urlfilter-regex)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Pass-through URL
> Normalizer (urlnormalizer-pass)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Http Protocol Plug-in
> (protocol-http)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 14/06/05 15:02:20 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 14/06/05 15:02:20 INFO http.Http: http.proxy.host = null
> 14/06/05 15:02:20 INFO http.Http: http.proxy.port = 8080
> 14/06/05 15:02:20 INFO http.Http: http.timeout = 10000
> 14/06/05 15:02:20 INFO http.Http: http.content.limit = 65536
> 14/06/05 15:02:20 INFO http.Http: http.agent = Qontifi/Nutch-2.2.1 (A big
> data analytics and social media intelligence platform; http://qontifi.com;
> manikandan at thesocialpeople dot net)
> 14/06/05 15:02:20 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 14/06/05 15:02:20 INFO http.Http: http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 14/06/05 15:02:20 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:02:25 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:02:29 INFO mapred.JobClient: Running job: job_201406051410_0013
> 14/06/05 15:02:30 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:03:05 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:03:14 INFO mapred.JobClient:  map 100% reduce 16%
> 14/06/05 15:03:16 INFO mapred.JobClient:  map 100% reduce 33%
> 14/06/05 15:03:17 INFO mapred.JobClient:  map 100% reduce 50%
> 14/06/05 15:03:19 INFO mapred.JobClient:  map 100% reduce 66%
> 14/06/05 15:03:23 INFO mapred.JobClient:  map 100% reduce 83%
> 14/06/05 15:03:28 INFO mapred.JobClient:  map 100% reduce 100%
> 14/06/05 15:03:31 INFO mapred.JobClient: Job complete:
> job_201406051410_0013
> 14/06/05 15:03:31 INFO mapred.JobClient: Counters: 28
> 14/06/05 15:03:31 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:03:31 INFO mapred.JobClient:     Launched reduce tasks=2
> 14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=37163
> 14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:03:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=39755
> 14/06/05 15:03:31 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:03:31 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_READ=44
> 14/06/05 15:03:31 INFO mapred.JobClient:     HDFS_BYTES_READ=935
> 14/06/05 15:03:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237923
> 14/06/05 15:03:31 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:03:31 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:03:31 INFO mapred.JobClient:   FetcherStatus
> 14/06/05 15:03:31 INFO mapred.JobClient:     HitByTimeLimit-QueueFeeder=0
> 14/06/05 15:03:31 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:03:31 INFO mapred.JobClient:     Map output materialized
> bytes=28
> 14/06/05 15:03:31 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Reduce shuffle bytes=28
> 14/06/05 15:03:31 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Map output bytes=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=375914496
> 14/06/05 15:03:31 INFO mapred.JobClient:     CPU time spent (ms)=9820
> 14/06/05 15:03:31 INFO mapred.JobClient:     Combine input records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     SPLIT_RAW_BYTES=935
> 14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Reduce input groups=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Combine output records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=510382080
> 14/06/05 15:03:31 INFO mapred.JobClient:     Reduce output records=0
> 14/06/05 15:03:31 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=6060650496
> 14/06/05 15:03:31 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:03:31 INFO fetcher.FetcherJob: FetcherJob: done
> Parsing :
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: starting
> 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: resuming: false
> 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: forced reparse: false
> 14/06/05 15:03:34 INFO parse.ParserJob: ParserJob: batchId: 1401994862-29963
> 14/06/05 15:03:35 INFO plugin.PluginRepository: Plugins: looking in:
> /app/hadoop/tmp/hadoop-unjar8143815380567453850/classes/plugins
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered Plugins:
> 14/06/05 15:03:36 INFO plugin.PluginRepository: the nutch core extension
> points (nutch-extensionpoints)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: CyberNeko HTML Parser
> (lib-nekohtml)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Basic Indexing Filter
> (index-basic)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Anchor Indexing Filter
> (index-anchor)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter
> (urlfilter-regex)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Pass-through URL
> Normalizer (urlnormalizer-pass)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Http Protocol Plug-in
> (protocol-http)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 14/06/05 15:03:36 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 14/06/05 15:03:36 INFO conf.Configuration: found resource
> parse-plugins.xml at
> file:/app/hadoop/tmp/hadoop-unjar8143815380567453850/parse-plugins.xml
> 14/06/05 15:03:36 INFO crawl.SignatureFactory: Using Signature impl:
> org.apache.nutch.crawl.MD5Signature
> 14/06/05 15:03:37 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:03:41 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:03:45 INFO mapred.JobClient: Running job: job_201406051410_0014
> 14/06/05 15:03:46 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:04:22 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:04:24 INFO mapred.JobClient: Job complete: job_201406051410_0014
> 14/06/05 15:04:25 INFO mapred.JobClient: Counters: 17
> 14/06/05 15:04:25 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36653
> 14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:04:25 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:04:25 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:04:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/06/05 15:04:25 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:04:25 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:04:25 INFO mapred.JobClient:     HDFS_BYTES_READ=979
> 14/06/05 15:04:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78853
> 14/06/05 15:04:25 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:04:25 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:04:25 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:04:25 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:04:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=129826816
> 14/06/05 15:04:25 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:04:25 INFO mapred.JobClient:     CPU time spent (ms)=2330
> 14/06/05 15:04:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=60817408
> 14/06/05 15:04:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2000629760
> 14/06/05 15:04:25 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:04:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=979
> 14/06/05 15:04:25 INFO parse.ParserJob: ParserJob: success
> CrawlDB update for TestCrawl
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:04:28 INFO crawl.DbUpdaterJob: DbUpdaterJob: starting
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugins: looking in:
> /app/hadoop/tmp/hadoop-unjar4238316120015868426/classes/plugins
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered Plugins:
> 14/06/05 15:04:29 INFO plugin.PluginRepository: the nutch core extension
> points (nutch-extensionpoints)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: CyberNeko HTML Parser
> (lib-nekohtml)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Basic Indexing Filter
> (index-basic)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Anchor Indexing Filter
> (index-anchor)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter
> (urlfilter-regex)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Pass-through URL
> Normalizer (urlnormalizer-pass)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Http Protocol Plug-in
> (protocol-http)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 14/06/05 15:04:29 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 14/06/05 15:04:30 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:04:34 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:04:38 INFO mapred.JobClient: Running job: job_201406051410_0015
> 14/06/05 15:04:39 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:05:21 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:05:31 INFO mapred.JobClient:  map 100% reduce 33%
> 14/06/05 15:05:34 INFO mapred.JobClient:  map 100% reduce 66%
> 14/06/05 15:05:37 INFO mapred.JobClient:  map 100% reduce 100%
> 14/06/05 15:05:39 INFO mapred.JobClient: Job complete: job_201406051410_0015
> 14/06/05 15:05:39 INFO mapred.JobClient: Counters: 27
> 14/06/05 15:05:39 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:05:39 INFO mapred.JobClient:     Launched reduce tasks=2
> 14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=39898
> 14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:05:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=30439
> 14/06/05 15:05:39 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:05:39 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_READ=44
> 14/06/05 15:05:39 INFO mapred.JobClient:     HDFS_BYTES_READ=1028
> 14/06/05 15:05:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=237914
> 14/06/05 15:05:39 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:05:39 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:05:39 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:05:39 INFO mapred.JobClient:     Map output materialized bytes=28
> 14/06/05 15:05:39 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Reduce shuffle bytes=28
> 14/06/05 15:05:39 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Map output bytes=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Total committed heap usage (bytes)=375914496
> 14/06/05 15:05:39 INFO mapred.JobClient:     CPU time spent (ms)=8880
> 14/06/05 15:05:39 INFO mapred.JobClient:     Combine input records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1028
> 14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Reduce input groups=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Combine output records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Physical memory (bytes) snapshot=490651648
> 14/06/05 15:05:39 INFO mapred.JobClient:     Reduce output records=0
> 14/06/05 15:05:39 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=6002880512
> 14/06/05 15:05:39 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:05:39 INFO crawl.DbUpdaterJob: DbUpdaterJob: done
> Indexing TestCrawl on SOLR index -> http://10.130.231.16:8983/solr/nutch
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:05:43 INFO solr.SolrIndexerJob: SolrIndexerJob: starting
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugins: looking in:
> /app/hadoop/tmp/hadoop-unjar7543842044056940295/classes/plugins
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered Plugins:
> 14/06/05 15:05:44 INFO plugin.PluginRepository: the nutch core extension
> points (nutch-extensionpoints)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Normalizer
> (urlnormalizer-regex)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: CyberNeko HTML Parser
> (lib-nekohtml)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> (scoring-opic)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic URL Normalizer
> (urlnormalizer-basic)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Basic Indexing Filter
> (index-basic)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Html Parse Plug-in
> (parse-html)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Anchor Indexing Filter
> (index-anchor)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: HTTP Framework (lib-http)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter
> (urlfilter-regex)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Pass-through URL
> Normalizer (urlnormalizer-pass)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Http Protocol Plug-in
> (protocol-http)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Normalizer
> (org.apache.nutch.net.URLNormalizer)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Parse Filter
> (org.apache.nutch.parse.ParseFilter)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Indexing Filter
> (org.apache.nutch.indexer.IndexingFilter)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Content Parser
> (org.apache.nutch.parse.Parser)
> 14/06/05 15:05:44 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 14/06/05 15:05:44 INFO basic.BasicIndexingFilter: Maximum title length for
> indexing set to: 100
> 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 14/06/05 15:05:44 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> is: off
> 14/06/05 15:05:44 INFO indexer.IndexingFilters: Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 14/06/05 15:05:45 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:05:49 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:05:52 INFO mapred.JobClient: Running job: job_201406051410_0016
> 14/06/05 15:05:53 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:06:29 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:06:32 INFO mapred.JobClient: Job complete: job_201406051410_0016
> 14/06/05 15:06:32 INFO mapred.JobClient: Counters: 17
> 14/06/05 15:06:32 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36879
> 14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:06:32 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:06:32 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:06:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/06/05 15:06:32 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Written=0
> 14/06/05 15:06:32 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:06:32 INFO mapred.JobClient:     HDFS_BYTES_READ=962
> 14/06/05 15:06:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78923
> 14/06/05 15:06:32 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:06:32 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:06:32 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:06:32 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:06:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=114335744
> 14/06/05 15:06:32 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:06:32 INFO mapred.JobClient:     CPU time spent (ms)=2670
> 14/06/05 15:06:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=60293120
> 14/06/05 15:06:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1990189056
> 14/06/05 15:06:32 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:06:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=962
> 14/06/05 15:06:32 INFO solr.SolrIndexerJob: SolrIndexerJob: done.
>
> When I run readdb -stats, I get:
>
> hduser@nutch-one-qontifi:/usr/local/nutch$ bin/nutch readdb TestCrawl -stats
> Warning: $HADOOP_HOME is deprecated.
>
> 14/06/05 15:13:19 INFO crawl.WebTableReader: WebTable statistics start
> 14/06/05 15:13:21 INFO connection.CassandraHostRetryService: Downed Host
> Retry service started with queue size -1 and retry delay 10s
> 14/06/05 15:13:25 INFO service.JmxMonitor: Registering JMX
> me.prettyprint.cassandra.service_Qontifi:ServiceType=hector,MonitorType=hector
> 14/06/05 15:13:29 INFO mapred.JobClient: Running job: job_201406051410_0019
> 14/06/05 15:13:30 INFO mapred.JobClient:  map 0% reduce 0%
> 14/06/05 15:14:06 INFO mapred.JobClient:  map 100% reduce 0%
> 14/06/05 15:14:15 INFO mapred.JobClient:  map 100% reduce 33%
> 14/06/05 15:14:17 INFO mapred.JobClient:  map 100% reduce 100%
> 14/06/05 15:14:19 INFO mapred.JobClient: Job complete: job_201406051410_0019
> 14/06/05 15:14:19 INFO mapred.JobClient: Counters: 28
> 14/06/05 15:14:19 INFO mapred.JobClient:   Job Counters
> 14/06/05 15:14:19 INFO mapred.JobClient:     Launched reduce tasks=1
> 14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36697
> 14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Launched map tasks=1
> 14/06/05 15:14:19 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10302
> 14/06/05 15:14:19 INFO mapred.JobClient:   File Output Format Counters
> 14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Written=86
> 14/06/05 15:14:19 INFO mapred.JobClient:   FileSystemCounters
> 14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_READ=6
> 14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_READ=1135
> 14/06/05 15:14:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=157112
> 14/06/05 15:14:19 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=86
> 14/06/05 15:14:19 INFO mapred.JobClient:   File Input Format Counters
> 14/06/05 15:14:19 INFO mapred.JobClient:     Bytes Read=0
> 14/06/05 15:14:19 INFO mapred.JobClient:   Map-Reduce Framework
> 14/06/05 15:14:19 INFO mapred.JobClient:     Map output materialized bytes=6
> 14/06/05 15:14:19 INFO mapred.JobClient:     Map input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Reduce shuffle bytes=6
> 14/06/05 15:14:19 INFO mapred.JobClient:     Spilled Records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Map output bytes=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Total committed heap usage (bytes)=216530944
> 14/06/05 15:14:19 INFO mapred.JobClient:     CPU time spent (ms)=2450
> 14/06/05 15:14:19 INFO mapred.JobClient:     Combine input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1135
> 14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Reduce input groups=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Combine output records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Physical memory (bytes) snapshot=320630784
> 14/06/05 15:14:19 INFO mapred.JobClient:     Reduce output records=0
> 14/06/05 15:14:19 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2254024704
> 14/06/05 15:14:19 INFO mapred.JobClient:     Map output records=0
> 14/06/05 15:14:19 INFO crawl.WebTableReader: Statistics for WebTable:
> 14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job
> Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697,
> FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
> TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450,
> SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0,
> VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0},
> FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135,
> FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format
> Counters ={BYTES_WRITTEN=86}}}}
> 14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
> 14/06/05 15:14:19 INFO crawl.WebTableReader: WebTable statistics: done
> 14/06/05 15:14:19 INFO crawl.WebTableReader: jobs: {db_stats-job_201406051410_0019={jobID=job_201406051410_0019,
> jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Job
> Counters ={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=36697,
> FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
> TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=10302}, Map-Reduce
> Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0,
> REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0,
> COMMITTED_HEAP_BYTES=216530944, CPU_MILLISECONDS=2450,
> SPLIT_RAW_BYTES=1135, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
> REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
> PHYSICAL_MEMORY_BYTES=320630784, REDUCE_OUTPUT_RECORDS=0,
> VIRTUAL_MEMORY_BYTES=2254024704, MAP_OUTPUT_RECORDS=0},
> FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1135,
> FILE_BYTES_WRITTEN=157112, HDFS_BYTES_WRITTEN=86}, File Output Format
> Counters ={BYTES_WRITTEN=86}}}}
> 14/06/05 15:14:19 INFO crawl.WebTableReader: TOTAL urls: 0
>
> --
> Manikandan Saravanan
> Architect - Technology
> TheSocialPeople <http://thesocialpeople.net>
>



-- 
*Lewis*