Posted to user@nutch.apache.org by Yossi Tamari <yo...@pipl.com> on 2017/04/30 14:04:09 UTC

Wrong FS exception in Fetcher

Hi,

 

I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode.

Running the command:

deploy/bin/crawl urls crawl 2

The Injector and Generator run successfully, but in the Fetcher I get the
following error:

17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

 

Error running:

  /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
crawl/segments/20170430084337 -noParsing -threads 50

Failed with exit value 255.

 

 

Any ideas how to fix this?

 

Thanks,

               Yossi.


RE: Wrong FS exception in Fetcher

Posted by Yossi Tamari <yo...@pipl.com>.
Hi,

 

Setting the MapReduce framework to YARN solved this issue.
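
For anyone hitting the same symptom: the change amounts to telling the Hadoop client to submit jobs to YARN instead of running them in the local job runner. A minimal sketch of the relevant mapred-site.xml entry (the property name is standard Hadoop; where the file lives depends on your installation):

    <!-- $HADOOP_CONF_DIR/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>

Left at the default ("local"), the job submission hands the output format a local file-system handle while the segment path is on HDFS, which is presumably what produces the "Wrong FS ... expected: file:///" message above.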

 

            Yossi.

 

RE: Wrong FS exception in Fetcher

Posted by Yossi Tamari <yo...@pipl.com>.
Hi,

Issue created: https://issues.apache.org/jira/browse/NUTCH-2383.

Thanks,
Yossi.


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: 02 May 2017 16:08
To: user@nutch.apache.org
Subject: Re: Wrong FS exception in Fetcher

Hi Yossi,

> that 1.13 requires Hadoop 2.7.2 specifically.

That's not a hard requirement. Usually you have to use the Hadoop version of the cluster you are
running on. Mostly this causes no problems, but when problems do occur, matching the versions is
a good first thing to try.

Thanks for the detailed log. All steps are called the same way. The method
checkOutputSpecs(FileSystem, JobConf) is first called in the Fetcher.
It probably needs debugging to find out why a local file system is assumed for the
output path here.
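
For what it's worth, a minimal sketch of the pattern behind such an exception (hypothetical code, not the actual FetcherOutputFormat source): a FileSystem handle for file:/// is asked about an hdfs:// path and rejects it in checkPath(); resolving the FileSystem from the path itself avoids the mismatch.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class WrongFsSketch {

      // 'fs' is a local file system handle (e.g. the one handed over by the
      // local job runner), while the configured output path lives on HDFS.
      static void failingCheck(FileSystem fs, JobConf job) throws IOException {
        Path out = FileOutputFormat.getOutputPath(job); // hdfs://localhost:9000/user/root/crawl/segments/...
        // Asking a file:/// FileSystem about an hdfs:// path throws
        // IllegalArgumentException: Wrong FS ... expected: file:///
        fs.exists(new Path(out, "crawl_fetch"));
      }

      // Resolving the FileSystem from the path returns the HDFS client
      // for hdfs:// paths, so the same check succeeds.
      static void workingCheck(JobConf job) throws IOException {
        Path out = FileOutputFormat.getOutputPath(job);
        FileSystem outFs = out.getFileSystem(job);
        outFs.exists(new Path(out, "crawl_fetch"));
      }
    }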

Please open an issue on
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 05/02/2017 01:21 PM, Yossi Tamari wrote:
> Thanks Sebastian,
> 
> The output with set -x is below. I'm new to Nutch and was not aware that 1.13 requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be a good idea to document it on the download page and provide a download link (since the Hadoop releases page contains 2.7.3 but not 2.7.2). I will try to install 2.7.2 and retest tomorrow.
> 
> root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
> Injecting seed URLs
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 3 = 0 ']'
> + COMMAND=inject
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' inject = crawl ']'
> + '[' inject = inject ']'
> + CLASS=org.apache.nutch.crawl.Injector
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Injector crawl/crawldb urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
> 17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local307378419_0001
> 17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar333276722181778867/classes/plugins
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Plugins:
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar333276722181778867/regex-urlfilter.txt
> 17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: map
> 17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000000_0' done.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
> 17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
> 17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000001_0 is done. And is in the process of committing
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
> 17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000001_0' done.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_r_000000_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@504b0ec4
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:26 INFO reduce.EventFetcher: attempt_local307378419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000001_0 decomp: 58 len: 62 to MEMORY
> 17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->58
> 17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000000_0 decomp: 58 len: 62 to MEMORY
> 17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 2, commitMemory -> 58, usedMemory ->116
> 17/05/02 06:00:26 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:26 INFO mapred.Merger: Merging 2 sorted segments
> 17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 62 bytes
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merged 2 segments, 116 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 1 files, 118 bytes from disk
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:26 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 87 bytes
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 17/05/02 06:00:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> 17/05/02 06:00:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
> 17/05/02 06:00:27 INFO crawl.Injector: Injector: overwrite: false
> 17/05/02 06:00:27 INFO crawl.Injector: Injector: update: false
> 17/05/02 06:00:27 INFO mapreduce.Job: Job job_local307378419_0001 running in uber mode : false
> 17/05/02 06:00:27 INFO mapreduce.Job:  map 100% reduce 0%
> 17/05/02 06:00:27 INFO mapred.Task: Task:attempt_local307378419_0001_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:27 INFO mapred.Task: Task attempt_local307378419_0001_r_000000_0 is allowed to commit now
> 17/05/02 06:00:27 INFO output.FileOutputCommitter: Saved output of task 'attempt_local307378419_0001_r_000000_0' to hdfs://localhost:9000/user/root/crawl/crawldb/crawldb-921346783/_temporary/0/task_local307378419_0001_r_000000
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:27 INFO mapred.Task: Task 'attempt_local307378419_0001_r_000000_0' done.
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_r_000000_0
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:28 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:28 INFO mapreduce.Job: Job job_local307378419_0001 completed successfully
> 17/05/02 06:00:28 INFO mapreduce.Job: Counters: 37
>         File System Counters
>                 FILE: Number of bytes read=652298479
>                 FILE: Number of bytes written=658557993
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=492
>                 HDFS: Number of bytes written=365
>                 HDFS: Number of read operations=46
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=13
>         Map-Reduce Framework
>                 Map input records=2
>                 Map output records=2
>                 Map output bytes=108
>                 Map output materialized bytes=124
>                 Input split bytes=570
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=124
>                 Reduce input records=2
>                 Reduce output records=1
>                 Spilled Records=4
>                 Shuffled Maps =2
>                 Failed Shuffles=0
>                 Merged Map outputs=2
>                 GC time elapsed (ms)=15
>                 Total committed heap usage (bytes)=1044381696
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         injector
>                 urls_injected=1
>                 urls_merged=1
>         File Input Format Counters
>                 Bytes Read=0
>         File Output Format Counters
>                 Bytes Written=365
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 1
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 1
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total new urls injected: 0
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: finished at 2017-05-02 06:00:28, elapsed: 00:00:04
> Tue May 2 06:00:28 CDT 2017 : Iteration 1 of 2
> Generating a new segment
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 18 = 0 ']'
> + COMMAND=generate
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' generate = crawl ']'
> + '[' generate = inject ']'
> + '[' generate = generate ']'
> + CLASS=org.apache.nutch.crawl.Generator
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Generator -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: starting at 2017-05-02 06:00:32
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: filtering: false
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: normalizing: true
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: topN: 50000
> 17/05/02 06:00:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:33 INFO mapred.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:33 INFO mapreduce.JobSubmitter: number of splits:1
> 17/05/02 06:00:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1706016672_0001
> 17/05/02 06:00:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:33 INFO mapreduce.Job: Running job: job_local1706016672_0001
> 17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
> 17/05/02 06:00:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:34 INFO mapred.MapTask: numReduceTasks: 2
> 17/05/02 06:00:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:34 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar7886623985863993949/classes/plugins
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Plugins:
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Extension-Points:
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar7886623985863993949/regex-urlfilter.txt
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'partition', using default
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:34 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:34 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufend = 83; bufvoid = 104857600
> 17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> 17/05/02 06:00:34 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_m_000000_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000000_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2fd7e5ad
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
> 17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 87 len: 83 to MEMORY
> 17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output for attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 87, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->87
> 17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 87 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 91 bytes from disk
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'generate_host_count', using default
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO mapred.Task: Task attempt_local1706016672_0001_r_000000_0 is allowed to commit now
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1706016672_0001_r_000000_0' to hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/_temporary/0/task_local1706016672_0001_r_000000
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000000_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000000_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000001_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@29cfa49
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000001_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#2 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 2 len: 14 to MEMORY
> 17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
> 17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 22 bytes from disk
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000001_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000001_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000001_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 running in uber mode : false
> 17/05/02 06:00:34 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 completed successfully
> 17/05/02 06:00:35 INFO mapreduce.Job: Counters: 35
>         File System Counters
>                 FILE: Number of bytes read=652296139
>                 FILE: Number of bytes written=658571046
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=444
>                 HDFS: Number of bytes written=398
>                 HDFS: Number of read operations=37
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=13
>         Map-Reduce Framework
>                 Map input records=1
>                 Map output records=1
>                 Map output bytes=83
>                 Map output materialized bytes=97
>                 Input split bytes=123
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=97
>                 Reduce input records=1
>                 Reduce output records=1
>                 Spilled Records=2
>                 Shuffled Maps =2
>                 Failed Shuffles=0
>                 Merged Map outputs=2
>                 GC time elapsed (ms)=8
>                 Total committed heap usage (bytes)=1036517376
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=148
>         File Output Format Counters
>                 Bytes Written=199
> 17/05/02 06:00:35 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
> 17/05/02 06:00:36 INFO crawl.Generator: Generator: segment: crawl/segments/20170502060036
> 17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:36 INFO mapred.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:36 INFO mapreduce.JobSubmitter: number of splits:1
> 17/05/02 06:00:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1332900929_0002
> 17/05/02 06:00:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:36 INFO mapreduce.Job: Running job: job_local1332900929_0002
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
> 17/05/02 06:00:36 INFO mapred.MapTask: numReduceTasks: 1
> 17/05/02 06:00:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:36 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:36 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:36 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:36 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufend = 104; bufvoid = 104857600
> 17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:36 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
> 17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_m_000000_0' done.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_r_000000_0
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:36 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@57dcd1f6
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:36 INFO reduce.EventFetcher: attempt_local1332900929_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:36 INFO reduce.LocalFetcher: localfetcher#3 about to shuffle output of map attempt_local1332900929_0002_m_000000_0 decomp: 108 len: 82 to MEMORY
> 17/05/02 06:00:36 INFO reduce.InMemoryMapOutput: Read 108 bytes from map-output for attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 108, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->108
> 17/05/02 06:00:36 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merged 1 segments, 108 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 1 files, 90 bytes from disk
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO mapred.Task: Task attempt_local1332900929_0002_r_000000_0 is allowed to commit now
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1332900929_0002_r_000000_0' to hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_generate/_temporary/0/task_local1332900929_0002_r_000000
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_r_000000_0' done.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_r_000000_0
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 running in uber mode : false
> 17/05/02 06:00:37 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 completed successfully
> 17/05/02 06:00:37 INFO mapreduce.Job: Counters: 35
>         File System Counters
>                 FILE: Number of bytes read=869728356
>                 FILE: Number of bytes written=878093356
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=694
>                 HDFS: Number of bytes written=567
>                 HDFS: Number of read operations=53
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=18
>         Map-Reduce Framework
>                 Map input records=1
>                 Map output records=1
>                 Map output bytes=104
>                 Map output materialized bytes=82
>                 Input split bytes=157
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=82
>                 Reduce input records=1
>                 Reduce output records=1
>                 Spilled Records=2
>                 Shuffled Maps =1
>                 Failed Shuffles=0
>                 Merged Map outputs=1
>                 GC time elapsed (ms)=0
>                 Total committed heap usage (bytes)=901775360
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=199
>         File Output Format Counters
>                 Bytes Written=169
> 17/05/02 06:00:37 INFO crawl.Generator: Generator: finished at 2017-05-02 06:00:37, elapsed: 00:00:05
> Operating on segment : 20170502060036
> Fetching : 20170502060036
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 17 = 0 ']'
> + COMMAND=fetch
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' fetch = crawl ']'
> + '[' fetch = inject ']'
> + '[' fetch = generate ']'
> + '[' fetch = freegen ']'
> + '[' fetch = fetch ']'
> + CLASS=org.apache.nutch.fetcher.Fetcher
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.fetcher.Fetcher -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: starting at 2017-05-02 06:00:43
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: segment: crawl/segments/20170502060036
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1493733643194
> 17/05/02 06:00:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:44 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch, expected: file:///
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
>         at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
> 
> Error running:
>   /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> Failed with exit value 255.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
> Sent: 02 May 2017 13:54
> To: user@nutch.apache.org
> Subject: Re: Wrong FS exception in Fetcher
> 
> Hi Yossi,
> 
> strange error, indeed. Is it also reproducible in pseudo-distributed mode using Hadoop 2.7.2,
> the version Nutch depends on?
> 
> Could you also add the line
>   set -x
> to bin/nutch and run bin/crawl again to see how all steps are executed.
> 
> Thanks,
> Sebastian
> 
> On 04/30/2017 04:04 PM, Yossi Tamari wrote:
>> [...]
> 



Re: Wrong FS exception in Fetcher

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Yossi,

> that 1.13 requires Hadoop 2.7.2 specifically.

That's not a hard requirement. Usually you have to use the Hadoop version of the cluster you are
running on. In most cases a version mismatch causes no problems, but when problems do show up,
matching the versions is a good first thing to try.

Thanks for the detailed log. All steps are called the same way. The method
checkOutputSpecs(FileSystem, JobConf) is first called in the Fetcher.
It probably needs debugging to find out why a local file system is assumed
for the output path here.
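
Just to illustrate the suspected pattern (only a sketch, not the actual Nutch source; the class and
method names below are made up): if the check asks the default file system whether the output
exists, instead of the file system the output path belongs to, you get exactly this exception
whenever the path is on HDFS but the client-side fs.defaultFS resolves to file:///.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OutputSpecCheckSketch {
    public static void check(Configuration conf, Path segment) throws IOException {
      Path out = new Path(segment, "crawl_fetch");

      // Fails with "Wrong FS ... expected: file:///" when 'out' is an hdfs:// path
      // but the default file system is the local one:
      FileSystem defaultFs = FileSystem.get(conf);
      // defaultFs.exists(out);

      // Works independently of the default file system: let the path itself
      // select the matching file system.
      FileSystem fs = out.getFileSystem(conf);
      if (fs.exists(out)) {
        throw new IOException("Output directory " + out + " already exists.");
      }
    }
  }

Whether FetcherOutputFormat really follows the first pattern is what the debugging has to show.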

Please open an issue on
  https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 05/02/2017 01:21 PM, Yossi Tamari wrote:
> Thanks Sebastian,
> 
> The output with set -x is below. I'm new to Nutch and was not aware that 1.13 requires Hadoop 2.7.2 specifically. While I see it now in pom.xml, it may be a good idea to document it on the download page and provide a download link (since the Hadoop releases page contains 2.7.3 but not 2.7.2). I will try to install 2.7.2 and retest tomorrow.
> 
> root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
> Injecting seed URLs
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 3 = 0 ']'
> + COMMAND=inject
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' inject = crawl ']'
> + '[' inject = inject ']'
> + CLASS=org.apache.nutch.crawl.Injector
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Injector crawl/crawldb urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
> 17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
> 17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
> 17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local307378419_0001
> 17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar333276722181778867/classes/plugins
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Plugins:
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
> 17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar333276722181778867/regex-urlfilter.txt
> 17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: map
> 17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000000_0' done.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
> 17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
> 17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
> 17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000001_0 is done. And is in the process of committing
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
> 17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000001_0' done.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_r_000000_0
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@504b0ec4
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:26 INFO reduce.EventFetcher: attempt_local307378419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000001_0 decomp: 58 len: 62 to MEMORY
> 17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000001_0
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->58
> 17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000000_0 decomp: 58 len: 62 to MEMORY
> 17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000000_0
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 2, commitMemory -> 58, usedMemory ->116
> 17/05/02 06:00:26 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:26 INFO mapred.Merger: Merging 2 sorted segments
> 17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 62 bytes
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merged 2 segments, 116 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 1 files, 118 bytes from disk
> 17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:26 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 87 bytes
> 17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 17/05/02 06:00:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> 17/05/02 06:00:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
> 17/05/02 06:00:27 INFO crawl.Injector: Injector: overwrite: false
> 17/05/02 06:00:27 INFO crawl.Injector: Injector: update: false
> 17/05/02 06:00:27 INFO mapreduce.Job: Job job_local307378419_0001 running in uber mode : false
> 17/05/02 06:00:27 INFO mapreduce.Job:  map 100% reduce 0%
> 17/05/02 06:00:27 INFO mapred.Task: Task:attempt_local307378419_0001_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
> 17/05/02 06:00:27 INFO mapred.Task: Task attempt_local307378419_0001_r_000000_0 is allowed to commit now
> 17/05/02 06:00:27 INFO output.FileOutputCommitter: Saved output of task 'attempt_local307378419_0001_r_000000_0' to hdfs://localhost:9000/user/root/crawl/crawldb/crawldb-921346783/_temporary/0/task_local307378419_0001_r_000000
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:27 INFO mapred.Task: Task 'attempt_local307378419_0001_r_000000_0' done.
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_r_000000_0
> 17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:28 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:28 INFO mapreduce.Job: Job job_local307378419_0001 completed successfully
> 17/05/02 06:00:28 INFO mapreduce.Job: Counters: 37
>         File System Counters
>                 FILE: Number of bytes read=652298479
>                 FILE: Number of bytes written=658557993
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=492
>                 HDFS: Number of bytes written=365
>                 HDFS: Number of read operations=46
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=13
>         Map-Reduce Framework
>                 Map input records=2
>                 Map output records=2
>                 Map output bytes=108
>                 Map output materialized bytes=124
>                 Input split bytes=570
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=124
>                 Reduce input records=2
>                 Reduce output records=1
>                 Spilled Records=4
>                 Shuffled Maps =2
>                 Failed Shuffles=0
>                 Merged Map outputs=2
>                 GC time elapsed (ms)=15
>                 Total committed heap usage (bytes)=1044381696
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         injector
>                 urls_injected=1
>                 urls_merged=1
>         File Input Format Counters
>                 Bytes Read=0
>         File Output Format Counters
>                 Bytes Written=365
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 1
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 1
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: Total new urls injected: 0
> 17/05/02 06:00:28 INFO crawl.Injector: Injector: finished at 2017-05-02 06:00:28, elapsed: 00:00:04
> Tue May 2 06:00:28 CDT 2017 : Iteration 1 of 2
> Generating a new segment
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 18 = 0 ']'
> + COMMAND=generate
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' generate = crawl ']'
> + '[' generate = inject ']'
> + '[' generate = generate ']'
> + CLASS=org.apache.nutch.crawl.Generator
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Generator -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: starting at 2017-05-02 06:00:32
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: filtering: false
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: normalizing: true
> 17/05/02 06:00:32 INFO crawl.Generator: Generator: topN: 50000
> 17/05/02 06:00:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:33 INFO mapred.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:33 INFO mapreduce.JobSubmitter: number of splits:1
> 17/05/02 06:00:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1706016672_0001
> 17/05/02 06:00:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:33 INFO mapreduce.Job: Running job: job_local1706016672_0001
> 17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
> 17/05/02 06:00:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:34 INFO mapred.MapTask: numReduceTasks: 2
> 17/05/02 06:00:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:34 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar7886623985863993949/classes/plugins
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Plugins:
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
> 17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Extension-Points:
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar7886623985863993949/regex-urlfilter.txt
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'partition', using default
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:34 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:34 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufend = 83; bufvoid = 104857600
> 17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> 17/05/02 06:00:34 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_m_000000_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000000_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2fd7e5ad
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
> 17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 87 len: 83 to MEMORY
> 17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output for attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 87, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->87
> 17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 87 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 91 bytes from disk
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'generate_host_count', using default
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO mapred.Task: Task attempt_local1706016672_0001_r_000000_0 is allowed to commit now
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1706016672_0001_r_000000_0' to hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/_temporary/0/task_local1706016672_0001_r_000000
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000000_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000000_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000001_0
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@29cfa49
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000001_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#2 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 2 len: 14 to MEMORY
> 17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local1706016672_0001_m_000000_0
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
> 17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 22 bytes from disk
> 17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
> 17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
> 17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
> 17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000001_0 is done. And is in the process of committing
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000001_0' done.
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000001_0
> 17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 running in uber mode : false
> 17/05/02 06:00:34 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 completed successfully
> 17/05/02 06:00:35 INFO mapreduce.Job: Counters: 35
>         File System Counters
>                 FILE: Number of bytes read=652296139
>                 FILE: Number of bytes written=658571046
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=444
>                 HDFS: Number of bytes written=398
>                 HDFS: Number of read operations=37
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=13
>         Map-Reduce Framework
>                 Map input records=1
>                 Map output records=1
>                 Map output bytes=83
>                 Map output materialized bytes=97
>                 Input split bytes=123
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=97
>                 Reduce input records=1
>                 Reduce output records=1
>                 Spilled Records=2
>                 Shuffled Maps =2
>                 Failed Shuffles=0
>                 Merged Map outputs=2
>                 GC time elapsed (ms)=8
>                 Total committed heap usage (bytes)=1036517376
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=148
>         File Output Format Counters
>                 Bytes Written=199
> 17/05/02 06:00:35 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
> 17/05/02 06:00:36 INFO crawl.Generator: Generator: segment: crawl/segments/20170502060036
> 17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:36 INFO mapred.FileInputFormat: Total input files to process : 1
> 17/05/02 06:00:36 INFO mapreduce.JobSubmitter: number of splits:1
> 17/05/02 06:00:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1332900929_0002
> 17/05/02 06:00:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
> 17/05/02 06:00:36 INFO mapreduce.Job: Running job: job_local1332900929_0002
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for map tasks
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
> 17/05/02 06:00:36 INFO mapred.MapTask: numReduceTasks: 1
> 17/05/02 06:00:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
> 17/05/02 06:00:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
> 17/05/02 06:00:36 INFO mapred.MapTask: soft limit at 83886080
> 17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
> 17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
> 17/05/02 06:00:36 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner:
> 17/05/02 06:00:36 INFO mapred.MapTask: Starting flush of map output
> 17/05/02 06:00:36 INFO mapred.MapTask: Spilling map output
> 17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufend = 104; bufvoid = 104857600
> 17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
> 17/05/02 06:00:36 INFO mapred.MapTask: Finished spill 0
> 17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_m_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
> 17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_m_000000_0' done.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: map task executor complete.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for reduce tasks
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_r_000000_0
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
> 17/05/02 06:00:36 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@57dcd1f6
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
> 17/05/02 06:00:36 INFO reduce.EventFetcher: attempt_local1332900929_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
> 17/05/02 06:00:36 INFO reduce.LocalFetcher: localfetcher#3 about to shuffle output of map attempt_local1332900929_0002_m_000000_0 decomp: 108 len: 82 to MEMORY
> 17/05/02 06:00:36 INFO reduce.InMemoryMapOutput: Read 108 bytes from map-output for attempt_local1332900929_0002_m_000000_0
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 108, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->108
> 17/05/02 06:00:36 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
> 17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merged 1 segments, 108 bytes to disk to satisfy reduce memory limit
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 1 files, 90 bytes from disk
> 17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
> 17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
> 17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_r_000000_0 is done. And is in the process of committing
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
> 17/05/02 06:00:36 INFO mapred.Task: Task attempt_local1332900929_0002_r_000000_0 is allowed to commit now
> 17/05/02 06:00:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1332900929_0002_r_000000_0' to hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_generate/_temporary/0/task_local1332900929_0002_r_000000
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce > reduce
> 17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_r_000000_0' done.
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_r_000000_0
> 17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce task executor complete.
> 17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 running in uber mode : false
> 17/05/02 06:00:37 INFO mapreduce.Job:  map 100% reduce 100%
> 17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 completed successfully
> 17/05/02 06:00:37 INFO mapreduce.Job: Counters: 35
>         File System Counters
>                 FILE: Number of bytes read=869728356
>                 FILE: Number of bytes written=878093356
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>                 HDFS: Number of bytes read=694
>                 HDFS: Number of bytes written=567
>                 HDFS: Number of read operations=53
>                 HDFS: Number of large read operations=0
>                 HDFS: Number of write operations=18
>         Map-Reduce Framework
>                 Map input records=1
>                 Map output records=1
>                 Map output bytes=104
>                 Map output materialized bytes=82
>                 Input split bytes=157
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=1
>                 Reduce shuffle bytes=82
>                 Reduce input records=1
>                 Reduce output records=1
>                 Spilled Records=2
>                 Shuffled Maps =1
>                 Failed Shuffles=0
>                 Merged Map outputs=1
>                 GC time elapsed (ms)=0
>                 Total committed heap usage (bytes)=901775360
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=199
>         File Output Format Counters
>                 Bytes Written=169
> 17/05/02 06:00:37 INFO crawl.Generator: Generator: finished at 2017-05-02 06:00:37, elapsed: 00:00:05
> Operating on segment : 20170502060036
> Fetching : 20170502060036
> /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> + cygwin=false
> + case "`uname`" in
> ++ uname
> + THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
> + '[' 17 = 0 ']'
> + COMMAND=fetch
> + shift
> ++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
> + THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
> ++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
> ++ pwd
> + NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
> + '[' '' '!=' '' ']'
> + '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
> + local=true
> + '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
> + local=false
> + for f in '"$NUTCH_HOME"/*nutch*.job'
> + NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
> + false
> + JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
> + JAVA_HEAP_MAX=-Xmx1000m
> + '[' '' '!=' '' ']'
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
> + CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
> + IFS=
> + false
> + false
> + JAVA_LIBRARY_PATH=
> + '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
> + '[' false = true -a X '!=' X ']'
> + unset IFS
> + '[' '' = '' ']'
> + NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
> + '[' '' = '' ']'
> + NUTCH_LOGFILE=hadoop.log
> + false
> + NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
> + NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
> + '[' x '!=' x ']'
> + '[' fetch = crawl ']'
> + '[' fetch = inject ']'
> + '[' fetch = generate ']'
> + '[' fetch = freegen ']'
> + '[' fetch = fetch ']'
> + CLASS=org.apache.nutch.fetcher.Fetcher
> + EXEC_CALL=(hadoop jar "$NUTCH_JOB")
> + false
> ++ which hadoop
> ++ wc -l
> + '[' 1 -eq 0 ']'
> + exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.fetcher.Fetcher -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: starting at 2017-05-02 06:00:43
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: segment: crawl/segments/20170502060036
> 17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1493733643194
> 17/05/02 06:00:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
> 17/05/02 06:00:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 17/05/02 06:00:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
> 17/05/02 06:00:44 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch, expected: file:///
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
>         at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
> 
> Error running:
>   /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
> Failed with exit value 255.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
> Sent: 02 May 2017 13:54
> To: user@nutch.apache.org
> Subject: Re: Wrong FS exception in Fetcher
> 
> Hi Yossi,
> 
> strange error, indeed. Is it also reproducible in pseudo-distributed mode using Hadoop 2.7.2,
> the version Nutch depends on?
> 
> Could you also add the line
>   set -x
> to bin/nutch and run bin/crawl again to see how all steps are executed.
> 
> Thanks,
> Sebastian
> 
> On 04/30/2017 04:04 PM, Yossi Tamari wrote:
>> Hi,
>>
>>  
>>
>> I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed mode.
>>
>> Running the command:
>>
>> Deploy/bin/crawl urls crawl 2
>>
>> The Injector and Generator run successfully, but in the Fetcher I get the
>> following error:
>>
>> 17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
>>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
>>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
>>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
>>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
>>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
>>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
>>         at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
>>         at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
>>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
>>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
>>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>         at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
>>
>> Error running:
>>   /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170430084337 -noParsing -threads 50
>> Failed with exit value 255.
>>
>>  
>>
>>  
>>
>> Any ideas how to fix this?
>>
>>  
>>
>> Thanks,
>>
>>                Yossi.
>>
>>
> 
> 


RE: Wrong FS exception in Fetcher

Posted by Yossi Tamari <yo...@pipl.com>.
Thanks Sebastian,

The output with set -x is below. I'm new to Nutch and was not aware that 1.13 requires Hadoop 2.7.2 specifically. I can see it now in pom.xml, but it may be a good idea to document it on the download page and provide a download link (the Hadoop releases page offers 2.7.3 but not 2.7.2). I will try to install 2.7.2 and retest tomorrow.
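
For context on the trace itself: a Hadoop FileSystem instance only accepts paths in its own scheme, so a local FileSystem asked about an hdfs:// path fails in checkPath() exactly like this. Below is only a minimal, illustrative sketch of that API behaviour (it is not Nutch's FetcherOutputFormat code; the class name and segment path are made up, and it assumes the Hadoop client jars on the classpath plus a reachable namenode):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical demo only, not Nutch code.
public class WrongFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // fs.defaultFS falls back to file:/// if core-site.xml is not on the classpath
    Path segment = new Path("hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch");

    FileSystem defaultFs = FileSystem.get(conf);        // the *default* filesystem, i.e. the local one here
    // defaultFs.exists(segment);                       // -> IllegalArgumentException: Wrong FS ... expected: file:///

    FileSystem segmentFs = segment.getFileSystem(conf); // filesystem chosen from the path's own hdfs:// scheme
    System.out.println("segment exists: " + segmentFs.exists(segment));
  }
}

So if the output-spec check ends up resolving the segment path against the default filesystem while that default is (or is seen as) file:///, this is exactly the exception above; whether that is what happens inside FetcherOutputFormat.checkOutputSpecs on Hadoop 2.8.0 I can't say.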

root@crawler001:/data/apache-nutch-1.13/runtime/deploy/bin# ./crawl urls crawl 2
Injecting seed URLs
/data/apache-nutch-1.13/runtime/deploy/bin/nutch inject crawl/crawldb urls
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 3 = 0 ']'
+ COMMAND=inject
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' inject = crawl ']'
+ '[' inject = inject ']'
+ CLASS=org.apache.nutch.crawl.Injector
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Injector crawl/crawldb urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: starting at 2017-05-02 06:00:24
17/05/02 06:00:24 INFO crawl.Injector: Injector: crawlDb: crawl/crawldb
17/05/02 06:00:24 INFO crawl.Injector: Injector: urlDir: urls
17/05/02 06:00:24 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
17/05/02 06:00:25 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/05/02 06:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO input.FileInputFormat: Total input files to process : 1
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: number of splits:2
17/05/02 06:00:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local307378419_0001
17/05/02 06:00:26 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/05/02 06:00:26 INFO mapreduce.Job: Running job: job_local307378419_0001
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:26 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar333276722181778867/classes/plugins
17/05/02 06:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
17/05/02 06:00:26 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
17/05/02 06:00:26 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
17/05/02 06:00:26 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
17/05/02 06:00:26 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
17/05/02 06:00:26 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
17/05/02 06:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:26 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar333276722181778867/regex-urlfilter.txt
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000000_0 is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map
17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000000_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:26 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:26 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:26 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:26 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar333276722181778867/regex-normalize.xml
17/05/02 06:00:26 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
17/05/02 06:00:26 INFO mapred.LocalJobRunner:
17/05/02 06:00:26 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:26 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:26 INFO mapred.MapTask: bufstart = 0; bufend = 54; bufvoid = 104857600
17/05/02 06:00:26 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
17/05/02 06:00:26 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:26 INFO mapred.Task: Task:attempt_local307378419_0001_m_000001_0 is done. And is in the process of committing
17/05/02 06:00:26 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/urls/seed.txt:0+24
17/05/02 06:00:26 INFO mapred.Task: Task 'attempt_local307378419_0001_m_000001_0' done.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:26 INFO mapred.LocalJobRunner: Starting task: attempt_local307378419_0001_r_000000_0
17/05/02 06:00:26 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:26 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:26 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:26 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@504b0ec4
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:26 INFO reduce.EventFetcher: attempt_local307378419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000001_0 decomp: 58 len: 62 to MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000001_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->58
17/05/02 06:00:26 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local307378419_0001_m_000000_0 decomp: 58 len: 62 to MEMORY
17/05/02 06:00:26 INFO reduce.InMemoryMapOutput: Read 58 bytes from map-output for attempt_local307378419_0001_m_000000_0
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 58, inMemoryMapOutputs.size() -> 2, commitMemory -> 58, usedMemory ->116
17/05/02 06:00:26 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:26 INFO mapred.Merger: Merging 2 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 62 bytes
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merged 2 segments, 116 bytes to disk to satisfy reduce memory limit
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 1 files, 118 bytes from disk
17/05/02 06:00:26 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/05/02 06:00:26 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:26 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 87 bytes
17/05/02 06:00:26 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/05/02 06:00:27 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
17/05/02 06:00:27 INFO crawl.Injector: Injector: overwrite: false
17/05/02 06:00:27 INFO crawl.Injector: Injector: update: false
17/05/02 06:00:27 INFO mapreduce.Job: Job job_local307378419_0001 running in uber mode : false
17/05/02 06:00:27 INFO mapreduce.Job:  map 100% reduce 0%
17/05/02 06:00:27 INFO mapred.Task: Task:attempt_local307378419_0001_r_000000_0 is done. And is in the process of committing
17/05/02 06:00:27 INFO mapred.LocalJobRunner: 2 / 2 copied.
17/05/02 06:00:27 INFO mapred.Task: Task attempt_local307378419_0001_r_000000_0 is allowed to commit now
17/05/02 06:00:27 INFO output.FileOutputCommitter: Saved output of task 'attempt_local307378419_0001_r_000000_0' to hdfs://localhost:9000/user/root/crawl/crawldb/crawldb-921346783/_temporary/0/task_local307378419_0001_r_000000
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:27 INFO mapred.Task: Task 'attempt_local307378419_0001_r_000000_0' done.
17/05/02 06:00:27 INFO mapred.LocalJobRunner: Finishing task: attempt_local307378419_0001_r_000000_0
17/05/02 06:00:27 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:28 INFO mapreduce.Job:  map 100% reduce 100%
17/05/02 06:00:28 INFO mapreduce.Job: Job job_local307378419_0001 completed successfully
17/05/02 06:00:28 INFO mapreduce.Job: Counters: 37
        File System Counters
                FILE: Number of bytes read=652298479
                FILE: Number of bytes written=658557993
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=492
                HDFS: Number of bytes written=365
                HDFS: Number of read operations=46
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=13
        Map-Reduce Framework
                Map input records=2
                Map output records=2
                Map output bytes=108
                Map output materialized bytes=124
                Input split bytes=570
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=124
                Reduce input records=2
                Reduce output records=1
                Spilled Records=4
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=15
                Total committed heap usage (bytes)=1044381696
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        injector
                urls_injected=1
                urls_merged=1
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=365
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls rejected by filters: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected after normalization and filtering: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total urls injected but already in CrawlDb: 1
17/05/02 06:00:28 INFO crawl.Injector: Injector: Total new urls injected: 0
17/05/02 06:00:28 INFO crawl.Injector: Injector: finished at 2017-05-02 06:00:28, elapsed: 00:00:04
Tue May 2 06:00:28 CDT 2017 : Iteration 1 of 2
Generating a new segment
/data/apache-nutch-1.13/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 18 = 0 ']'
+ COMMAND=generate
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' generate = crawl ']'
+ '[' generate = inject ']'
+ '[' generate = generate ']'
+ CLASS=org.apache.nutch.crawl.Generator
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.crawl.Generator -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true crawl/crawldb crawl/segments -topN 50000 -numFetchers 1 -noFilter
17/05/02 06:00:32 INFO crawl.Generator: Generator: starting at 2017-05-02 06:00:32
17/05/02 06:00:32 INFO crawl.Generator: Generator: Selecting best-scoring urls due for fetch.
17/05/02 06:00:32 INFO crawl.Generator: Generator: filtering: false
17/05/02 06:00:32 INFO crawl.Generator: Generator: normalizing: true
17/05/02 06:00:32 INFO crawl.Generator: Generator: topN: 50000
17/05/02 06:00:32 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/05/02 06:00:32 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/05/02 06:00:32 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:33 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1706016672_0001
17/05/02 06:00:33 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:33 INFO mapreduce.Job: Running job: job_local1706016672_0001
17/05/02 06:00:33 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:33 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:33 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.MapTask: numReduceTasks: 2
17/05/02 06:00:34 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:34 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:34 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:34 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-unjar7886623985863993949/classes/plugins
17/05/02 06:00:34 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Plugins:
17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
17/05/02 06:00:34 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
17/05/02 06:00:34 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Anchor Indexing Filter (index-anchor)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Tika Parser Plug-in (parse-tika)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
17/05/02 06:00:34 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
17/05/02 06:00:34 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
17/05/02 06:00:34 INFO plugin.PluginRepository:         ElasticIndexWriter (indexer-elastic)
17/05/02 06:00:34 INFO plugin.PluginRepository: Registered Extension-Points:
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Content Parser (org.apache.nutch.parse.Parser)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Filter (org.apache.nutch.net.URLFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Protocol (org.apache.nutch.protocol.Protocol)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
17/05/02 06:00:34 INFO plugin.PluginRepository:         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar7886623985863993949/regex-urlfilter.txt
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'partition', using default
17/05/02 06:00:34 INFO mapred.LocalJobRunner:
17/05/02 06:00:34 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:34 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:34 INFO mapred.MapTask: bufstart = 0; bufend = 83; bufvoid = 104857600
17/05/02 06:00:34 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
17/05/02 06:00:34 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new compressor [.deflate]
17/05/02 06:00:34 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_m_000000_0 is done. And is in the process of committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/crawl/crawldb/current/part-r-00000/data:0+148
17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_m_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@2fd7e5ad
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
17/05/02 06:00:34 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 87 len: 83 to MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 87 bytes from map-output for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 87, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->87
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 87 bytes to disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 91 bytes from disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO regex.RegexURLNormalizer: can't find rules for scope 'generate_host_count', using default
17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000000_0 is done. And is in the process of committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO mapred.Task: Task attempt_local1706016672_0001_r_000000_0 is allowed to commit now
17/05/02 06:00:34 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1706016672_0001_r_000000_0' to hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/_temporary/0/task_local1706016672_0001_r_000000
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000000_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000000_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Starting task: attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:34 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:34 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:34 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@29cfa49
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:34 INFO reduce.EventFetcher: attempt_local1706016672_0001_r_000001_0 Thread started: EventFetcher for fetching Map Completion Events
17/05/02 06:00:34 INFO reduce.LocalFetcher: localfetcher#2 about to shuffle output of map attempt_local1706016672_0001_m_000000_0 decomp: 2 len: 14 to MEMORY
17/05/02 06:00:34 INFO reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_local1706016672_0001_m_000000_0
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->2
17/05/02 06:00:34 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merged 1 segments, 2 bytes to disk to satisfy reduce memory limit
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 1 files, 22 bytes from disk
17/05/02 06:00:34 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/05/02 06:00:34 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:34 INFO mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
17/05/02 06:00:34 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:34 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-unjar7886623985863993949/regex-normalize.xml
17/05/02 06:00:34 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: defaultInterval=2592000
17/05/02 06:00:34 INFO crawl.AbstractFetchSchedule: maxInterval=7776000
17/05/02 06:00:34 INFO mapred.Task: Task:attempt_local1706016672_0001_r_000001_0 is done. And is in the process of committing
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:34 INFO mapred.Task: Task 'attempt_local1706016672_0001_r_000001_0' done.
17/05/02 06:00:34 INFO mapred.LocalJobRunner: Finishing task: attempt_local1706016672_0001_r_000001_0
17/05/02 06:00:34 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 running in uber mode : false
17/05/02 06:00:34 INFO mapreduce.Job:  map 100% reduce 100%
17/05/02 06:00:34 INFO mapreduce.Job: Job job_local1706016672_0001 completed successfully
17/05/02 06:00:35 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=652296139
                FILE: Number of bytes written=658571046
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=444
                HDFS: Number of bytes written=398
                HDFS: Number of read operations=37
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=13
        Map-Reduce Framework
                Map input records=1
                Map output records=1
                Map output bytes=83
                Map output materialized bytes=97
                Input split bytes=123
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=97
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=8
                Total committed heap usage (bytes)=1036517376
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=148
        File Output Format Counters
                Bytes Written=199
17/05/02 06:00:35 INFO crawl.Generator: Generator: Partitioning selected urls for politeness.
17/05/02 06:00:36 INFO crawl.Generator: Generator: segment: crawl/segments/20170502060036
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:36 INFO mapred.FileInputFormat: Total input files to process : 1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: number of splits:1
17/05/02 06:00:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1332900929_0002
17/05/02 06:00:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/05/02 06:00:36 INFO mapreduce.Job: Running job: job_local1332900929_0002
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/05/02 06:00:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for map tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.MapTask: numReduceTasks: 1
17/05/02 06:00:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
17/05/02 06:00:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
17/05/02 06:00:36 INFO mapred.MapTask: soft limit at 83886080
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
17/05/02 06:00:36 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
17/05/02 06:00:36 INFO mapred.LocalJobRunner:
17/05/02 06:00:36 INFO mapred.MapTask: Starting flush of map output
17/05/02 06:00:36 INFO mapred.MapTask: Spilling map output
17/05/02 06:00:36 INFO mapred.MapTask: bufstart = 0; bufend = 104; bufvoid = 104857600
17/05/02 06:00:36 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
17/05/02 06:00:36 INFO mapred.MapTask: Finished spill 0
17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_m_000000_0 is done. And is in the process of committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner: hdfs://localhost:9000/user/root/generate-temp-ca817ccd-332b-4fa3-afe3-dab7d80ea711/fetchlist-1/part-00000:0+199
17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_m_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: map task executor complete.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Waiting for reduce tasks
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
17/05/02 06:00:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/05/02 06:00:36 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
17/05/02 06:00:36 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@57dcd1f6
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
17/05/02 06:00:36 INFO reduce.EventFetcher: attempt_local1332900929_0002_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
17/05/02 06:00:36 INFO reduce.LocalFetcher: localfetcher#3 about to shuffle output of map attempt_local1332900929_0002_m_000000_0 decomp: 108 len: 82 to MEMORY
17/05/02 06:00:36 INFO reduce.InMemoryMapOutput: Read 108 bytes from map-output for attempt_local1332900929_0002_m_000000_0
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 108, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->108
17/05/02 06:00:36 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merged 1 segments, 108 bytes to disk to satisfy reduce memory limit
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 1 files, 90 bytes from disk
17/05/02 06:00:36 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
17/05/02 06:00:36 INFO mapred.Merger: Merging 1 sorted segments
17/05/02 06:00:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 81 bytes
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task: Task:attempt_local1332900929_0002_r_000000_0 is done. And is in the process of committing
17/05/02 06:00:36 INFO mapred.LocalJobRunner: 1 / 1 copied.
17/05/02 06:00:36 INFO mapred.Task: Task attempt_local1332900929_0002_r_000000_0 is allowed to commit now
17/05/02 06:00:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1332900929_0002_r_000000_0' to hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_generate/_temporary/0/task_local1332900929_0002_r_000000
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce > reduce
17/05/02 06:00:36 INFO mapred.Task: Task 'attempt_local1332900929_0002_r_000000_0' done.
17/05/02 06:00:36 INFO mapred.LocalJobRunner: Finishing task: attempt_local1332900929_0002_r_000000_0
17/05/02 06:00:36 INFO mapred.LocalJobRunner: reduce task executor complete.
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 running in uber mode : false
17/05/02 06:00:37 INFO mapreduce.Job:  map 100% reduce 100%
17/05/02 06:00:37 INFO mapreduce.Job: Job job_local1332900929_0002 completed successfully
17/05/02 06:00:37 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=869728356
                FILE: Number of bytes written=878093356
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=694
                HDFS: Number of bytes written=567
                HDFS: Number of read operations=53
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=18
        Map-Reduce Framework
                Map input records=1
                Map output records=1
                Map output bytes=104
                Map output materialized bytes=82
                Input split bytes=157
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=82
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=0
                Total committed heap usage (bytes)=901775360
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=199
        File Output Format Counters
                Bytes Written=169
17/05/02 06:00:37 INFO crawl.Generator: Generator: finished at 2017-05-02 06:00:37, elapsed: 00:00:05
Operating on segment : 20170502060036
Fetching : 20170502060036
/data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
+ cygwin=false
+ case "`uname`" in
++ uname
+ THIS=/data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ '[' -h /data/apache-nutch-1.13/runtime/deploy/bin/nutch ']'
+ '[' 17 = 0 ']'
+ COMMAND=fetch
+ shift
++ dirname /data/apache-nutch-1.13/runtime/deploy/bin/nutch
+ THIS_DIR=/data/apache-nutch-1.13/runtime/deploy/bin
++ cd /data/apache-nutch-1.13/runtime/deploy/bin/..
++ pwd
+ NUTCH_HOME=/data/apache-nutch-1.13/runtime/deploy
+ '[' '' '!=' '' ']'
+ '[' /usr/lib/jvm/java-8-oracle/jre/ = '' ']'
+ local=true
+ '[' -f /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job ']'
+ local=false
+ for f in '"$NUTCH_HOME"/*nutch*.job'
+ NUTCH_JOB=/data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job
+ false
+ JAVA=/usr/lib/jvm/java-8-oracle/jre//bin/java
+ JAVA_HEAP_MAX=-Xmx1000m
+ '[' '' '!=' '' ']'
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf
+ CLASSPATH=/data/apache-nutch-1.13/runtime/deploy/conf:/usr/lib/jvm/java-8-oracle/jre//lib/tools.jar
+ IFS=
+ false
+ false
+ JAVA_LIBRARY_PATH=
+ '[' -d /data/apache-nutch-1.13/runtime/deploy/lib/native ']'
+ '[' false = true -a X '!=' X ']'
+ unset IFS
+ '[' '' = '' ']'
+ NUTCH_LOG_DIR=/data/apache-nutch-1.13/runtime/deploy/logs
+ '[' '' = '' ']'
+ NUTCH_LOGFILE=hadoop.log
+ false
+ NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR")
+ NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Dhadoop.log.file="$NUTCH_LOGFILE")
+ '[' x '!=' x ']'
+ '[' fetch = crawl ']'
+ '[' fetch = inject ']'
+ '[' fetch = generate ']'
+ '[' fetch = freegen ']'
+ '[' fetch = fetch ']'
+ CLASS=org.apache.nutch.fetcher.Fetcher
+ EXEC_CALL=(hadoop jar "$NUTCH_JOB")
+ false
++ which hadoop
++ wc -l
+ '[' 1 -eq 0 ']'
+ exec hadoop jar /data/apache-nutch-1.13/runtime/deploy/apache-nutch-1.13.job org.apache.nutch.fetcher.Fetcher -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: starting at 2017-05-02 06:00:43
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher: segment: crawl/segments/20170502060036
17/05/02 06:00:43 INFO fetcher.Fetcher: Fetcher Timelimit set for : 1493733643194
17/05/02 06:00:44 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
17/05/02 06:00:44 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
17/05/02 06:00:44 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
17/05/02 06:00:44 ERROR fetcher.Fetcher: Fetcher: java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:630)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:435)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:55)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:141)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
        at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1807)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

Error running:
  /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 crawl/segments/20170502060036 -noParsing -threads 50
Failed with exit value 255.
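
For readers landing on this thread with the same stack trace: "Wrong FS: hdfs://..., expected: file:///" is the generic Hadoop symptom of testing a fully-qualified hdfs:// Path against a FileSystem handle that was opened for a different scheme, in this case the RawLocalFileSystem reached from FetcherOutputFormat.checkOutputSpecs during output-spec checking at job submission time. Below is a minimal sketch of that pattern and the usual remedy in Hadoop client code. It is illustrative only: the class name is made up, the hard-coded path is simply copied from the error above, it is not the actual Nutch source, and it assumes a stock Hadoop 2.x client on the classpath.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class WrongFsSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path crawlFetch = new Path(
          "hdfs://localhost:9000/user/root/crawl/segments/20170502060036/crawl_fetch");

      // Anti-pattern: FileSystem.get(conf) returns the client's *default* filesystem.
      // If that default resolves to file:/// in the JVM doing the check, an exists()
      // call on a fully-qualified hdfs:// path fails exactly like the trace above:
      // java.lang.IllegalArgumentException: Wrong FS: hdfs://..., expected: file:///
      FileSystem defaultFs = FileSystem.get(conf);
      // defaultFs.exists(crawlFetch);   // would throw the Wrong FS exception

      // Usual remedy: resolve the filesystem from the Path itself, so the hdfs://
      // scheme selects the HDFS implementation regardless of the configured default.
      FileSystem fsForPath = crawlFetch.getFileSystem(conf);
      System.out.println("crawl_fetch exists: " + fsForPath.exists(crawlFetch));
    }
  }

Whether the local handle in this particular run comes from the client-side configuration or from the FileSystem that the submission code hands to FetcherOutputFormat (note that the generator output above ran under mapred.LocalJobRunner, and the fetcher client initialises local JobTracker JVM metrics just before the error) is exactly what the set -x trace was meant to help narrow down.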

-----Original Message-----
From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com] 
Sent: 02 May 2017 13:54
To: user@nutch.apache.org
Subject: Re: Wrong FS exception in Fetcher

Hi Yossi,

strange error, indeed. Is it also reproducible in pseudo-distributed mode using Hadoop 2.7.2,
the version Nutch depends on?

Could you also add the line
  set -x
to bin/nutch and run bin/crawl again to see how all steps are executed.

Thanks,
Sebastian

On 04/30/2017 04:04 PM, Yossi Tamari wrote:
> Hi,
> 
>  
> 
> I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pseudo-distributed
> distributed mode.
> 
> Running the command:
> 
> Deploy/bin/crawl urls crawl 2
> 
> The Injector and Generator run successfully, but in the Fetcher I get the
> following error:
> 
> 17/04/30 08:43:48 ERROR fetcher.Fetcher: Fetcher:
> java.lang.IllegalArgumentException: Wrong FS:
> hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch,
> expected: file:///
> 
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
> 
>         at
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:8
> 6)
> 
>         at
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFile
> System.java:630)
> 
>         at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFi
> leSystem.java:861)
> 
>         at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.jav
> a:625)
> 
>         at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:43
> 5)
> 
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1436)
> 
>         at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputF
> ormat.java:55)
> 
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:270)
> 
>         at
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java
> :141)
> 
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
> 
>         at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338)
> 
>         at java.security.AccessController.doPrivileged(Native Method)
> 
>         at javax.security.auth.Subject.doAs(Subject.java:422)
> 
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
> va:1807)
> 
>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338)
> 
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
> 
>         at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
> 
>         at java.security.AccessController.doPrivileged(Native Method)
> 
>         at javax.security.auth.Subject.doAs(Subject.java:422)
> 
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
> va:1807)
> 
>         at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
> 
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
> 
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
> 
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
> 
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:521)
> 
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> 
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:495)
> 
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62
> )
> 
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl
> .java:43)
> 
>         at java.lang.reflect.Method.invoke(Method.java:498)
> 
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
> 
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
> 
>  
> 
> Error running:
> 
>   /data/apache-nutch-1.13/runtime/deploy/bin/nutch fetch -D
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> crawl/segments/20170430084337 -noParsing -threads 50
> 
> Failed with exit value 255.
> 
>  
> 
>  
> 
> Any ideas how to fix this?
> 
>  
> 
> Thanks,
> 
>                Yossi.
> 
> 



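
For anyone reproducing this, a couple of client-side sanity checks can help separate a configuration mismatch (a local default filesystem, or the local job runner) from a problem in the code that obtains the FileSystem handle. These are illustrative commands for a stock Hadoop 2.x install, not steps taken in this thread, and they assume the hadoop/hdfs tools are on the PATH of the machine running bin/crawl:

  # Which default filesystem does the submitting client resolve?
  hdfs getconf -confKey fs.defaultFS

  # Is the generated segment where the fetch command expects it?
  hadoop fs -ls crawl/segments/20170502060036

Comparing those answers with what the error message shows (an hdfs:// qualified path being checked against the local filesystem) narrows down whether the mismatch sits in the client configuration or in the code path that picks the FileSystem.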