Posted to user@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/18 06:45:26 UTC
Parse Timeout?
Hi,
I've been getting a strange timeout exception while parsing a large
sitemap XML document. I've set the timeout in nutch-site.xml to -1 and
to large values, and ran "ant clean && ant runtime" before deploying the
parse job, to no avail. Restarting the cluster didn't help either. The
strange thing is that the error happens exactly 30 seconds after the job
starts, so something must be wrong with the config. Here's the log:
2017-08-18 06:05:11,257 WARN [main] org.apache.nutch.parse.ParseUtil: Error parsing https://www.mscdirect.com/detail16.xml
java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:174)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:163)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:146)
    at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:337)
    at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:151)
    at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:88)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-08-18 06:05:11,258 WARN [main] org.apache.nutch.parse.ParseUtil: Unable to successfully parse content https://www.mscdirect.com/detail16.xml of type application/xml
2017-08-18 06:05:11,349 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1502934094771_0059_m_000000_0 is done. And is in the process of committing
2017-08-18 06:05:11,398 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1502934094771_0059_m_000000_0' done.
2017-08-18 06:05:11,400 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.
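For reference, the TimeoutException thrown from FutureTask.get in the
trace above is what Java raises when a timed wait on a future expires.
Below is a minimal Java sketch of that general pattern, not Nutch's
actual code: the class ParseTimeoutSketch, the method runWithTimeout,
and the millisecond values are all hypothetical, and treating a negative
limit as "no timeout" is an assumption modelled on the -1 setting
mentioned above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of a time-limited parse; not taken from Nutch.
class ParseTimeoutSketch {

    // Runs parseTask with a time limit given in milliseconds.
    // A negative limit is assumed to mean "no limit": the task runs inline.
    // Returns null when the task does not finish within the limit.
    static String runWithTimeout(Callable<String> parseTask, long timeoutMillis)
            throws Exception {
        if (timeoutMillis < 0) {
            return parseTask.call();
        }
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = executor.submit(parseTask);
            try {
                // A timed get throws TimeoutException once the limit expires.
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the worker thread
                return null;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Finishes well inside the limit.
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        // Sleeps past the limit, so the timed get gives up first.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(500);
            return "parsed";
        }, 50));
    }
}
```

If the limit were really being read from my nutch-site.xml, I would
expect the failure time to move with the configured value; a fixed 30
seconds suggests that some default is still in effect.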
The Hadoop version is 2.6.0 and the Nutch version is 2.x, running on a
5-node AWS cluster managed by Cloudera Manager.
Any help would be appreciated, thanks!
Michael