Posted to user@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/18 06:45:26 UTC

Parse Timeout?

Hi,

I've been getting a strange timeout exception while parsing a large 
sitemap XML document. I've tried setting the timeout in nutch-site.xml to 
-1 and to large values, and ran "ant clean && ant runtime" before 
deploying the parse job, to no avail. Restarting the cluster didn't help 
either. The strange thing is that the error occurs exactly 30 seconds 
after the job starts, so something must be wrong with the config. Here's 
the log:

2017-08-18 06:05:11,257 WARN [main] org.apache.nutch.parse.ParseUtil: Error parsing https://www.mscdirect.com/detail16.xml
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:174)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:163)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:146)
	at org.apache.nutch.parse.ParseUtil.process(ParseUtil.java:337)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:151)
	at org.apache.nutch.parse.ParserJob$ParserMapper.map(ParserJob.java:88)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2017-08-18 06:05:11,258 WARN [main] org.apache.nutch.parse.ParseUtil: Unable to successfully parse content https://www.mscdirect.com/detail16.xml of type application/xml
2017-08-18 06:05:11,349 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1502934094771_0059_m_000000_0 is done. And is in the process of committing
2017-08-18 06:05:11,398 INFO [main] org.apache.hadoop.mapred.Task: Task 'attempt_1502934094771_0059_m_000000_0' done.
2017-08-18 06:05:11,400 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
2017-08-18 06:05:11,401 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system shutdown complete.

The Hadoop version is 2.6.0 and the Nutch version is 2.x, running on a 
5-node AWS cluster managed by Cloudera Manager.
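
For reference, the override I mean has roughly the shape below. This is 
only a sketch: it assumes the relevant property is parser.timeout (a value 
in seconds that defaults to 30 in nutch-default.xml, with -1 meant to 
disable the timeout), and the value shown is a placeholder.

```xml
<!-- nutch-site.xml: sketch of the timeout override.
     Assumption: parser.timeout is the property in question. -->
<configuration>
  <property>
    <name>parser.timeout</name>
    <!-- Timeout in seconds; -1 is meant to disable the parse timeout. -->
    <value>-1</value>
  </property>
</configuration>
```

If that assumption holds, the 30-second default would match the delay seen 
in the log, which would be consistent with the job not picking up this 
override.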

Any help would be appreciated, thanks!

Michael