Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2022/01/15 14:15:00 UTC
[jira] [Created] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
Sebastian Nagel created NUTCH-2937:
--------------------------------------
Summary: parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode
Key: NUTCH-2937
URL: https://issues.apache.org/jira/browse/NUTCH-2937
Project: Nutch
Issue Type: Bug
Components: parser, plugin
Affects Versions: 1.19
Reporter: Sebastian Nagel
Fix For: 1.19
While testing NUTCH-2919 I've seen the following error, caused by conflicting versions of the commons-io dependency:
- 2.11.0 Nutch core
- 2.11.0 parse-tika (excluded to avoid duplicated dependencies)
- 2.5 provided by Hadoop
This causes errors when parsing some (but not all) office and other documents, for example:
{noformat}
2022-01-15 01:36:31,365 WARN [FetcherThread] org.apache.nutch.parse.ParseUtil: Error parsing http://kurskrun.ru/privacypolicy with org.apache.nutch.parse.tika.TikaParser
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:188)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:92)
at org.apache.nutch.fetcher.FetcherThread.output(FetcherThread.java:715)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:431)
Caused by: java.lang.NoSuchMethodError: 'org.apache.commons.io.input.CloseShieldInputStream org.apache.commons.io.input.CloseShieldInputStream.wrap(java.io.InputStream)'
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:120)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:289)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:151)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:90)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:34)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
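The NoSuchMethodError above suggests that, at runtime, CloseShieldInputStream is resolved from the older commons-io 2.5 provided by Hadoop rather than from 2.11.0. A minimal sketch of how this could be checked, assuming a small stand-alone helper (the class name ClasspathProbe and its methods are hypothetical and not part of Nutch, Tika or Hadoop): it reports where a class was loaded from and whether it has a method with the expected signature.

```java
import java.io.InputStream;
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Nutch, Tika or Hadoop.
class ClasspathProbe {

    /** Returns the jar or directory a class was loaded from. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no code source.
        return (src == null || src.getLocation() == null)
                ? "(bootstrap / JDK)"
                : src.getLocation().toString();
    }

    /** True if the class can be loaded and has a public method with this signature. */
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class.forName(className).getMethod(methodName, paramTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class and method named in the stack trace above.
        String name = "org.apache.commons.io.input.CloseShieldInputStream";
        try {
            Class<?> cls = Class.forName(name);
            System.out.println(name + " loaded from " + locationOf(cls));
            System.out.println("wrap(InputStream) present: "
                    + hasMethod(name, "wrap", InputStream.class));
        } catch (ClassNotFoundException | LinkageError e) {
            System.out.println(name + " is not on the classpath: " + e);
        }
    }
}
```

The result depends on the classloader that runs the check. To reproduce the situation in the stack trace, it would have to run in the same task JVM and classloader context as TikaParser, not in a separate local JVM.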
--
This message was sent by Atlassian Jira
(v8.20.1#820001)