Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/07/09 15:51:00 UTC

[jira] [Commented] (NUTCH-2071) A parser failure on a single document may fail crawling job if parser.timeout=-1

    [ https://issues.apache.org/jira/browse/NUTCH-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537099#comment-16537099 ] 

ASF GitHub Bot commented on NUTCH-2071:
---------------------------------------

sebastian-nagel opened a new pull request #358: NUTCH-2071 A parser failure on a single document may fail crawling job if parser.timeout=-1
URL: https://github.com/apache/nutch/pull/358
 
 
   - also catch any Throwable if parser.timeout == -1 (parser is not called from ExecutorService)
   - improve log message: show full class name of called parser

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


>  A parser failure on a single document may fail crawling job if parser.timeout=-1
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-2071
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2071
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>            Reporter: Arkadi Kosmynin
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.14, 1.15
>
>         Attachments: NUTCH-2071.diff
>
>
> java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>         <...>
> Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>                 at java.lang.ClassLoader.defineClass1(Native Method)
>                 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>                 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>                 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>                 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>                 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>                 at java.security.AccessController.doPrivileged(Native Method)
>                 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>                 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>                 at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>                 at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>                 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
> Suggested fix in ParseUtil:
>
> Replace
>
>     if (maxParseTime != -1)
>       parseResult = runParser(parsers[i], content);
>     else
>       parseResult = parsers[i].getParse(content);
>
> with
>
>     try {
>       if (maxParseTime != -1)
>         parseResult = runParser(parsers[i], content);
>       else
>         parseResult = parsers[i].getParse(content);
>     } catch (Throwable e) {
>       LOG.warn("Parsing " + content.getUrl() + " with "
>           + parsers[i].getClass().getName() + " failed: " + e.getMessage());
>       parseResult = null;
>     }
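
The point of catching Throwable rather than Exception can be shown with a small standalone sketch. It is not the Nutch code: DocParser, parseGuarded and the example URLs are hypothetical names made up for illustration, and the simulated error stands in for the Tika/ASM linkage failure in the stack trace above. IncompatibleClassChangeError extends LinkageError, which extends Error, so a catch (Exception e) block would let it propagate and fail the whole job.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseGuardSketch {

    /** Hypothetical stand-in for a parser plugin; NOT the real Nutch Parser interface. */
    public interface DocParser {
        String parse(String content) throws Exception;
    }

    /**
     * Calls the parser directly (the situation with parser.timeout=-1) and
     * turns any failure into a warning plus a null result for that document.
     */
    public static String parseGuarded(DocParser parser, String url, String content,
                                      List<String> warnings) {
        try {
            return parser.parse(content);
        } catch (Throwable t) {
            // Throwable, not Exception: linkage problems such as
            // IncompatibleClassChangeError are subclasses of Error.
            warnings.add("Parsing " + url + " with " + parser.getClass().getName()
                    + " failed: " + t);
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        DocParser working = c -> c.toUpperCase();
        // Simulates the class-loading failure from the stack trace (assumption for the demo).
        DocParser broken = c -> {
            throw new IncompatibleClassChangeError("simulated linkage failure");
        };

        String first = parseGuarded(working, "http://example.org/a", "hello", warnings);
        String second = parseGuarded(broken, "http://example.org/b", "hello", warnings);

        // The second document fails, but processing continues past it.
        System.out.println(first + " " + second + " " + warnings.size());
    }
}
```

Note that catching Throwable is deliberately broad: it also covers errors such as OutOfMemoryError, so logging the failure (as the suggested fix does) matters for diagnosing what was swallowed.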



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)