Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2015/07/23 22:38:43 UTC

Re: A parser failure on a single document may fail crawling job

Hi Arkadi,

does the problem persist?
Which version of Nutch are you using?
Can you point to one file or URL to reproduce it?

Thanks,
Sebastian

On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> Hi Arkadi,
> 
> thanks for reporting that. Can you open a Jira ticket [1] to address this bug?
> 
> It's rather a bug of the plugin parse-tika and should be solved there,
> cf. https://issues.apache.org/jira/browse/TIKA-1240
> A plugin should be able to load all required classes.
> 
> Thanks,
> Sebastian
> 
> [1] https://issues.apache.org/jira/browse/NUTCH
> 
> 2015-06-23 3:59 GMT+02:00 <Arkadi.Kosmynin@csiro.au>:
> 
>     Hi,
> 
>     This is what happened:
> 
>     java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>             at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
>             <...>
>     Caused by: java.lang.IncompatibleClassChangeError: class org.apache.tika.parser.asm.XHTMLClassVisitor has interface org.objectweb.asm.ClassVisitor as super class
>                     at java.lang.ClassLoader.defineClass1(Native Method)
>                     at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>                     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>                     at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>                     at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>                     at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>                     at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>                     at java.security.AccessController.doPrivileged(Native Method)
>                     at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>                     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>                     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>                     at org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
>                     at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
>                     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> 
>     Suggested fix in ParseUtil:
> 
>     Replace
> 
>                 if (maxParseTime!=-1)
>                            parseResult = runParser(parsers[i], content);
>                 else
>                            parseResult = parsers[i].getParse(content);
> 
>     with
> 
>           try {
>             if (maxParseTime != -1)
>               parseResult = runParser(parsers[i], content);
>             else
>               parseResult = parsers[i].getParse(content);
>           } catch (Throwable e) {
>             LOG.warn("Parsing " + content.getUrl() + " with "
>                 + parsers[i].getClass().getName() + " failed: " + e.getMessage());
>             parseResult = null;
>           }
> 
>     Also replace
> 
>           if (maxParseTime!=-1)
>                       parseResult = runParser(p, content);
>            else
>                       parseResult = p.getParse(content);
> 
>     with
> 
>         try {
>           if (maxParseTime != -1)
>             parseResult = runParser(p, content);
>           else
>             parseResult = p.getParse(content);
>         } catch (Throwable e) {
>           LOG.warn("Parsing " + content.getUrl() + " with "
>               + p.getClass().getName() + " failed: " + e.getMessage());
>         }
> 
>     Regards,
>     Arkadi
> 
> 


RE: A parser failure on a single document may fail crawling job

Posted by Ar...@csiro.au.
Hi Sebastian,

I apologise for the long silence on this issue. I have been out of town and will be back on Monday. I will then do what you are asking within 2-3 days.

Regards,
Arkadi


RE: A parser failure on a single document may fail crawling job

Posted by Ar...@csiro.au.
Hi Sebastian,

> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.nagel@googlemail.com]
> Sent: Friday, 24 July 2015 6:39 AM
> To: user@nutch.apache.org
> Cc: Kosmynin, Arkadi (CASS, Marsfield) <Ar...@csiro.au>
> Subject: Re: A parser failure on a single document may fail crawling job
> 
> Hi Arkadi,
> 
> does the problem persist?

Yes.

> Which version of Nutch are you using?

1.9

> Can you point to one file or URL to reproduce it?

To reproduce:

- Remove a jar file that one of your parsers depends on. 
- Make Nutch parse any file using this parser.

This results in a NoSuchMethodError being thrown, and the crawling job fails.
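
To illustrate why a failure of this kind can get past ordinary exception handling: in Java, NoSuchMethodError and IncompatibleClassChangeError are subclasses of LinkageError, which extends Error rather than Exception, so a catch (Exception e) clause does not see them. The following is only a minimal sketch; the class name LinkageErrorDemo and the method classify are made up for illustration, and the error is simulated rather than produced by a missing jar.

```java
import java.io.IOException;

public class LinkageErrorDemo {

    /**
     * Throws the given failure and reports which catch clause handled it.
     * Hypothetical helper, used only to show the Exception/Error split.
     */
    static String classify(Throwable failure) {
        try {
            throw failure;
        } catch (Exception e) {
            // Checked and unchecked exceptions land here.
            return "Exception";
        } catch (Throwable t) {
            // Errors such as NoSuchMethodError are only caught here.
            return "Throwable";
        }
    }

    public static void main(String[] args) {
        // Simulated linkage problem: handled by the Throwable clause only.
        System.out.println(classify(new NoSuchMethodError("simulated")));
        // Ordinary I/O failure: handled by the Exception clause.
        System.out.println(classify(new IOException("simulated")));
    }
}
```

Run as is, this prints "Throwable" for the simulated linkage error and "Exception" for the I/O failure.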

I've created a JIRA issue, NUTCH-2071, and attached a patch. I believe this problem should be handled at the ParseUtil level, because people may use their own or third-party parsers, and Nutch should be protected from parser problems.
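
The idea of isolating parser failures at the calling level can be sketched roughly as follows. This is not the attached patch and not Nutch code: DocParser, SafeParseSketch and parseWithFallback are hypothetical names, and the parsers are reduced to functions from String to String so that the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;

public class SafeParseSketch {

    /** Hypothetical stand-in for a parser plugin; not the Nutch Parser interface. */
    interface DocParser {
        String parse(String content);
    }

    /**
     * Tries each parser in turn. Any Throwable raised by a parser is logged
     * and treated as "no result", so the next parser gets a chance and the
     * caller is never aborted by a single broken plugin.
     */
    static String parseWithFallback(List<DocParser> parsers, String content) {
        for (DocParser p : parsers) {
            try {
                String result = p.parse(content);
                if (result != null) {
                    return result;
                }
            } catch (Throwable t) {
                System.err.println("Parsing with " + p.getClass().getName()
                    + " failed: " + t);
            }
        }
        return null; // no parser produced a result
    }

    public static void main(String[] args) {
        // Simulates a parser whose dependency is missing or incompatible.
        DocParser broken = c -> {
            throw new NoSuchMethodError("simulated missing dependency");
        };
        DocParser working = c -> "parsed:" + c;

        System.out.println(parseWithFallback(Arrays.asList(broken, working), "doc"));
    }
}
```

One trade-off of catching Throwable is that it also swallows conditions such as OutOfMemoryError. A stricter variant might rethrow VirtualMachineError and isolate only the remaining failures.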

Regards,
Arkadi
