Posted to dev@nutch.apache.org by blue-wolf Yang <bl...@gmail.com> on 2011/06/11 07:07:21 UTC

Parse application/xhtml+xml error

Hi,
I'm testing a custom parser plugin for Nutch 1.2 that matches some regular
expressions in the content and stores the matched text in my database.
When I tested it in Eclipse, everything worked well. But when I use it in my
production environment, some warnings are logged in hadoop.log, like the
following:

> 2011-06-11 00:33:06,760 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
> not claim to support contentType: application/xhtml+xml
> 2011-06-11 00:33:07,302 INFO  fetcher.Fetcher - -activeThreads=1,
> spinWaiting=0, fetchQueues.totalSize=0
> 2011-06-11 00:33:08,303 INFO  fetcher.Fetcher - -activeThreads=1,
> spinWaiting=0, fetchQueues.totalSize=0
> 2011-06-11 00:33:09,303 INFO  fetcher.Fetcher - -activeThreads=1,
> spinWaiting=0, fetchQueues.totalSize=0
> 2011-06-11 00:33:09,940 WARN  parse.ParseUtil - Unable to successfully
> parse content http://www.eccom.com.cn/EN/ of type application/xhtml+xml
> 2011-06-11 00:33:09,943 WARN  fetcher.Fetcher - Error parsing:
> http://www.eccom.com.cn/EN/: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content

When I remove the plugin from nutch-site.xml, crawling works correctly. Any
ideas? Thanks.

Re: Parse application/xhtml+xml error

Posted by Markus Jelsma <ma...@openindex.io>.

Check your parse-plugins.xml file. It needs to map each content type to a parse
plugin, and that plugin must also be loaded via the plugin.includes property in
your nutch-site.xml.

The plugin's own plugin.xml file must also declare that it supports the content
type. See examples such as parse-html or parse-tika.
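
As a rough illustration of the three places that have to agree, here is a
sketch of the relevant fragments. It is not taken from the poster's setup: the
plugin id parse-myregex, the class org.example.parse.MyRegexParser, and the
shortened plugin.includes value are all hypothetical placeholders, and the
exact contentType syntax accepted by your Nutch version should be checked
against the plugin.xml files shipped with parse-html or parse-tika.

```xml
<!-- conf/parse-plugins.xml: map the content type to a parser plugin.
     "parse-myregex" is a hypothetical plugin id used for illustration. -->
<mimeType name="application/xhtml+xml">
  <plugin id="parse-myregex" />
</mimeType>

<!-- conf/parse-plugins.xml, inside <aliases>: tie the plugin id to the
     parser extension id (hypothetical class name). -->
<alias name="parse-myregex"
       extension-id="org.example.parse.MyRegexParser" />

<!-- plugins/parse-myregex/plugin.xml: the parser implementation must
     itself claim the same content type. -->
<implementation id="org.example.parse.MyRegexParser"
                class="org.example.parse.MyRegexParser">
  <parameter name="contentType" value="application/xhtml+xml" />
</implementation>

<!-- conf/nutch-site.xml: the plugin must be enabled. The value below is
     shortened for illustration; keep the other plugins you already use. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|myregex)|index-basic</value>
</property>
```

If the content type named in parse-plugins.xml is missing from the contentType
parameter in plugin.xml, ParserFactory logs the "does not claim to support
contentType" warning quoted in the original message.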
