Posted to user@nutch.apache.org by A Laxmi <a....@gmail.com> on 2013/10/03 20:15:46 UTC

[Nutch 2.2.1 + HBase 0.90.4] Error tika.TikaParser

I got this error, and researching it has not helped much. Please
help.

Error tika.TikaParser - Error parsing http://www.###.###.##/###/abc.xml
org.apache.tika.exception.TikaException: RSS parse error
Caused by com.sun.syndication.io.ParsingFeedException: Invalid XML: Error
on line 436: The element "item" must be terminated by the matching end-tag
"</item>"

Re: [Nutch 2.2.1 + HBase 0.90.4] Error tika.TikaParser

Posted by A Laxmi <a....@gmail.com>.
Thanks, Sebastian!


On Sat, Oct 5, 2013 at 5:48 AM, Sebastian Nagel
<wa...@googlemail.com> wrote:

> Hi,
>
> According to the error message, the RSS feed abc.xml
> is truncated or invalid. Could you check the property
> "http.content.limit"? The default is 64 kB, so large RSS feeds
> may get truncated.
>
> You can test the parsing on its own with:
>
> % bin/nutch parsechecker 'http://www.###.###.##/###/abc.xml'
>
> Cheers,
> Sebastian

Re: [Nutch 2.2.1 + HBase 0.90.4] Error tika.TikaParser

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

According to the error message, the RSS feed abc.xml
is truncated or invalid. Could you check the property
"http.content.limit"? The default is 64 kB, so large RSS feeds
may get truncated.

You can test the parsing on its own with:

% bin/nutch parsechecker 'http://www.###.###.##/###/abc.xml'

Cheers,
Sebastian
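
To tell the two cases in the reply apart (a feed that is invalid at the
source versus one that only breaks once it is cut off at the content
limit), a rough check can be sketched in Python. This is an illustration
only, not part of Nutch: the names is_well_formed and diagnose are
hypothetical, the feed is assumed to have been saved to a local file
first, and 65536 bytes is used as the 64 kB default of
"http.content.limit".

```python
import xml.etree.ElementTree as ET

# Assumed default of http.content.limit, in bytes (64 kB).
DEFAULT_LIMIT = 65536


def is_well_formed(data: bytes) -> bool:
    """Return True if data parses as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def diagnose(data: bytes, limit: int = DEFAULT_LIMIT) -> str:
    """Classify a feed as 'invalid', 'truncated-by-limit', or 'ok'.

    A negative limit means "no truncation", mirroring how a negative
    http.content.limit is interpreted.
    """
    if not is_well_formed(data):
        # The feed is broken even before any truncation happens.
        return "invalid"
    if limit >= 0 and len(data) > limit and not is_well_formed(data[:limit]):
        # The feed is fine at the source but breaks once cut at the limit.
        return "truncated-by-limit"
    return "ok"


# Example use, assuming the feed was saved locally as abc.xml:
#   with open("abc.xml", "rb") as fh:
#       print(diagnose(fh.read()))
```

If the result is "truncated-by-limit", raising "http.content.limit" in
conf/nutch-site.xml (a negative value disables truncation) is the likely
fix. If it is "invalid", the feed itself is malformed and the limit is
not the cause.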

