Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2012/05/24 21:28:53 UTC

Re: RSS parser

(I know this reply is late.)

Have you checked the property http.content.limit? The default is only
64 kB (65536 bytes), and RSS feeds are often larger. It looks like the
content was truncated, which would explain why the XML ends unexpectedly:

 > Caused by: com.sun.syndication.io.ParsingFeedException:
 > Invalid XML: Error on line 300: XML document
 > structures must start and end within the same entity.
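
For reference, a minimal sketch of how the limit could be raised in
nutch-site.xml. The value 1048576 (1 MB) is only an illustrative
assumption, not a value taken from this thread; pick one that fits your
feeds. According to the property description in nutch-default.xml, a
negative value disables truncation altogether.

```xml
<!-- Goes inside the <configuration> element of conf/nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Assumed example value: 1 MB in bytes. Use -1 for no truncation. -->
  <value>1048576</value>
</property>
```

The page has to be fetched again after the change, since the segment
already stored holds the truncated content.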

On 02/10/2012 01:24 PM, Michael Kazekin wrote:
> On 02/08/2012 06:44 PM, dspathis wrote:
>>> http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Frss.sciam.com%2Fsciam%2Fearth-and-environment
>>>
>> Hmmm. I just tried the URL you provided with my own Nutch 1.4 installation.
>> It gets parsed successfully *both* with the feed and with the tika parser (I
>> modified my config first to use the former, then to use the latter).
>>
>> I think your config might still have issues. Maybe you could turn on TRACE
>> level logging to see if you can get some clues that way?
>
> 1) I installed Nutch 1.4 from scratch,
>
> 2) changed nutch-site.xml from empty to:
>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html)</value>
> </property>
>
> <property>
> <name>http.robots.agents</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html),*</value>
> </property>
>
> </configuration>
>
> 3) commented out the feed plugin (in parse-plugins.xml)
>
> <mimeType name="application/rss+xml">
> <plugin id="parse-tika" />
> <!--<plugin id="feed" />-->
> </mimeType>
>
> 4) changed the log level in log4j.properties
>
> log4j.logger.org.apache.nutch.fetcher.Fetcher=TRACE,cmdstdout
> log4j.logger.org.apache.nutch.parse.ParseSegment=TRACE,cmdstdout
>
>
> Then I ran inject, generate and fetch with that single RSS URL as the only seed, and got this exception from Tika:
>
>
> 2012-02-10 15:06:32,782 ERROR tika.TikaParser - Error parsing http://rss.sciam.com/sciam/earth-and-environment
> org.apache.tika.exception.TikaException: RSS parse error
> at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:106)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 300: XML document structures must start and end within the same entity.
> at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:207)
> at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:135)
> at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:68)
> ... 6 more
> Caused by: org.jdom.input.JDOMParseException: Error on line 300: XML document structures must start and end within the same entity.
> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
> at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:203)
> ... 8 more
> Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same entity.
> at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453)
> ... 9 more
> 2012-02-10 15:06:32,785 INFO parse.ParseSegment - Parsing: http://rss.sciam.com/sciam/earth-and-environment
> 2012-02-10 15:06:32,786 WARN parse.ParseSegment - Error parsing: http://rss.sciam.com/sciam/earth-and-environment: failed(2,0): RSS parse error
> 2012-02-10 15:06:32,787 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2012-02-10 15:06:33,235 INFO parse.ParseSegment - ParseSegment: finished at 2012-02-10 15:06:33, elapsed: 00:00:01
>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/RSS-parser-tp3719558p3726154.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>


Re: RSS parser

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,

On Mon, Jul 23, 2012 at 5:00 PM, ShlomiJ <sh...@gmail.com> wrote:
> @Lewis
> Any update on the differences between Nutch 1.4 and 1.5?

No difference between 1.4 & 1.5.1

Lewis

Re: RSS parser

Posted by ShlomiJ <sh...@gmail.com>.
@Lewis
Any update on the differences between Nutch 1.4 and 1.5?

@all
Any new insight into why the parse fails?

ShlomiJ


Re: RSS parser

Posted by Michael Kazekin <Mi...@mediainsight.info>.
Sebastian,

You are right. That was one of the issues, and I have already fixed it.
Thank you anyway!


On 05/24/2012 11:28 PM, Sebastian Nagel wrote:
> (it's too late I know)
>
> Have you checked the property http.content.limit
> (default is only 64kB, RSS feeds are often larger).
> Looks like the content is truncated:
>
> > Caused by: com.sun.syndication.io.ParsingFeedException:
> > Invalid XML: Error on line 300: XML document
> > structures must start and end within the same entity.