
Posted to user@nutch.apache.org by "O. Klein" <kl...@octoweb.nl> on 2015/03/21 20:41:53 UTC

Feed

I'm trying to get the feed plugin to extract links with Nutch 1.9, but I keep
running into the following exception, even with a test case like
http://www.feedforall.com/sample.xml

Any clues as to the cause of this?

2015-03-21 20:30:53,529 INFO  parse.ParseSegment - Parsed (15ms):http://www.feedforall.com/law-enforcement.htm
2015-03-21 20:30:53,531 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/computer-service.htm
2015-03-21 20:30:53,533 INFO  parse.ParseSegment - Parsed (1ms):http://www.feedforall.com/sample.xml
2015-03-21 20:30:53,534 INFO  parse.ParseSegment - Parsed (1ms):http://www.feedforall.com/politics.htm
2015-03-21 20:30:53,535 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/weather.htm
2015-03-21 20:30:53,536 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/restaurant.htm
2015-03-21 20:30:53,537 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/schools.htm
2015-03-21 20:30:53,538 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/real-estate.htm
2015-03-21 20:30:53,539 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/government.htm
2015-03-21 20:30:53,540 INFO  parse.ParseSegment - Parsed (0ms):http://www.feedforall.com/banks.htm
2015-03-21 20:30:53,666 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/banks.htm
2015-03-21 20:30:53,667 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/computer-service.htm
2015-03-21 20:30:53,668 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/government.htm
2015-03-21 20:30:53,669 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/law-enforcement.htm
2015-03-21 20:30:53,670 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/politics.htm
2015-03-21 20:30:53,670 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/real-estate.htm
2015-03-21 20:30:53,671 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/restaurant.htm
2015-03-21 20:30:53,672 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/schools.htm
2015-03-21 20:30:53,675 WARN  parse.ParseOutputFormat - Can't read fetch time for: http://www.feedforall.com/weather.htm
2015-03-21 20:30:53,985 INFO  parse.ParseSegment - ParseSegment: finished at 2015-03-21 20:30:53, elapsed: 00:00:03



--
View this message in context: http://lucene.472066.n3.nabble.com/Feed-tp4194433.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Feed

Posted by "O. Klein" <kl...@octoweb.nl>.

Never mind. I got it working now.




--
View this message in context: http://lucene.472066.n3.nabble.com/Feed-tp4194433p4194491.html

Re: Feed

Posted by "O. Klein" <kl...@octoweb.nl>.

Thank you for your answer.

For some unknown reason the exception is gone.

This brings me to the next issue:

How do I get metadata indexed with the feed plugin?

From the parsechecker I get:

Parse Metadata: feed=http://www.feedforall.com/industry-solutions.htm
tag=Computers/Software/Internet/Site Management/Content Management
published=1098198545000 

So how would I get the tag as a field indexed in Solr?
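
One possible approach, sketched here under assumptions this thread does not confirm: the index-metadata plugin can copy parse metadata keys into index fields through the index.parse.md property. The fragment assumes index-metadata has been added to plugin.includes and that the Solr schema defines a matching "tag" field; the key names are the ones shown in the parsechecker output above.

```xml
<!-- conf/nutch-site.xml (illustrative fragment, not a verified setup) -->
<property>
  <name>index.parse.md</name>
  <!-- Comma-separated parse metadata keys to copy into index fields.
       Assumes the index-metadata plugin is enabled in plugin.includes. -->
  <value>tag,published</value>
</property>
```

If the field still does not show up in Solr, the schema and any field mapping used by the indexer are the next places to check.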



--
View this message in context: http://lucene.472066.n3.nabble.com/Feed-tp4194433p4194476.html

Re: Feed

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

What exactly is the problem?
There is a warning in the logs, which stems from the fact
that the feed plugin adds the feed items without fetching
them. To avoid the warning
 WARN  parse.ParseOutputFormat - Can't read fetch
    time for: http://www.feedforall.com/schools.htm
the feed plugin should eventually add a fetch time.

In general, it should work and you'll get the feed items
as separate documents added to the index.

If it's just about following the links in the feed,
using the plugin "parse-tika" instead of "feed" will
do the job.

In short: although the plugins "feed" and "parse-tika" are
both able to parse RSS/Atom feeds, they differ in how they
treat the feed items:
- (feed) adds them as individual documents with URL, title and text
  NOTE: feed item URLs are processed by URL filters and normalizers
- (parse-tika) treats a feed as one large document. Feed items
  are just list items plus one outlink
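
As a rough sketch of how the choice between the two plugins might be made: plugin selection is controlled by the plugin.includes property in conf/nutch-site.xml. The plugin list below is an assumption for illustration and will differ per installation; the only relevant part is whether "feed" appears among the active plugins. Depending on conf/parse-plugins.xml, the mime-type mapping for feeds may also need to point at the chosen plugin.

```xml
<!-- conf/nutch-site.xml (illustrative fragment; adjust the list to your install) -->
<property>
  <name>plugin.includes</name>
  <!-- With "feed" in the list, each feed item becomes its own document.
       Remove "|feed" to leave feeds to parse-tika, which treats the feed
       as one document and only adds the item links as outlinks. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```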

Cheers,
Sebastian
