Posted to user@nutch.apache.org by lewis john mcgibbney <le...@apache.org> on 2016/11/15 17:15:33 UTC

Re: user Digest 7 Nov 2016 19:53:09 -0000 Issue 2672

Hi Eyeris,
I've just tried the Nutch master branch to parse outlinks from a number of RSS
feeds, one example being 'http://www.jpl.nasa.gov/blog/feed/'. This works
perfectly with both the feed and parse-tika plugins, and outlinks are extracted
as expected.
Can you provide an example of the RSS feeds you are trying to parse
outlinks from? Are they valid?
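On the validity question, a rough local check outside of Nutch can help narrow
things down. The following is only a minimal Python sketch, not anything Nutch
itself does: the function name extract_feed_links and the sample feed are made
up for illustration, and it only covers plain RSS 2.0 and Atom (RSS 1.0/RDF
feeds use namespaced items and are not handled). If the feed is not
well-formed XML, the parse call raises an error; otherwise you get the list of
item links you would expect to see as outlinks.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RSS 2.0 elements are not namespaced).
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_links(feed_xml):
    """Return the item/entry links found in an RSS 2.0 or Atom feed string.

    Raises xml.etree.ElementTree.ParseError if the feed is not well-formed.
    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    root = ET.fromstring(feed_xml)
    links = []

    # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())

    # Atom: <feed><entry><link href="URL"/></entry></feed>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href:
                links.append(href)

    return links


# Made-up sample feed; replace with the contents of the feed being debugged.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <item><title>First</title><link>http://example.org/a</link></item>
    <item><title>Second</title><link>http://example.org/b</link></item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for url in extract_feed_links(SAMPLE_RSS):
        print(url)
```

If this kind of check finds the links but Nutch does not, the problem is more
likely in the parser or plugin configuration than in the feed itself.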
An excellent resource for this kind of troubleshooting is the
ParseChecker tool:
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
hth

On Mon, Nov 7, 2016 at 11:53 AM, <us...@nutch.apache.org> wrote:

> From: Eyeris Rodriguez Rueda <er...@uci.cu>
> To: user@nutch.apache.org
> Cc:
> Date: Sun, 6 Nov 2016 12:14:29 -0500 (CST)
> Subject: how to insert outlinks from rss in crawldb ?
>
> Hi.
> I am using Nutch 1.12 and Solr 4.10.3.
>
> RSS is a significant way to discover new URLs to fetch.
>
> The links detected in an RSS feed are not inserted into the crawldb as new
> URLs. Can anybody tell me why?
> Can anybody help me, or point me in the right direction, on how to insert
> outlinks from a feed into the crawldb so that they are visited in the next
> iteration?
>
> I have activated only the Tika parser because, when using both (tika and
> feed), the content and outlinks fields are empty in Solr.
>
>  E don´t are inserted in crawldb and also don´t visited in next iterations
> of c
>

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney