
Posted to user@nutch.apache.org by blunderboy <sa...@gmail.com> on 2012/05/22 12:40:41 UTC

Get Parent of URLs fetched by nutch

I am running the Apache Nutch 1.4 crawler and want to store some additional
information: the parent of every URL it fetches.

For example, suppose I crawl a page a.html that has two anchor links, to
b.html and c.html. When I crawl a.html, I should get something like this:

a.html null
b.html a.html
c.html a.html

I want to store something like this. I have read how Nutch works and have
run Nutch in Eclipse too. I also read Fetcher.java and added logging where it
fetches content, but I could not find where Nutch collects the child URLs
(outlinks) of a given page. I think this happens after the parsing step.

--
View this message in context: http://lucene.472066.n3.nabble.com/Get-Parent-of-URLs-fetched-by-nutch-tp3985369.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Get Parent of URLs fetched by nutch

Posted by blunderboy <sa...@gmail.com>.

I have seen some code in OutlinkExtractor.java.
Maybe someone can tell me how to use it, because I think it contains what I
need.


Re: Get Parent of URLs fetched by nutch

Posted by Julien Nioche <li...@gmail.com>.

Implement your own scoring filter and add the URL of the source page to the
metadata of its targets (the outlinks). See
https://issues.apache.org/jira/browse/NUTCH-1331 for something (vaguely)
related.
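
A rough sketch of that idea, written without Nutch on the classpath so that it
runs on its own. The Target class and the "parent.url" key are hypothetical
stand-ins for a CrawlDatum and its metadata; neither comes from Nutch. In a
real plugin the same tagging would presumably go in the ScoringFilter method
that receives the source URL and the outlink targets (distributeScoreToOutlinks
in Nutch 1.x, as far as I know), so check the interface in your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParentLinkSketch {

    /** Hypothetical metadata key; not a Nutch constant. */
    static final String PARENT_KEY = "parent.url";

    /** Stand-in for a per-URL record with metadata (the role a CrawlDatum plays). */
    static final class Target {
        final String url;
        final Map<String, String> metadata = new LinkedHashMap<>();

        Target(String url) {
            this.url = url;
        }
    }

    /** Tags every outlink with the URL of the page it was found on. */
    static List<Target> tagWithParent(String fromUrl, List<String> outlinks) {
        List<Target> targets = new ArrayList<>();
        for (String link : outlinks) {
            Target target = new Target(link);
            target.metadata.put(PARENT_KEY, fromUrl);
            targets.add(target);
        }
        return targets;
    }

    /** Builds "url parent" lines; the seed page has no parent, so it gets null. */
    static List<String> report(String seedUrl, List<Target> targets) {
        List<String> lines = new ArrayList<>();
        lines.add(seedUrl + " null");
        for (Target target : targets) {
            lines.add(target.url + " " + target.metadata.get(PARENT_KEY));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Target> targets =
            tagWithParent("a.html", Arrays.asList("b.html", "c.html"));
        for (String line : report("a.html", targets)) {
            System.out.println(line);
        }
    }
}
```

Run against the a.html example from the question, this prints the three
"url parent" lines asked for. Keeping the parent in the target's metadata means
it travels with the URL, so it can be read back later without a separate
lookup table.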

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble