Posted to dev@nutch.apache.org by Vijith <vi...@gmail.com> on 2012/08/30 08:14:31 UTC

Need some directions

Hi all,

I am new to dev... I am working on NUTCH-1150...
I would like to get some directions before I can start... Right now I am
going through the Fetcher.java code...

-- 
*. . . . . thanks & regards*

*Vijith V.*

Re: Need some directions

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Vijith,

This is very welcome; thanks for your interest in developing Nutch.

I would suggest you comment directly on this issue with your thoughts,
that way we can keep a log of who is doing what.

The issue looks like quite an interesting improvement, but right now I am
not exactly sure how to solve it. Try breaking the Fetcher down into
manageable chunks and work out where the functionality needs to be
implemented to track already parsed redirect URLs.

Lewis


RE: Need some directions

Posted by Markus Jelsma <ma...@openindex.io>.
 
-----Original message-----
> From:Vijith <vi...@gmail.com>
> Sent: Fri 31-Aug-2012 15:44
> To: dev@nutch.apache.org
> Subject: Re: Need some directions
> 
> I have tried running Nutch with a sample site that has two different URLs redirecting to a common resource.
> I could not find any clues in hadoop.log showing where the common resource is parsed multiple times.
> Could someone please explain the exact scenario that triggers this bug?

In the Jira comment you said it fetched page4 twice now.

> 
> And how does this bug relate to NUTCH-1184?

It relates to NUTCH-1184 because if URLs in the same fetch list link to a common page, that page can be followed as well.

We solved this issue by keeping a list of crawled URLs in an external Bloom filter.
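
For readers unfamiliar with the idea, here is a minimal sketch of how a Bloom filter can record already crawled URLs. It is not the implementation referred to above and it is not Nutch code: the class name SeenUrlFilter, its methods, and the sizing parameters are all hypothetical, and it runs in memory in a single process, whereas an external filter would be shared between fetcher tasks. A Bloom filter can report false positives (a new URL reported as seen) but never false negatives.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical helper (not part of Nutch): a small Bloom filter over URL strings. */
final class SeenUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SeenUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** 64-bit FNV-1a hash of the URL bytes. */
    private static long fnv1a(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int position(long h1, long h2, int i) {
        return (int) Math.floorMod(h1 + i * h2, (long) numBits);
    }

    /**
     * Marks the URL as seen. Returns true if at least one of its bits was
     * unset, meaning the URL was definitely not recorded before.
     */
    boolean addIfAbsent(String url) {
        long h = fnv1a(url.getBytes(StandardCharsets.UTF_8));
        long h1 = h & 0xffffffffL;
        long h2 = (h >>> 32) | 1L; // forced odd so the step is never zero
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            int idx = position(h1, h2, i);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    /** Returns false only if the URL was certainly never added. */
    boolean mightContain(String url) {
        long h = fnv1a(url.getBytes(StandardCharsets.UTF_8));
        long h1 = h & 0xffffffffL;
        long h2 = (h >>> 32) | 1L;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

In a fetcher, the redirect target would be passed to addIfAbsent before it is followed, and the redirect skipped when the call returns false. The occasional false positive means a page may be skipped that was never fetched, which is the usual trade-off for the small memory footprint.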


Re: Need some directions

Posted by Vijith <vi...@gmail.com>.
I have tried running Nutch with a sample site that has two different URLs
redirecting to a common resource.
I could not find any clues in hadoop.log showing where the common resource is
parsed multiple times.
Could someone please explain the exact scenario that triggers this bug?

And how does this bug relate to NUTCH-1184?


-- 
*. . . . . thanks & regards*

*Vijith V.*