Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/09/03 15:01:37 UTC

RE: Need some directions

We ignore false positives for now. A common solution is to maintain a set of known false positives and to check that set for membership before consulting the bloom filter.
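
A minimal sketch of that idea follows. It is not the code we run: the class name, method names, and hash scheme are all made up for illustration, and it assumes false positives are discovered somewhere else (for example by checking against the CrawlDb) and reported to the filter.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Nutch: a small Bloom filter of crawled
// URLs guarded by an exact set of URLs known to be false positives.
class CrawledUrlFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private final Set<String> knownFalsePositives = new HashSet<>();

    CrawledUrlFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded FNV-1a style hash; chosen only to keep the sketch self-contained.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: derive the i-th bit index from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record a URL as crawled. From now on a hit for it is a true positive,
    // so it must no longer be listed as a false positive.
    void markCrawled(String url) {
        knownFalsePositives.remove(url);
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    // Register a URL that the Bloom filter reports but that was never crawled.
    void addKnownFalsePositive(String url) {
        knownFalsePositives.add(url);
    }

    // Check the false-positive set first, then fall back to the Bloom filter.
    boolean isCrawled(String url) {
        if (knownFalsePositives.contains(url)) {
            return false;
        }
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

The false-positive set only stays small if the filter is sized so that false positives are rare; otherwise the exact set grows and the memory advantage of the bloom filter is lost.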
 
-----Original message-----
> From:Vijith <vi...@gmail.com>
> Sent: Mon 03-Sep-2012 13:01
> To: Markus Jelsma <ma...@openindex.io>
> Subject: Re: Need some directions
> 
> I tried with bloom filters. It's working fine for my sample site. So how did you handle false positives then?
> I am working on it as part of a training assignment. I thought this would be a good starting point to learn Nutch code base.
> 
> On Fri, Aug 31, 2012 at 7:20 PM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> 
> -----Original message-----
> > From:Vijith <vijithkv.87@gmail.com>
> > Sent: Fri 31-Aug-2012 15:44
> > To: dev@nutch.apache.org
> > Subject: Re: Need some directions
> >
> > I have tried running Nutch on a sample site with two different URLs redirecting to a common resource.
> > I could not find any evidence in hadoop.log that the common resource is parsed multiple times.
> > Could someone please explain the exact scenario that creates this bug?
> 
> In the Jira comment you said it fetched page4 twice now.
> 
> >
> > And how does this bug relate to NUTCH-1184?
> 
> It relates to 1184 because if URLs in the same fetch list link to a common page, it can be followed as well.
> 
> We solved this issue by keeping a list of crawled URLs in an external bloom filter.
> 
> >
> > On Thu, Aug 30, 2012 at 11:44 AM, Vijith <vijithkv.87@gmail.com> wrote:
> > Hi all, 
> >
> > I am new to dev... I am working on NUTCH-1150...
> > I would like to get some directions before I can start... Right now I am going through the Fetcher.java code...
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
> >
> > --
> > . . . . . thanks & regards
> >
> > Vijith V.
> >
> >
> >
> 
> 
> 
> -- 
> . . . . . thanks & regards
> 
> Vijith V.
> 
> 
>