Posted to dev@nutch.apache.org by Vijith <vi...@gmail.com> on 2012/08/31 14:29:07 UTC
Some questions regarding NUTCH-1150
Hi all,
(Please ignore my previous mail, if any)
I am new to dev... I am working on NUTCH-1150...
https://issues.apache.org/jira/browse/NUTCH-1150
I would like some direction before I start. Right now I am
going through the Fetcher.java code.
I have tried running Nutch against a sample site where two different URLs
redirect to a common resource.
I could not find any indication in hadoop.log that the common resource is
parsed multiple times.
Could someone please explain the exact scenario that triggers this bug?
And how does this bug relate to NUTCH-1184?
--
*Vijith V.*
Re: Some questions regarding NUTCH-1150
Posted by feng lu <am...@gmail.com>.
Hi Vijith
Maybe Markus Jelsma has already solved this issue by keeping a list of
crawled URLs in an external Bloom filter. You could ask Markus Jelsma to
confirm it.
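For readers unfamiliar with the idea, here is a minimal in-memory sketch of a Bloom filter over URLs. It is only an illustration under stated assumptions: the class UrlBloomFilter, its methods, and the hashing scheme are hypothetical and are not taken from Nutch or from Markus's setup, and an "external" filter would live outside the fetcher process rather than in a local BitSet.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical, simplified Bloom filter for URLs; not Nutch code.
final class UrlBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of bit positions set per URL

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a over the URL bytes, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int hash = 0x811c9dc5;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Record a URL as crawled.
    void add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(data, i));
        }
    }

    // False means "definitely not seen"; true means "probably seen".
    boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A Bloom filter never reports a crawled URL as unseen, but it can occasionally report an unseen URL as crawled, so a crawler using one would skip a small fraction of new URLs in exchange for a compact memory footprint.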
On Sat, Sep 1, 2012 at 2:05 PM, Vijith <vi...@gmail.com> wrote:
> Thanks a lot Feng. I will try the same...
--
Don't Grow Old, Grow Up... :-)
Re: Some questions regarding NUTCH-1150
Posted by Vijith <vi...@gmail.com>.
Thanks a lot Feng. I will try the same...
On Sat, Sep 1, 2012 at 7:36 AM, feng lu <am...@gmail.com> wrote:
> Hi Vijith
>
> It only happens when fetcher.parse is true and
> fetcher.follow.outlinks.depth is greater than 0. When two URLs (A, B)
> point to the same URL (C), that URL will be fetched twice. I think you could
> deduplicate URL (C) in the handleRedirect function in Fetcher.java.
--
*. . . . . thanks & regards*
*Vijith V.*
Re: Some questions regarding NUTCH-1150
Posted by feng lu <am...@gmail.com>.
Hi Vijith
It only happens when fetcher.parse is true and
fetcher.follow.outlinks.depth is greater than 0. When two URLs (A, B) point
to the same URL (C), that URL will be fetched twice. I think you could
deduplicate URL (C) in the handleRedirect function in Fetcher.java.
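The deduplication idea could be sketched roughly as follows. This is a minimal illustration, not the actual Fetcher.java code: the class RedirectDeduplicator and its method shouldFetch are hypothetical names, and how such a check would be wired into handleRedirect is left open.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Nutch: remembers target URLs
// that have already been queued during the current fetch run.
final class RedirectDeduplicator {
    // Thread-safe set, since several fetcher threads may call this at once.
    private final Set<String> seenTargets = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given target URL is offered.
    boolean shouldFetch(String targetUrl) {
        return seenTargets.add(targetUrl);
    }

    public static void main(String[] args) {
        RedirectDeduplicator dedup = new RedirectDeduplicator();
        // A points to C: C has not been seen yet, so it should be fetched.
        System.out.println(dedup.shouldFetch("http://example.com/C")); // prints true
        // B points to C: C is already queued, so it should be skipped.
        System.out.println(dedup.shouldFetch("http://example.com/C")); // prints false
    }
}
```

Note that an in-memory set like this only sees URLs handled inside one JVM, so it would not catch duplicates across separate fetch tasks.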
On Fri, Aug 31, 2012 at 8:39 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> No hassle Vijith
>
> Thank you
>
> Lewis
--
Don't Grow Old, Grow Up... :-)
Re: Some questions regarding NUTCH-1150
Posted by Lewis John Mcgibbney <le...@gmail.com>.
No hassle Vijith
Thank you
Lewis
On Fri, Aug 31, 2012 at 1:37 PM, Vijith <vi...@gmail.com> wrote:
> I apologize. I was sending to the mailing list without subscribing to it. I
> found the reply from Lewis in the archive. I will comment directly on the
> issue. Thanks.
--
Lewis
Re: Some questions regarding NUTCH-1150
Posted by Vijith <vi...@gmail.com>.
I apologize. I was sending to the mailing list without subscribing to it. I
found the reply from Lewis in the archive. I will comment directly on the
issue. Thanks.
On Fri, Aug 31, 2012 at 5:59 PM, Vijith <vi...@gmail.com> wrote:
> Hi all,
>
> (Please ignore my previous mail, if any)
>
> I am new to dev... I am working on NUTCH-1150...
> https://issues.apache.org/jira/browse/NUTCH-1150
> I would like some direction before I start. Right now I am
> going through the Fetcher.java code.
>
> I have tried running Nutch against a sample site where two different URLs
> redirect to a common resource.
> I could not find any indication in hadoop.log that the common resource is
> parsed multiple times.
> Could someone please explain the exact scenario that triggers this bug?
>
> And how does this bug relate to NUTCH-1184?
>
> --
> *Vijith V.*
>
>
>
--
*. . . . . thanks & regards*
*Vijith V.*