Posted to dev@nutch.apache.org by Vijith <vi...@gmail.com> on 2012/08/31 14:29:07 UTC

Some questions regarding NUTCH-1150

Hi all,

(Please ignore my previous mail, if any)

I am new to Nutch development and am working on NUTCH-1150:
https://issues.apache.org/jira/browse/NUTCH-1150
I would like some direction before I start. Right now I am
going through the Fetcher.java code.

I tried running Nutch against a sample site where two different URLs
redirect to a common resource.
I could not find any sign in hadoop.log that the common resource is
parsed multiple times.
Could someone please explain the exact scenario that triggers this bug?

And how does this bug relate to NUTCH-1184?

-- 
*Vijith V.*

Re: Some questions regarding NUTCH-1150

Posted by feng lu <am...@gmail.com>.
Hi Vijith

Markus Jelsma may already have solved this issue by keeping a list of
crawled URLs in an external Bloom filter. You could ask Markus to
confirm it.
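
For readers unfamiliar with the idea, here is a minimal sketch of a Bloom filter used as a "have we seen this URL" check. It is only an illustration: the class name UrlBloomFilter, the sizes, and the hashing scheme are assumptions made for this example, and it is not Markus's implementation or any Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical minimal Bloom filter for URLs (illustration only, not Nutch code). */
class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    UrlBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a-style 32-bit hash with a seed mixed into the initial state. */
    private static int hash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** Derives the i-th bit position from two base hashes (double hashing). */
    private int index(byte[] data, int i) {
        int h1 = hash(data, 0);
        int h2 = hash(data, 0x9747b28c);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Records the URL by setting all of its bit positions. */
    void add(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    /** False means definitely never added; true means probably added. */
    boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A Bloom filter never reports a URL it has stored as missing, but it can occasionally report an unseen URL as present, so a crawler using it this way would skip a small fraction of new URLs in exchange for a fixed memory footprint.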

-- 
Don't Grow Old, Grow Up... :-)

Re: Some questions regarding NUTCH-1150

Posted by Vijith <vi...@gmail.com>.
Thanks a lot Feng. I will try the same...

-- 
*. . . . . thanks & regards*

*Vijith V.*

Re: Some questions regarding NUTCH-1150

Posted by feng lu <am...@gmail.com>.
Hi  Vijith

This only happens when fetcher.parse is true and
fetcher.follow.outlinks.depth is greater than 0. When two URLs (A and B)
redirect to the same URL (C), C is fetched twice. I think you could
deduplicate URL C in the handleRedirect function in Fetcher.java.
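
The deduplication idea can be sketched roughly as follows. This is a minimal sketch, not the actual Nutch code: RedirectTargetTracker and markIfNew are hypothetical names invented for this example, and the real handleRedirect in Fetcher.java is not reproduced here. The assumption is that the fetcher would consult such a tracker before queueing a redirect target.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Nutch): remembers redirect targets already queued. */
class RedirectTargetTracker {
    // Concurrent set, since fetcher threads could resolve redirects in parallel.
    private final Set<String> seenTargets = ConcurrentHashMap.newKeySet();

    /** Returns true only the first time a given redirect target is offered. */
    boolean markIfNew(String targetUrl) {
        return seenTargets.add(targetUrl);
    }

    public static void main(String[] args) {
        RedirectTargetTracker tracker = new RedirectTargetTracker();
        String c = "http://example.com/C";
        // A redirects to C: first sighting, so C should be fetched.
        System.out.println("A -> C, fetch C? " + tracker.markIfNew(c));
        // B redirects to C: already seen, so the second fetch can be skipped.
        System.out.println("B -> C, fetch C? " + tracker.markIfNew(c));
    }
}
```

An in-memory set like this only covers redirects seen within one process. URLs would also need to be normalized before the lookup, otherwise trivially different spellings of C would still be fetched twice.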

-- 
Don't Grow Old, Grow Up... :-)

Re: Some questions regarding NUTCH-1150

Posted by Lewis John Mcgibbney <le...@gmail.com>.
No hassle Vijith

Thank you

Lewis

-- 
Lewis

Re: Some questions regarding NUTCH-1150

Posted by Vijith <vi...@gmail.com>.
I apologize. I was sending to the mailing list without being subscribed to it. I
found the reply from Lewis in the archive. I will comment directly on the
issue. Thanks.


-- 
*. . . . . thanks & regards*

*Vijith V.*