Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/13 18:09:40 UTC

http.redirect.max and duplicate fetch/parse

Hi,

With a > 0 value for http.redirect.max there's a possibility of fetching and 
parsing duplicates. This is especially true for fetch lists with many domains, 
even with just a few (10+) records per domain/host queue.

Assuming there's only one thread per queue, how can we use http.redirect.max 
and still prevent fetching and parsing duplicates?

I'm not a big fan of keeping a map of fetched records in memory, as it'll blow 
up the heap. We also cannot safely remove a record from the fetch queue, as 
the queue feeder may not have finished and duplicates may still enter a queue.
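
To make it concrete, the shape of the problem is roughly this (an invented 
illustration, not the actual Fetcher code; Response, fetch() and parse() are 
bare stand-ins):

public class RedirectSketch {

  static class Response {
    boolean redirect;   // was it a 3xx?
    String target;      // Location header, if any
  }

  Response fetch(String url) { return new Response(); } // stub HTTP GET
  void parse(Response r) {}                             // stub parser

  void fetchWithRedirects(String url, int maxRedirects) {
    for (int hops = 0; ; hops++) {
      Response r = fetch(url);
      if (!r.redirect || hops >= maxRedirects) {
        parse(r);
        return;
      }
      // The redirect target may also sit in this or another queue as
      // its own fetch item, so it gets fetched and parsed a second
      // time later. An "already fetched?" check belongs right here,
      // and that's where the in-memory map gets expensive.
      url = r.target;
    }
  }
}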

Any thoughts?

Thanks,
Markus

Re: http.redirect.max and duplicate fetch/parse

Posted by Markus Jelsma <ma...@openindex.io>.
> Actually, some KV storages use a Bloom filter for a similar purpose.
> 
> What is your queue size? And what is the redirect rate?

There are 2,500-5,000 domain queues per fetcher and 20,000-40,000 fetch items. 
We usually have around 8 URLs per domain. The redirect rate is quite low; it 
doesn't happen that often, so it's not a very big deal, just an inconvenience 
and something we might want to optimize.
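
For the numbers above, a Bloom filter would indeed be tiny. A minimal sketch, 
assuming Guava's BloomFilter is on the classpath (the class and method names 
here are my own invention):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class SeenUrls {
  // Sized for the ~40,000 fetch items above plus headroom for
  // redirect targets; at a 1% false-positive rate this stays well
  // under 100 KB.
  private final BloomFilter<String> fetched = BloomFilter.create(
      Funnels.stringFunnel(StandardCharsets.UTF_8),
      50000,    // expected insertions
      0.01);    // acceptable false-positive probability

  // True the first time a URL is seen; false afterwards (or on a
  // rare false positive, which just means one skipped URL).
  public boolean markFirstSeen(String url) {
    if (fetched.mightContain(url)) {
      return false;
    }
    fetched.put(url);
    return true;
  }
}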

> 
> If most redirects are not cross-domain and the average number of URLs
> per domain is not very big, a fixed-size cache in FetchItemQueue may
> help. But this leads to lots of changes in the fetcher.

I haven't seen cross-domain redirects yet, but it's possible. Just like false 
positives, this is something we could live with.

Thanks for sharing your thoughts.

> 
> On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:
> > That sounds creepy indeed. It would still need a similar amount of RAM
> > plus network overhead. Would a Bloom filter be useful at all? It takes a
> > lot less space and I can live with a non-deterministic approach.
> > 
> > On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
> >> Hi
> >> 
> >> I think some external key-value storage could replace the map. They
> >> are fast enough and the overhead will be insignificant (for many
> >> threads). But this is a very creepy solution.
> >> 
> >> Sergey Volkov.
> >> 
> >> On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
> >>> Anyone?
> >>> 
> >>>> Hi,
> >>>> 
> >>>> With a > 0 value for http.redirect.max there's a possibility of
> >>>> fetching and parsing duplicates. This is especially true for fetch
> >>>> lists with many domains, even with just a few (10+) records per
> >>>> domain/host queue.
> >>>> 
> >>>> Assuming there's only one thread per queue, how can we use
> >>>> http.redirect.max and still prevent fetching and parsing duplicates?
> >>>> 
> >>>> I'm not a big fan of keeping a map of fetched records in memory, as
> >>>> it'll blow up the heap. We also cannot safely remove a record from
> >>>> the fetch queue, as the queue feeder may not have finished and
> >>>> duplicates may still enter a queue.
> >>>> 
> >>>> Any thoughts?
> >>>> 
> >>>> Thanks,
> >>>> Markus

Re: http.redirect.max and duplicate fetch/parse

Posted by Sergey A Volkov <se...@gmail.com>.
Actually, some KV storages use a Bloom filter for a similar purpose.

What is your queue size? And what is the redirect rate?

If most redirects are not cross-domain and the average number of URLs 
per domain is not very big, a fixed-size cache in FetchItemQueue may 
help. But this leads to lots of changes in the fetcher.
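
Something like a small LRU set per queue, for example (a rough sketch only; 
nothing like this exists in FetchItemQueue today, and the names are invented):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical per-queue cache: remember the last N URLs fetched
// from this queue and evict the oldest entry when full.
public class RecentUrlCache {
  private static final int MAX_ENTRIES = 64;

  private final Set<String> recent = Collections.newSetFromMap(
      new LinkedHashMap<String, Boolean>(MAX_ENTRIES, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
          return size() > MAX_ENTRIES;
        }
      });

  // False if the URL was fetched recently from this queue.
  public boolean markFetched(String url) {
    return recent.add(url);
  }
}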

On Tue 18 Oct 2011 05:01:06 PM MSK, Markus Jelsma wrote:
> That sounds creepy indeed. It would still need a similar amount of RAM plus
> network overhead. Would a Bloom filter be useful at all? It takes a lot less
> space and I can live with a non-deterministic approach.
>
> On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
>> Hi
>>
>> I think some external key-value storage could replace the map. They are
>> fast enough and the overhead will be insignificant (for many threads).
>> But this is a very creepy solution.
>>
>> Sergey Volkov.
>>
>> On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
>>> Anyone?
>>>
>>>> Hi,
>>>>
>>>> With a > 0 value for http.redirect.max there's a possibility of
>>>> fetching and parsing duplicates. This is especially true for fetch
>>>> lists with many domains, even with just a few (10+) records per
>>>> domain/host queue.
>>>>
>>>> Assuming there's only one thread per queue, how can we use
>>>> http.redirect.max and still prevent fetching and parsing duplicates?
>>>>
>>>> I'm not a big fan of keeping a map of fetched records in memory, as
>>>> it'll blow up the heap. We also cannot safely remove a record from the
>>>> fetch queue, as the queue feeder may not have finished and duplicates
>>>> may still enter a queue.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Markus
>



Re: http.redirect.max and duplicate fetch/parse

Posted by Markus Jelsma <ma...@openindex.io>.
That sounds creepy indeed. It would still need a similar amount of RAM plus 
network overhead. Would a Bloom filter be useful at all? It takes a lot less 
space and I can live with a non-deterministic approach.
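
As a rough sizing check (standard Bloom filter arithmetic, with my own 
numbers): a filter needs about m = -n * ln(p) / (ln 2)^2 bits for n entries 
at false-positive rate p, so n = 40,000 and p = 0.01 give about 383,000 bits, 
i.e. under 50 KB, versus megabytes for a heap map of 40,000 full URLs.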

On Tuesday 18 October 2011 01:45:20 Sergey A Volkov wrote:
> Hi
> 
> I think some external key-value storage could replace the map. They are
> fast enough and the overhead will be insignificant (for many threads).
> But this is a very creepy solution.
> 
> Sergey Volkov.
> 
> On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
> > Anyone?
> > 
> >> Hi,
> >> 
> >> With a > 0 value for http.redirect.max there's a possibility of
> >> fetching and parsing duplicates. This is especially true for fetch
> >> lists with many domains, even with just a few (10+) records per
> >> domain/host queue.
> >> 
> >> Assuming there's only one thread per queue, how can we use
> >> http.redirect.max and still prevent fetching and parsing duplicates?
> >> 
> >> I'm not a big fan of keeping a map of fetched records in memory, as
> >> it'll blow up the heap. We also cannot safely remove a record from the
> >> fetch queue, as the queue feeder may not have finished and duplicates
> >> may still enter a queue.
> >> 
> >> Any thoughts?
> >> 
> >> Thanks,
> >> Markus

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: http.redirect.max and duplicate fetch/parse

Posted by Sergey A Volkov <se...@gmail.com>.
Hi

I think some external key-value storage could replace the map. They are fast 
enough and the overhead will be insignificant (for many threads). But this is 
a very creepy solution.
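
For example, something along these lines with a shared Redis set (purely 
illustrative; Jedis is just one possible client, and the key name is made up):

import redis.clients.jedis.Jedis;

// All fetcher tasks share one Redis set of fetched URLs, so the
// dedup state lives outside the JVM heap.
public class SharedSeenStore {
  private final Jedis jedis = new Jedis("localhost", 6379);

  // SADD returns 0 when the member is already in the set, so this
  // is an atomic check-and-mark in one round trip.
  public boolean markFirstSeen(String url) {
    return jedis.sadd("fetched-urls", url) == 1;
  }
}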

Sergey Volkov.
On Tue 18 Oct 2011 03:15:33 AM MSK, Markus Jelsma wrote:
> Anyone?
>
>> Hi,
>>
>> With a > 0 value for http.redirect.max there's a possibility of fetching
>> and parsing duplicates. This is especially true for fetch lists with many
>> domains, even with just a few (10+) records per domain/host queue.
>>
>> Assuming there's only one thread per queue, how can we use
>> http.redirect.max and still prevent fetching and parsing duplicates?
>>
>> I'm not a big fan of keeping a map of fetched records in memory, as it'll
>> blow up the heap. We also cannot safely remove a record from the fetch
>> queue, as the queue feeder may not have finished and duplicates may still
>> enter a queue.
>>
>> Any thoughts?
>>
>> Thanks,
>> Markus



Re: http.redirect.max and duplicate fetch/parse

Posted by Markus Jelsma <ma...@openindex.io>.
Anyone?

> Hi,
> 
> With a > 0 value for http.redirect.max there's a possibility of fetching
> and parsing duplicates. This is especially true for fetch lists with many
> domains, even with just a few (10+) records per domain/host queue.
> 
> Assuming there's only one thread per queue, how can we use
> http.redirect.max and still prevent fetching and parsing duplicates?
> 
> I'm not a big fan of keeping a map of fetched records in memory, as it'll
> blow up the heap. We also cannot safely remove a record from the fetch
> queue, as the queue feeder may not have finished and duplicates may still
> enter a queue.
> 
> Any thoughts?
> 
> Thanks,
> Markus