Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/11/28 19:14:11 UTC

Very large filter lists

Hi,

Anyone used URL filters containing up to a million rows? In our case this
would be only 25MB, so heap space is no problem (unless the data is not
shared between threads). Will it perform?

Thanks,

Re: Very large filter lists

Posted by Markus Jelsma <ma...@openindex.io>.
This was actually not about a regex filter, at least not from my point of
view; it seems I wasn't clear.

Anyway, it works well. Instead of a filter we built a normalizer that takes
a large file and uses a HashMap for key look-ups.
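
In outline it looks like the sketch below (illustrative only: the class
name and the tab-separated file format are made up for the example, not
the actual code):

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.HashMap;
  import java.util.Map;

  /** Look-up based normalizer: one O(1) HashMap get per URL instead of
      matching against a long list of regexes. */
  public class LookupNormalizer {

    private final Map<String, String> rules = new HashMap<String, String>();

    public LookupNormalizer(String rulesFile) throws IOException {
      BufferedReader in = new BufferedReader(
          new InputStreamReader(new FileInputStream(rulesFile), "UTF-8"));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          // Assumed format: <key><TAB><replacement>; '#' starts a comment.
          if (line.length() == 0 || line.charAt(0) == '#') continue;
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) rules.put(parts[0], parts[1]);
        }
      } finally {
        in.close();
      }
    }

    /** Returns the normalized form, or the input unchanged if no rule matches. */
    public String normalize(String key) {
      String replacement = rules.get(key);
      return replacement != null ? replacement : key;
    }
  }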

Cheers


On Wednesday 30 November 2011 17:19:44 Lewis John Mcgibbney wrote:
> Yes, I was interested in seeing if this issue has any traction and whether
> there is any interest in kick-starting it.
> 
> From Kirby's original comments on the issue, on the face of it, it looks
> like it would be really useful to you guys doing LARGE crawls.
> 
> > On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <ki...@gmail.com> wrote:
> > Julien,
> > 
> > 
> > 
> > On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
> > 
> > <li...@gmail.com> wrote:
> > > That would be a good thing to benchmark. IIRC there is a JIRA about
> > > improvements to the Finite State library we use; it would be good to
> > > see the impact of the patch. The regex-urlfilter will probably take
> > > more memory and be much slower.
> > 
> > https://issues.apache.org/jira/browse/NUTCH-1068
> > 
> > Pretty sure that is the JIRA item you are discussing. Still not sure
> > what to do with the Automaton library; I don't think the maintainer has
> > integrated any of the performance improvements from Lucene.
> > 
> > Kirby
> > 
> > > Julien
> > > 
> > > On 28 November 2011 18:14, Markus Jelsma <ma...@openindex.io> wrote:
> > >> Hi,
> > >> 
> > >> Anyone used URL filters containing up to a million rows? In our case
> > >> this would be only 25MB, so heap space is no problem (unless the data
> > >> is not shared between threads). Will it perform?
> > >> 
> > >> Thanks,
> > > 
> > > --
> > > Open Source Solutions for Text Engineering
> > > 
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com

-- 
Markus Jelsma - CTO - Openindex

Re: Very large filter lists

Posted by Markus Jelsma <ma...@openindex.io>.

On Monday 05 December 2011 18:37:25 Markus Jelsma wrote:
> We use Bloom filters as well, but instead of a domain filter, for which a
> Bloom filter would be a good choice, we have a subdomain normalizer. We
> need to look up a key and get something back.
> 
> Now, I've checked the code again and both normalizers and filters are
> instantiated in each thread. This consumes significant additional heap space.
> 
> Are there any objections to sharing them between threads? I assume things
> will get a lot slower. Or could I just share the HashMap between instances?
> Suggestions?

Well, I remembered some pieces of Java concurrency. The map is now static
final, and the method that builds the structure is synchronized and checks
whether it has to rebuild the map. Seems to run fine. It is a plain HashMap
because it is read-only, so there is no need for a ConcurrentHashMap.
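
Every thread calls the synchronized builder before its first look-up; that
pass through the lock is also what makes the later unsynchronized reads
safe. A minimal sketch of the pattern (names illustrative, not the actual
code):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  public class SharedRules {

    // One shared map instead of one copy per fetcher thread.
    private static final Map<String, String> RULES =
        new HashMap<String, String>();
    private static boolean loaded = false;

    /** Builds the structure once. synchronized prevents a double build
        and, because every thread passes through here before reading, it
        also safely publishes the fully built map to all of them. */
    public static synchronized void loadIfNeeded(String file)
        throws IOException {
      if (loaded) return;
      BufferedReader in = new BufferedReader(new FileReader(file));
      try {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t", 2);
          if (parts.length == 2) RULES.put(parts[0], parts[1]);
        }
      } finally {
        in.close();
      }
      loaded = true;
    }

    /** Unsynchronized read. A plain HashMap is fine here only because the
        map is never modified after loadIfNeeded() completes. */
    public static String lookup(String key) {
      return RULES.get(key);
    }
  }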

> 
> This is about a custom fetcher that does parsing and outlink processing as
> well.
> 
> On Wednesday 30 November 2011 22:41:58 Andrzej Bialecki wrote:
> > There's an implementation of a Bloom filter in Hadoop. Since the number
> > of items is known in advance, it's possible to pick the right size of
> > the filter to keep the error rate at an acceptable level.
> > 
> > One trick that you may consider when using Bloom filters is to have an
> > additional list of exceptions, i.e. common items that give false
> > positives. If you properly balance the size of the filter and the size
> > of the exception list you can still keep the total size of the structure
> > down while improving the error rate.

-- 
Markus Jelsma - CTO - Openindex

Re: Very large filter lists

Posted by Markus Jelsma <ma...@openindex.io>.
We use Bloom filters as well, but instead of a domain filter, for which a
Bloom filter would be a good choice, we have a subdomain normalizer. We
need to look up a key and get something back.

Now, I've checked the code again and both normalizers and filters are
instantiated in each thread. This consumes significant additional heap space.

Are there any objections to sharing them between threads? I assume things
will get a lot slower. Or could I just share the HashMap between instances?
Suggestions?

This is about a custom fetcher that does parsing and outlink processing as 
well.

On Wednesday 30 November 2011 22:41:58 Andrzej Bialecki wrote:
> There's an implementation of a Bloom filter in Hadoop. Since the number
> of items is known in advance, it's possible to pick the right size of
> the filter to keep the error rate at an acceptable level.
> 
> One trick that you may consider when using Bloom filters is to have an 
> additional list of exceptions, i.e. common items that give false 
> positives. If you properly balance the size of the filter and the size 
> of the exception list you can still keep the total size of the structure 
> down while improving the error rate.

-- 
Markus Jelsma - CTO - Openindex

Re: Very large filter lists

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 30/11/2011 22:00, Ken Krugler wrote:
> Normally when I see a 1M entry URL filter, it's doing domain-level filtering.
>
> If that's the case, I'd use a BloomFilter, which has worked well for us in the past during large-scale crawls.

There's an implementation of a Bloom filter in Hadoop. Since the number of
items is known in advance, it's possible to pick the right size of the
filter to keep the error rate at an acceptable level.

One trick that you may consider when using Bloom filters is to have an 
additional list of exceptions, i.e. common items that give false 
positives. If you properly balance the size of the filter and the size 
of the exception list you can still keep the total size of the structure 
down while improving the error rate.
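
In code, the combination looks roughly like the sketch below (the wrapper
class is made up for illustration; BloomFilter, Key and Hash are Hadoop's
org.apache.hadoop.util.bloom and org.apache.hadoop.util.hash classes, and
the sizing formulas are the standard Bloom filter ones):

  import java.io.UnsupportedEncodingException;
  import java.util.HashSet;
  import java.util.Set;

  import org.apache.hadoop.util.bloom.BloomFilter;
  import org.apache.hadoop.util.bloom.Key;
  import org.apache.hadoop.util.hash.Hash;

  public class DomainBloom {

    private final BloomFilter filter;
    // Known false positives, checked after the filter says "maybe".
    private final Set<String> exceptions = new HashSet<String>();

    /** n = expected number of items, p = target false positive rate. */
    public DomainBloom(Iterable<String> domains, int n, double p)
        throws UnsupportedEncodingException {
      // Standard sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
      int m = (int) Math.ceil(
          -n * Math.log(p) / (Math.log(2) * Math.log(2)));
      int k = Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
      filter = new BloomFilter(m, k, Hash.MURMUR_HASH);
      for (String d : domains) {
        filter.add(new Key(d.getBytes("UTF-8")));
      }
    }

    /** Register a common item known to give a false positive. */
    public void addException(String domain) {
      exceptions.add(domain);
    }

    /** True if probably in the set and not a known false positive. */
    public boolean contains(String domain)
        throws UnsupportedEncodingException {
      return filter.membershipTest(new Key(domain.getBytes("UTF-8")))
          && !exceptions.contains(domain);
    }
  }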

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Very large filter lists

Posted by Ken Krugler <kk...@transpac.com>.
Normally when I see a 1M entry URL filter, it's doing domain-level filtering.

If that's the case, I'd use a BloomFilter, which has worked well for us in the past during large-scale crawls.
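
Something like the sketch below (illustrative: the DomainSet interface
stands in for whatever membership structure you use, e.g. a Bloom filter
as described above; returning null to drop a URL follows the usual Nutch
URL filter convention):

  import java.net.MalformedURLException;
  import java.net.URL;

  public class DomainLevelFilter {

    /** Stand-in for the 1M-entry membership structure. */
    public interface DomainSet {
      boolean contains(String domain);
    }

    private final DomainSet blocked;

    public DomainLevelFilter(DomainSet blocked) {
      this.blocked = blocked;
    }

    /** Return the URL to keep it, null to drop it. */
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        return blocked.contains(host) ? null : urlString;
      } catch (MalformedURLException e) {
        return null; // drop URLs we cannot parse
      }
    }
  }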

-- Ken

On Nov 30, 2011, at 8:19am, Lewis John Mcgibbney wrote:

> Yes, I was interested in seeing if this issue has any traction and whether
> there is any interest in kick-starting it.
> 
> From Kirby's original comments on the issue, on the face of it, it looks
> like it would be really useful to you guys doing LARGE crawls.
> 
> On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <ki...@gmail.com> wrote:
> 
>> Julien,
>> 
>> 
>> 
>> On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
>> <li...@gmail.com> wrote:
>>> That would be a good thing to benchmark. IIRC there is a JIRA about
>>> improvements to the Finite State library we use; it would be good to see
>>> the impact of the patch. The regex-urlfilter will probably take more
>>> memory and be much slower.
>>> 
>> 
>> https://issues.apache.org/jira/browse/NUTCH-1068
>> 
>> Pretty sure that is the JIRA item you are discussing. Still not sure
>> what to do with the Automaton library; I don't think the maintainer has
>> integrated any of the performance improvements from Lucene.
>> 
>> Kirby
>> 
>> 
>>> Julien
>>> 
>>> On 28 November 2011 18:14, Markus Jelsma <ma...@openindex.io>
>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Anyone used URL filters containing up to a million rows? In our case
>>>> this would be only 25MB, so heap space is no problem (unless the data
>>>> is not shared between threads). Will it perform?
>>>> 
>>>> Thanks,
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Open Source Solutions for Text Engineering
>>> 
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> 
>> 
> 
> 
> 
> -- 
> Lewis

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Very large filter lists

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Yes, I was interested in seeing if this issue has any traction and whether
there is any interest in kick-starting it.

From Kirby's original comments on the issue, on the face of it, it looks
like it would be really useful to you guys doing LARGE crawls.

On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <ki...@gmail.com> wrote:

> Julien,
>
>
>
> On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
> <li...@gmail.com> wrote:
> > That would be a good thing to benchmark. IIRC there is a JIRA about
> > improvements to the Finite State library we use; it would be good to see
> > the impact of the patch. The regex-urlfilter will probably take more
> > memory and be much slower.
> >
>
> https://issues.apache.org/jira/browse/NUTCH-1068
>
> Pretty sure that is the JIRA item you are discussing. Still not sure
> what to do with the Automaton library; I don't think the maintainer has
> integrated any of the performance improvements from Lucene.
>
> Kirby
>
>
> > Julien
> >
> > On 28 November 2011 18:14, Markus Jelsma <ma...@openindex.io>
> wrote:
> >
> >> Hi,
> >>
> >> Anyone used URL filters containing up to a million rows? In our case
> >> this would be only 25MB, so heap space is no problem (unless the data
> >> is not shared between threads). Will it perform?
> >>
> >> Thanks,
> >>
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>



-- 
Lewis

Re: Very large filter lists

Posted by Kirby Bohling <ki...@gmail.com>.
Julien,



On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
<li...@gmail.com> wrote:
> That would be a good thing to benchmark. IIRC there is a JIRA about
> improvements to the Finite State library we use; it would be good to see
> the impact of the patch. The regex-urlfilter will probably take more
> memory and be much slower.
>

https://issues.apache.org/jira/browse/NUTCH-1068

Pretty sure that is the JIRA item you are discussing. Still not sure
what to do with the Automaton library; I don't think the maintainer has
integrated any of the performance improvements from Lucene.

Kirby


> Julien
>
> On 28 November 2011 18:14, Markus Jelsma <ma...@openindex.io> wrote:
>
>> Hi,
>>
>> Anyone used URL filters containing up to a million rows? In our case this
>> would be only 25MB, so heap space is no problem (unless the data is not
>> shared between threads). Will it perform?
>>
>> Thanks,
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

Re: Very large filter lists

Posted by Julien Nioche <li...@gmail.com>.
That would be a good thing to benchmark. IIRC there is a JIRA about
improvements to the Finite State library we use; it would be good to see
the impact of the patch. The regex-urlfilter will probably take more
memory and be much slower.

Julien

On 28 November 2011 18:14, Markus Jelsma <ma...@openindex.io> wrote:

> Hi,
>
> Anyone used URL filters containing up to a million rows? In our case this
> would be only 25MB, so heap space is no problem (unless the data is not
> shared between threads). Will it perform?
>
> Thanks,
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com