Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/03/15 20:50:31 UTC

Searching only a whitelist (country specific SE)

Hi All,

We're merrily proceeding down our route of a country-specific search
engine, and Nutch seems to be working well.  However, we're finding some
sites creeping in that aren't from our country.  Specifically, we
automatically allow in sites that are hosted within the country.  We're
finding more sites than we'd like hosted here that are actually
owned/operated in another country and thus not relevant.  I'd like to
get rid of these if I can.

Is there a viable way to use Nutch 0.7 with only a whitelist of sites -
and a very large whitelist at that (say 500K to a million+ sites, all in
one whitelist)?  If not, is it possible in Nutch 0.8?  That way I can
just find other ways of adding known-to-be-good sites into the whitelist
over time.

(fwiw, we automatically allow our specific country TLD, then for
.com/.net/.org we only allow if the site is physically hosted here by
checking an IP list.  If other country search engine folks have comments
on a better way to do this, I'd welcome the input.)
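
Roughly, that rule looks like the following sketch (Java; the .ca TLD
and the IP prefixes are placeholders for illustration - the real check
uses a full registry-derived IP list):

    import java.net.InetAddress;
    import java.net.URL;
    import java.util.List;

    public class CountryAllowRule {

        // Placeholder data: a real deployment would load the country's
        // complete IP allocations, e.g. as published by the regional
        // internet registry.
        private static final List<String> COUNTRY_IP_PREFIXES =
                List.of("24.", "64.59.", "206.47.");

        /** Allow the country TLD outright; for .com/.net/.org, allow
         *  only if the host resolves to an in-country IP. Everything
         *  else is rejected. */
        public static boolean allow(String url) {
            try {
                String host = new URL(url).getHost().toLowerCase();
                if (host.endsWith(".ca")) {   // placeholder country TLD
                    return true;
                }
                if (host.endsWith(".com") || host.endsWith(".net")
                        || host.endsWith(".org")) {
                    String ip = InetAddress.getByName(host).getHostAddress();
                    return COUNTRY_IP_PREFIXES.stream().anyMatch(ip::startsWith);
                }
                return false;
            } catch (Exception e) {
                return false;   // unparsable or unresolvable -> reject
            }
        }
    }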



Re: Searching only a whitelist (country specific SE)

Posted by Matt Kangas <ka...@gmail.com>.
Aha, I see the misunderstanding. PrefixURLFilter uses a class called  
TrieStringMatcher. A "trie" is a data structure that stores an  
associative array with string-valued keys. Keys are stored in a  
decomposed form: an ordered tree of the key string's characters.

http://en.wikipedia.org/wiki/Trie

The actual class used is:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/util/TrieStringMatcher.html

For a URL of length K and N patterns to test against, at most K
character tests are performed, independent of N. Performance should be
similar to a HashMap: not exactly O(1), but close. With 500k domains to
match against, PrefixURLFilter should still be reasonably good.

This is definitely not the case for RegexURLFilter, 'tho. ;)
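
To make the cost model concrete, here is a stripped-down prefix trie (a
sketch of the idea only, not the actual TrieStringMatcher source):
lookup touches at most one node per input character, no matter how many
patterns were inserted.

    import java.util.HashMap;
    import java.util.Map;

    public class PrefixTrie {

        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean terminal;   // a complete pattern ends at this node
        }

        private final Node root = new Node();

        /** Insert one whitelist pattern, one node per character. */
        public void addPrefix(String pattern) {
            Node n = root;
            for (char c : pattern.toCharArray()) {
                n = n.children.computeIfAbsent(c, k -> new Node());
            }
            n.terminal = true;
        }

        /** True if any stored pattern is a prefix of url. Cost: at most
         *  url.length() map lookups, independent of the pattern count. */
        public boolean matchesPrefix(String url) {
            Node n = root;
            for (int i = 0; i < url.length(); i++) {
                if (n.terminal) return true;   // a pattern ended here
                n = n.children.get(url.charAt(i));
                if (n == null) return false;   // no pattern continues
            }
            return n.terminal;   // pattern equal to the whole url
        }
    }

A side effect is that half a million entries sharing heads like
"http://www." store each shared head only once, which is the RAM point
from earlier in the thread.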


On Mar 20, 2006, at 3:41 AM, TDLN wrote:

> Yes, a HashMap underlies the cache.
>
> Isn't the main difference that, in the case of the PrefixURLFilter,
> the URL to be tested is matched against every single URL pattern in
> the regex-urlfilter file (500k patterns in this case)? Only if one
> uses the cache route is it a lookup of a single entry. If the entry is
> found, the URL passes. I can imagine this is actually much faster.
>
> In my particular scenario, I am also storing a "category" with every
> permitted domain in the database. The category is stored as the value
> in the HashMap and used to add a category field to the index.
>
> Rgrds, Thomas

--
Matt Kangas / kangas@gmail.com



Re: Searching only a whitelist (country specific SE)

Posted by TDLN <di...@gmail.com>.
Yes, a HashMap underlies the cache.

Isn't the main difference that, in the case of the PrefixURLFilter, the
URL to be tested is matched against every single URL pattern in the
regex-urlfilter file (500k patterns in this case)? Only if one uses the
cache route is it a lookup of a single entry. If the entry is found, the
URL passes. I can imagine this is actually much faster.

In my particular scenario, I am also storing a "category" with every
permitted domain in the database. The category is stored as the value in
the HashMap and used to add a category field to the index.
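
Schematically, the cache is just (a sketch - the Nutch plugin wiring and
the database load are omitted):

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    public class DomainCategoryCache {

        // host -> category, loaded once from the database
        private final Map<String, String> permitted = new HashMap<>();

        public void put(String host, String category) {
            permitted.put(host.toLowerCase(), category);
        }

        /** Null means the URL is rejected; a non-null result is the
         *  category to add as a field when the page is indexed. */
        public String categoryFor(String url) {
            try {
                return permitted.get(new URL(url).getHost().toLowerCase());
            } catch (Exception e) {
                return null;   // malformed URL -> reject
            }
        }
    }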

Rgrds, Thomas



On 3/19/06, Matt Kangas <ka...@gmail.com> wrote:
>
> I'm still curious how this compares to PrefixURLFilter. If you go the
> "load all domains" route, I don't see why you wouldn't just dump the
> DB data into a flat text file and feed this to PrefixURLFilter.
>
> (Also, the trie underlying PrefixURLFilter should consume less RAM
> than the hashmap presumably underlying your cache, while still
> delivering similar lookup speed. But perhaps I'm wrong?)
>
> --Matt

Re: Searching only a whitelist (country specific SE)

Posted by Matt Kangas <ka...@gmail.com>.
I'm still curious how this compares to PrefixURLFilter. If you go the  
"load all domains" route, I don't see why you wouldn't just dump the  
DB data into a flat text file and feed this to PrefixURLFilter.

(Also, the trie underlying PrefixURLFilter should consume less RAM  
than the hashmap presumably underlying your cache, while still  
delivering similar lookup speed. But perhaps I'm wrong?)

--Matt

On Mar 19, 2006, at 1:09 PM, TDLN wrote:

> I agree with you. That was a bold statement, not necessarily backed  
> up by
> any hard evidence that I can provide you with.
>
> The DBUrlFilter can be adapted, though, so that it loads all domains in
> the database into the cache only once. On a cache miss, the plugin no
> longer goes to the database but simply rejects the URL. The only thing
> to watch is making the cache big enough to hold every domain in the
> database.
>
> In that case the DBUrlFilter performs better, but I have no comparison
> against the PrefixURLFilter.
>
> Rgrds, Thomas

--
Matt Kangas / kangas@gmail.com



Re: Searching only a whitelist (country specific SE)

Posted by TDLN <di...@gmail.com>.
I agree with you. That was a bold statement, not necessarily backed up by
any hard evidence that I can provide you with.

The DBUrlFilter can be adapted, though, so that it loads all domains in
the database into the cache only once. On a cache miss, the plugin no
longer goes to the database but simply rejects the URL. The only thing to
watch is making the cache big enough to hold every domain in the database.
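
In outline, the adapted filter would be (a sketch; the table and column
names are hypothetical):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.HashSet;
    import java.util.Set;

    public class PreloadedWhitelist {

        private final Set<String> domains = new HashSet<>();

        /** Load every permitted domain exactly once, at startup. */
        public PreloadedWhitelist(Connection db) throws SQLException {
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT domain FROM whitelist")) {   // hypothetical schema
                while (rs.next()) {
                    domains.add(rs.getString(1).toLowerCase());
                }
            }
        }

        /** A miss is now a plain rejection - no round trip back to the
         *  database. */
        public boolean accepts(String host) {
            return domains.contains(host.toLowerCase());
        }
    }

Very roughly, 500k short domain strings plus HashSet overhead comes to
somewhere in the tens of megabytes of heap, so "big enough" is mostly a
-Xmx question.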

In that case the DBUrlFilter performs better, but I have no comparison
against the PrefixURLFilter.

Rgrds, Thomas


On 3/19/06, Matt Kangas <ka...@gmail.com> wrote:
>
> I'm curious how this "performs better than PrefixURLFilter".
> Management, yes, but performance? According to the description on
> NUTCH-100, you go to the database for every cache miss. This implies
> that filter hits are cheap, whereas misses are expensive. (tcp/ip
> roundtrip, etc)
>
> Can you please explain?
>
> --Matt

Re: Searching only a whitelist (country specific SE)

Posted by Matt Kangas <ka...@gmail.com>.
I'm curious how this "performs better than PrefixURLFilter".  
Management, yes, but performance? According to the description on  
NUTCH-100, you go to the database for every cache miss. This implies  
that filter hits are cheap, whereas misses are expensive. (tcp/ip  
roundtrip, etc)

Can you please explain?

--Matt

On Mar 19, 2006, at 3:13 AM, TDLN wrote:

> There's the DBUrlFilter as well, which stores the whitelist in a
> database:
> http://issues.apache.org/jira/browse/NUTCH-100
>
> It performs better than the PrefixURLFilter and also makes managing
> the list easier.
>
> Rgrds, Thomas

--
Matt Kangas / kangas@gmail.com



Re: Searching only a whitelist (country specific SE)

Posted by TDLN <di...@gmail.com>.
There's the DBUrlFilter as well, which stores the whitelist in a database:
http://issues.apache.org/jira/browse/NUTCH-100

It performs better than the PrefixURLFilter and also makes managing the
list easier.

Rgrds, Thomas

On 3/15/06, Matt Kangas <ka...@gmail.com> wrote:
>
> For a large whitelist filtered by hostname, you should use
> PrefixURLFilter. (built into 0.7)
>
> If you wanted to apply regex rules to the paths of these sites, you
> could use my WhitelistURLFilter
> (http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like
> you don't quite need that.
>
> Cheers,
> --Matt

Re: Searching only a whitelist (country specific SE)

Posted by Matt Kangas <ka...@gmail.com>.
For a large whitelist filtered by hostname, you should use  
PrefixURLFilter. (built into 0.7)
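
If I remember the 0.7 configuration right (property and class names from
memory - please verify against your nutch-default.xml), it's one
property plus a pattern file:

    <!-- conf/nutch-site.xml: select the prefix filter. Names from
         memory; verify against nutch-default.xml. -->
    <property>
      <name>urlfilter.class</name>
      <value>org.apache.nutch.net.PrefixURLFilter</value>
    </property>

    conf/prefix-urlfilter.txt, one allowed URL prefix per line
    (the example hosts are placeholders):

    http://www.example.ca/
    http://www.example.com/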

If you wanted to apply regex rules to the paths of these sites, you
could use my WhitelistURLFilter
(http://issues.apache.org/jira/browse/NUTCH-87). But it sounds like you
don't quite need that.

Cheers,
--Matt

On Mar 15, 2006, at 2:50 PM, Insurance Squared Inc. wrote:

> Hi All,
>
> We're merrily proceeding down our route of a country-specific search
> engine, and Nutch seems to be working well.  However, we're finding
> some sites creeping in that aren't from our country.  Specifically, we
> automatically allow in sites that are hosted within the country.
> We're finding more sites than we'd like hosted here that are actually
> owned/operated in another country and thus not relevant.  I'd like to
> get rid of these if I can.
>
> Is there a viable way to use Nutch 0.7 with only a whitelist of sites
> - and a very large whitelist at that (say 500K to a million+ sites,
> all in one whitelist)?  If not, is it possible in Nutch 0.8?  That way
> I can just find other ways of adding known-to-be-good sites into the
> whitelist over time.
>
> (fwiw, we automatically allow our specific country TLD, then for
> .com/.net/.org we only allow if the site is physically hosted here by
> checking an IP list.  If other country search engine folks have
> comments on a better way to do this, I'd welcome the input.)

--
Matt Kangas / kangas@gmail.com