Posted to dev@nutch.apache.org by Kristan Uccello <ku...@gmail.com> on 2005/11/29 22:57:17 UTC

How to hack the config?

Hello all,

I am attempting to modify the RegexUrlFilter and/or the NutchConfig so 
that I may dynamically apply a set of domain names to the fetcher.

In the FAQ:


>> Is it possible to fetch only pages from some specific domains?
>>
>> Please have a look at PrefixURLFilter. Adding some regular
>> expressions to the urlfilter.regex.file might work, but adding a list
>> with thousands of regular expressions would slow down your system
>> excessively.


I wish to be able to provide a list of urls that I want to have 
fetched, and I want the fetcher to fetch only from those sites (not 
follow any links out of those sites). I would like to be able to keep 
adding to this list without having to modify nutch-config.xml each 
time, and instead just add it to the config (or another object) in 
memory. All I am after is a pointer in the right direction. If I am 
looking in the wrong files (or am off my rocker!), please let me know 
where I could/should go.

The reason I am asking this is that I am working on a "roll your own 
search". I want to be able to crawl specific sites only, and then have 
the search results pertain only to some subset of those crawled sites.

Best regards,

Kristan Uccello


Re: How to hack the config?

Posted by Kristan Uccello <ku...@gmail.com>.
Hi Matt,

I can see you and I are thinking along similar lines, as I did think to 
add a custom field during indexing. However, I rejected this thought 
because it does not scale very well. For example, say I added a custom 
field during indexing (call it setID), and for site foo.com I had a set 
of sites to index (a.com, b.com, c.com [setID=1]). This would work 
great, but I would also have bar.com, which needed to index sites as 
well (a.com, d.com, e.com [setID=2]). In this case a.com would be 
indexed twice (once for each setID), or, if duplicate removal were 
used, it might lose its original identity (setID=1) in place of its new 
identity (setID=2). That would mean setID=1 would lose its index of 
site a.com.

This is very close to a many-to-many relationship problem encountered 
regularly in database design. I'm looking for how to achieve this 
relationship within the rules of nutch.
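
The one escape hatch I can think of: if Lucene allows a document to 
carry several fields with the same name, then a.com could be indexed 
once under both setIDs and neither identity would be lost. A minimal, 
untested sketch of what I mean (using a lowercase "setid" field name 
and the Lucene 1.4-era Field.Keyword API):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class MultiSetIdExample {
      public static void main(String[] args) {
        // the index document for some page on a.com
        Document doc = new Document();
        // a.com belongs to foo.com's set (1) AND bar.com's set (2), so
        // add one setid field per set. Lucene keeps both values, and a
        // query on setid:1 or on setid:2 would each match this document.
        doc.add(Field.Keyword("setid", "1"));
        doc.add(Field.Keyword("setid", "2"));
        System.out.println(doc);
      }
    }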

Cheers,

Kristan Uccello

Matt Kangas wrote:

> (i'm moving this to nutch-user, so we don't piss off the nutch-dev  
> folks.)
>
> a few ideas:
> - if you only want to match one site at a time, you can just add  
> "site:xxx" to the query. the "site" field exists in the index by default
> - if you want to assign ids to clusters of sites, you can do the
> site->id lookup at index time and add a custom field to the index
>
> if you want to do the latter, you need to write custom indexer +  
> query plugins. it's not hard, but the documentation is somewhat poor.
>
> i'd suggest looking at the source for the plugins "index-more" and  
> "query-more". those two plugins add support for "type:" and "date:"  
> queries.
>
> --matt
>
> On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:
>
>> Hi Matt,
>>
>> thank you for the reply. I will play with what you suggested, as it
>> sounds pretty close to what I am after. The only caveat is that all
>> the websites that I will be hosting are running from one source, so I
>> am after a way to dynamically inject a filter based on what site is
>> calling the searcher. I plan to have nutch running on one web server
>> and access it through its rss REST, and my thought was to provide an
>> additional REST parameter like ...&callerID=xxx and have nutch load
>> the appropriate class that would filter the results to a list of
>> domains (stored in a file or database or something), based on the
>> callerID value as the key.
>>
>> I'm pretty new to nutch (only working with it for a few days), so
>> any guidance you can give me will be greatly appreciated.
>>
>> Cheers,
>>
>> Kristan Uccello
>>
>> Matt Kangas wrote:
>>
>>> (note: this is probably more relevant to nutch-user. please send   
>>> replies there.)
>>>
>>> This question seems to come up periodically.
>>>
>>> Personally, I accomplish this via a custom URLFilter that uses a
>>> MapFile of regex pattern-lists, e.g. one set of regexes per
>>> website. You can find the code in
>>> http://issues.apache.org/jira/browse/NUTCH-87
>>>
>>> All this does is allow you to keep track of a large set of
>>> regexes, partitioned by site. It's useful if you want an
>>> extremely-focused crawl, possibly burrowing through CGIs.
>>>
>>> If instead you want to crawl entire sites, but ignoring CGIs is
>>> OK, then PrefixURLFilter is the easiest answer. Create a
>>> newline-delimited text file with the site URLs and use this as both
>>> seed urls (nutch inject -urlfile) and as the prefixurlfilter config
>>> file (set "urlfilter.prefix.file" in nutch-site.xml).
>>>
>>> HTH,
>>> --Matt
>>>
>>> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am attempting to modify the RegexUrlFilter and/or the
>>>> NutchConfig so that I may dynamically apply a set of domain names
>>>> to the fetcher.
>>>>
>>>> In the FAQ:
>>>>
>>>> >> Is it possible to fetch only pages from some specific domains?
>>>> >>
>>>> >> Please have a look at PrefixURLFilter. Adding some regular
>>>> >> expressions to the urlfilter.regex.file might work, but adding a
>>>> >> list with thousands of regular expressions would slow down your
>>>> >> system excessively.
>>>>
>>>> I wish to be able to provide a list of urls that I want to have
>>>> fetched, and I want the fetcher to fetch only from those sites
>>>> (not follow any links out of those sites). I would like to be able
>>>> to keep adding to this list without having to modify
>>>> nutch-config.xml each time, and instead just add it to the config
>>>> (or another object) in memory. All I am after is a pointer in the
>>>> right direction. If I am looking in the wrong files (or am off my
>>>> rocker!), please let me know where I could/should go.
>>>>
>>>> The reason I am asking this is that I am working on a "roll your
>>>> own search". I want to be able to crawl specific sites only, and
>>>> then have the search results pertain only to some subset of those
>>>> crawled sites.
>>>>
>>>> Best regards,
>>>>
>>>> Kristan Uccello
>>>>
>>>
>>> -- 
>>> Matt Kangas / kangas@gmail.com
>>>
>>>
>>>
>>
>
> -- 
> Matt Kangas / kangas@gmail.com
>
>
>


Re: How to hack the config?

Posted by Bud Witney <wi...@osu.edu>.
How would one go about this: "During indexing, your indexing filter 
could add a field named 'sitecluster'"? Could I create a field called 
"region" and apply it to sites based on their location? If so, how? 
Can this be tweaked in the config file?

-Bud


On Nov 30, 2005, at 12:29 PM, Andy Lee wrote:

> On Nov 30, 2005, at 1:20 AM, Matt Kangas wrote:
>> - if you only want to match one site at a time, you can just add  
>> "site:xxx" to the query. the "site" field exists in the index by  
>> default
>
> Note that the index-basic indexing filter does not tokenize the
> "site" field, so if you do "site:salami.com" you will only match
> URLs whose host component exactly matches the value you give --
> http://salami.com/etc and ftp://salami.com/etc but NOT
> http://www.salami.com.  This may or may not be what you want.
>
>> - if you want to assign ids to clusters of sites, you can do the
>> site->id lookup at index time and add a custom field to the index
>
> This is one way to address the above issue.  During indexing, your  
> indexing filter could add a field named "sitecluster" (or  
> whatever), and for all the above URLs (and anything else you want  
> to cluster with them) you would set "salami.com" as the value of  
> that field.  Then your search would be "sitecluster:salami.com".
>
> Another approach would be to search not on the "site" field but the
> "url" field, which *is* tokenized at indexing time.  So
> "url:salami" would find all the salami URLs above, as well as
> http://www2.salami.com and http://www.salami.org and
> http://salami.lunch.com -- which again may or may not be what you want.
>
> --Andy
>


Re: How to hack the config?

Posted by Andy Lee <ag...@earthlink.net>.
On Nov 30, 2005, at 1:20 AM, Matt Kangas wrote:
> - if you only want to match one site at a time, you can just add  
> "site:xxx" to the query. the "site" field exists in the index by  
> default

Note that the index-basic indexing filter does not tokenize the 
"site" field, so if you do "site:salami.com" you will only match URLs 
whose host component exactly matches the value you give -- 
http://salami.com/etc and ftp://salami.com/etc but NOT 
http://www.salami.com.  This may or may not be what you want.

> - if you want to assign ids to clusters of sites, you can do the
> site->id lookup at index time and add a custom field to the index

This is one way to address the above issue.  During indexing, your  
indexing filter could add a field named "sitecluster" (or whatever),  
and for all the above URLs (and anything else you want to cluster  
with them) you would set "salami.com" as the value of that field.   
Then your search would be "sitecluster:salami.com".
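
In sketch form, the indexing side might look something like this 
(untested, and hedged: the exact IndexingFilter signature differs 
across Nutch versions, so this shows only the field-adding logic, with 
a hypothetical host-to-cluster lookup):

    import java.net.URL;
    import java.util.Properties;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class SiteClusterIndexer {
      // hypothetical host -> cluster map, e.g. loaded from a file with
      // lines like:
      //   www.salami.com=salami.com
      //   salami.com=salami.com
      private final Properties clusterByHost;

      public SiteClusterIndexer(Properties clusterByHost) {
        this.clusterByHost = clusterByHost;
      }

      /** Adds a "sitecluster" field when the page's host is in a cluster. */
      public Document addCluster(Document doc, String urlString)
          throws Exception {
        String host = new URL(urlString).getHost().toLowerCase();
        String cluster = clusterByHost.getProperty(host);
        if (cluster != null) {
          // stored, indexed, untokenized -- the same treatment the
          // index-basic plugin gives the "site" field
          doc.add(Field.Keyword("sitecluster", cluster));
        }
        return doc;
      }
    }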

Another approach would be to search not on the "site" field but the 
"url" field, which *is* tokenized at indexing time.  So "url:salami" 
would find all the salami URLs above, as well as http://www2.salami.com 
and http://www.salami.org and http://salami.lunch.com -- which again 
may or may not be what you want.

--Andy


Re: How to hack the config?

Posted by Matt Kangas <ka...@gmail.com>.
(i'm moving this to nutch-user, so we don't piss off the nutch-dev  
folks.)

a few ideas:
- if you only want to match one site at a time, you can just add  
"site:xxx" to the query. the "site" field exists in the index by default
- if you want to assign ids to clusters of sites, you can do the 
site->id lookup at index time and add a custom field to the index

if you want to do the latter, you need to write custom indexer +  
query plugins. it's not hard, but the documentation is somewhat poor.

i'd suggest looking at the source for the plugins "index-more" and  
"query-more". those two plugins add support for "type:" and "date:"  
queries.
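
for a taste of the query half, something in the spirit of those 
plugins (untested sketch -- the class name and the "setid" field are 
made up here, and you'd still need the usual plugin.xml wiring):

    import org.apache.nutch.searcher.RawFieldQueryFilter;

    // lets users write "setid:1" in a query; the clause is matched
    // raw (untokenized) against a setid field added at index time.
    public class SetIdQueryFilter extends RawFieldQueryFilter {
      public SetIdQueryFilter() {
        super("setid");
      }
    }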

--matt

On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:

> Hi Matt,
>
> thank you for the reply. I will play with what you suggested, as it
> sounds pretty close to what I am after. The only caveat is that all
> the websites that I will be hosting are running from one source, so
> I am after a way to dynamically inject a filter based on what site
> is calling the searcher. I plan to have nutch running on one web
> server and access it through its rss REST, and my thought was to
> provide an additional REST parameter like ...&callerID=xxx and have
> nutch load the appropriate class that would filter the results to a
> list of domains (stored in a file or database or something), based
> on the callerID value as the key.
>
> I'm pretty new to nutch (only working with it for a few days), so
> any guidance you can give me will be greatly appreciated.
>
> Cheers,
>
> Kristan Uccello
>
> Matt Kangas wrote:
>
>> (note: this is probably more relevant to nutch-user. please send   
>> replies there.)
>>
>> This question seems to come up periodically.
>>
>> Personally, I accomplish this via a custom URLFilter that uses a
>> MapFile of regex pattern-lists, e.g. one set of regexes per
>> website. You can find the code in
>> http://issues.apache.org/jira/browse/NUTCH-87
>>
>> All this does is allow you to keep track of a large set of
>> regexes, partitioned by site. It's useful if you want an
>> extremely-focused crawl, possibly burrowing through CGIs.
>>
>> If instead you want to crawl entire sites, but ignoring CGIs is
>> OK, then PrefixURLFilter is the easiest answer. Create a
>> newline-delimited text file with the site URLs and use this as both
>> seed urls (nutch inject -urlfile) and as the prefixurlfilter config
>> file (set "urlfilter.prefix.file" in nutch-site.xml).
>>
>> HTH,
>> --Matt
>>
>> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>>
>>> Hello all,
>>>
>>> I am attempting to modify the RegexUrlFilter and/or the
>>> NutchConfig so that I may dynamically apply a set of domain names
>>> to the fetcher.
>>>
>>> In the FAQ:
>>>
>>> >> Is it possible to fetch only pages from some specific domains?
>>> >>
>>> >> Please have a look at PrefixURLFilter. Adding some regular
>>> >> expressions to the urlfilter.regex.file might work, but adding a
>>> >> list with thousands of regular expressions would slow down your
>>> >> system excessively.
>>>
>>> I wish to be able to provide a list of urls that I want to have
>>> fetched, and I want the fetcher to fetch only from those sites
>>> (not follow any links out of those sites). I would like to be able
>>> to keep adding to this list without having to modify
>>> nutch-config.xml each time, and instead just add it to the config
>>> (or another object) in memory. All I am after is a pointer in the
>>> right direction. If I am looking in the wrong files (or am off my
>>> rocker!), please let me know where I could/should go.
>>>
>>> The reason I am asking this is that I am working on a "roll your
>>> own search". I want to be able to crawl specific sites only, and
>>> then have the search results pertain only to some subset of those
>>> crawled sites.
>>>
>>> Best regards,
>>>
>>> Kristan Uccello
>>>
>>
>> -- 
>> Matt Kangas / kangas@gmail.com
>>
>>
>>
>

--
Matt Kangas / kangas@gmail.com



Re: How to hack the config?

Posted by Kristan Uccello <ku...@gmail.com>.
Hi Matt,

thank you for the reply. I will play with what you suggested, as it 
sounds pretty close to what I am after. The only caveat is that all the 
websites that I will be hosting are running from one source, so I am 
after a way to dynamically inject a filter based on what site is 
calling the searcher. I plan to have nutch running on one web server 
and access it through its rss REST, and my thought was to provide an 
additional REST parameter like ...&callerID=xxx and have nutch load the 
appropriate class that would filter the results to a list of domains 
(stored in a file or database or something), based on the callerID 
value as the key.
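
Concretely, the glue I picture on the search front end is something 
like this (completely untested, and every name in it is a placeholder 
of mine):

    import java.util.Properties;

    public class CallerFilter {
      // callerID -> set id, loaded at startup from a file or database
      private final Properties setIdsByCaller;

      public CallerFilter(Properties setIdsByCaller) {
        this.setIdsByCaller = setIdsByCaller;
      }

      /** Constrains a query to the caller's site set, if one is registered. */
      public String constrain(String callerId, String query) {
        String setId = (callerId == null)
            ? null : setIdsByCaller.getProperty(callerId);
        // assumes a custom "setid" field and query plugin exist in the index
        return (setId == null) ? query : query + " setid:" + setId;
      }
    }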

I'm pretty new to nutch (only working with it for a few days), so any 
guidance you can give me will be greatly appreciated.

Cheers,

Kristan Uccello

Matt Kangas wrote:

> (note: this is probably more relevant to nutch-user. please send  
> replies there.)
>
> This question seems to come up periodically.
>
> Personally, I accomplish this via a custom URLFilter that uses a  
> MapFile of regex pattern-lists, e.g. one set of regexes per website.  
> You can find the code in http://issues.apache.org/jira/browse/NUTCH-87
>
> All this does is allow you to keep track of a large set of regexes,  
> partitioned by site. It's useful if you want an extremely-focused  
> crawl, possibly burrowing through CGIs.
>
> If instead you want to crawl entire sites, but ignoring CGIs is OK,
> then PrefixURLFilter is the easiest answer. Create a newline-delimited
> text file with the site URLs and use this as both seed urls
> (nutch inject -urlfile) and as the prefixurlfilter config file (set
> "urlfilter.prefix.file" in nutch-site.xml).
>
> HTH,
> --Matt
>
> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>
>> Hello all,
>>
>> I am attempting to modify the RegexUrlFilter and/or the NutchConfig
>> so that I may dynamically apply a set of domain names to the fetcher.
>>
>> In the FAQ:
>>
>> >> Is it possible to fetch only pages from some specific domains?
>> >>
>> >> Please have a look at PrefixURLFilter. Adding some regular
>> >> expressions to the urlfilter.regex.file might work, but adding a
>> >> list with thousands of regular expressions would slow down your
>> >> system excessively.
>>
>> I wish to be able to provide a list of urls that I want to have
>> fetched, and I want the fetcher to fetch only from those sites (not
>> follow any links out of those sites). I would like to be able to
>> keep adding to this list without having to modify nutch-config.xml
>> each time, and instead just add it to the config (or another object)
>> in memory. All I am after is a pointer in the right direction. If I
>> am looking in the wrong files (or am off my rocker!), please let me
>> know where I could/should go.
>>
>> The reason I am asking this is that I am working on a "roll your
>> own search". I want to be able to crawl specific sites only, and
>> then have the search results pertain only to some subset of those
>> crawled sites.
>>
>> Best regards,
>>
>> Kristan Uccello
>>
>
> -- 
> Matt Kangas / kangas@gmail.com
>
>
>


Re: How to hack the config?

Posted by Matt Kangas <ka...@gmail.com>.
(note: this is probably more relevant to nutch-user. please send  
replies there.)

This question seems to come up periodically.

Personally, I accomplish this via a custom URLFilter that uses a  
MapFile of regex pattern-lists, e.g. one set of regexes per website.  
You can find the code in http://issues.apache.org/jira/browse/NUTCH-87

All this does is allow you to keep track of a large set of regexes,  
partitioned by site. It's useful if you want an extremely-focused  
crawl, possibly burrowing through CGIs.
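
The URLFilter contract itself is tiny: a filter sees each candidate 
URL and returns it to keep it, or null to drop it. A stripped-down 
sketch of the shape, with a plain in-memory set standing in for the 
patch's MapFile bookkeeping:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.nutch.net.URLFilter;

    public class SiteSetURLFilter implements URLFilter {
      private final Set prefixes = new HashSet();

      public SiteSetURLFilter() {
        // in NUTCH-87 this table lives in a MapFile, partitioned by
        // site, so it can grow large without being re-read constantly
        prefixes.add("http://a.com/");
        prefixes.add("http://b.com/");
      }

      // URLFilter contract: return the url to accept it, null to reject
      public String filter(String urlString) {
        for (Iterator i = prefixes.iterator(); i.hasNext();) {
          if (urlString.startsWith((String) i.next())) {
            return urlString;
          }
        }
        return null;
      }
    }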

If instead you want to crawl entire sites, but ignoring CGIs is OK, 
then PrefixURLFilter is the easiest answer. Create a newline-delimited 
text file with the site URLs and use this as both seed urls 
(nutch inject -urlfile) and as the prefixurlfilter config file (set 
"urlfilter.prefix.file" in nutch-site.xml).

HTH,
--Matt

On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:

> Hello all,
>
> I am attempting to modify the RegexUrlFilter and/or the NutchConfig
> so that I may dynamically apply a set of domain names to the fetcher.
>
> In the FAQ:
>
> >> Is it possible to fetch only pages from some specific domains?
> >>
> >> Please have a look at PrefixURLFilter. Adding some regular
> >> expressions to the urlfilter.regex.file might work, but adding a
> >> list with thousands of regular expressions would slow down your
> >> system excessively.
>
> I wish to be able to provide a list of urls that I want to have
> fetched, and I want the fetcher to fetch only from those sites (not
> follow any links out of those sites). I would like to be able to
> keep adding to this list without having to modify nutch-config.xml
> each time, and instead just add it to the config (or another object)
> in memory. All I am after is a pointer in the right direction. If I
> am looking in the wrong files (or am off my rocker!), please let me
> know where I could/should go.
>
> The reason I am asking this is that I am working on a "roll your
> own search". I want to be able to crawl specific sites only, and
> then have the search results pertain only to some subset of those
> crawled sites.
>
> Best regards,
>
> Kristan Uccello
>

--
Matt Kangas / kangas@gmail.com