Posted to dev@nutch.apache.org by Kristan Uccello <ku...@gmail.com> on 2005/11/29 22:57:17 UTC
How to hack the config?
Hello all,
I am attempting to modify the RegexUrlFilter and/or the NutchConfig so
that I may dynamically apply a set of domain names to the fetcher.
In the FAQ:
>>Is it possible to fetch only pages from some specific domains?
>>Please have a look on PrefixURLFilter. Adding some regular
expressions to the urlfilter.regex.file might work, but adding a list
with thousands of regular expressions would slow down your system
excessively.
I wish to be able to provide a list of urls that I want to have
fetched, and I want the fetcher to fetch only from those sites (not
follow any links out of them). I would like to be able to keep adding to
this list without having to modify the nutch-config.xml each time, but
instead just add it to the config (or other object) in memory. All I am
after is a pointer in the right direction. If I am looking in the wrong
files (or off my rocker!) please let me know where I could/should go.
The reason I am asking this is that I am working on a "roll your own
search". I want to be able to crawl specific sites only, and then, in
the search results, get search results pertaining only to some subset of
those crawled sites.
Best regards,
Kristan Uccello
Re: How to hack the config?
Posted by Kristan Uccello <ku...@gmail.com>.
Hi Matt,
I can see you and I are thinking along similar lines, as I did think to
add a custom field during indexing. However, I rejected this thought
because it does not scale very well. For example, say I did add a custom
field during indexing (call it setID), and for site foo.com I had a set
of sites to index (a.com, b.com, c.com [setID=1]). This would work great,
but I would also have bar.com, which needed to index sites as well (a.com,
d.com, e.com [setID=2]). Now in this case a.com would be indexed twice
(once for each setID), or, if duplicate removal was used, it might lose
its original identity (setID=1) in place of its new identity (setID=2).
That would mean setID=1 loses its index of site a.com.
This is very close to a many-to-many relationship problem encountered
regularly in database design. I'm looking for how to achieve this
relationship inside the rules of nutch.
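One possible way out, for what it's worth: Lucene fields can be multi-valued, so a.com's document could carry both setID=1 and setID=2 rather than one value overwriting the other. A Lucene-free sketch of the membership lookup that such a field would encode (all names and table contents here are hypothetical, not part of Nutch):

```java
import java.util.*;

public class SetMembership {
    // host -> every setID it belongs to; a multi-valued "setID" field
    // would be built from exactly this table at index time
    static final Map<String, Set<Integer>> MEMBERSHIP = new HashMap<>();
    static {
        MEMBERSHIP.put("a.com", new HashSet<>(Arrays.asList(1, 2))); // shared site
        MEMBERSHIP.put("b.com", new HashSet<>(Arrays.asList(1)));
        MEMBERSHIP.put("d.com", new HashSet<>(Arrays.asList(2)));
    }

    // would a search restricted to setId still find this host?
    static boolean inSet(String host, int setId) {
        return MEMBERSHIP.getOrDefault(host, Collections.emptySet()).contains(setId);
    }
}
```

With a multi-valued setID field, searches restricted to setID=1 and to setID=2 would each still find a.com, so neither set loses it to deduplication.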
Cheers,
Kristan Uccello
Matt Kangas wrote:
> (i'm moving this to nutch-user, so we don't piss off the nutch-dev
> folks.)
>
> a few ideas:
> - if you only want to match one site at a time, you can just add
> "site:xxx" to the query. the "site" field exists in the index by default
> - if you want to assign ids to clusters of sites, you can do the
> site->id lookup at index time and add a custom field to the index
>
> if you want to do the latter, you need to write custom indexer +
> query plugins. it's not hard, but the documentation is somewhat poor.
>
> i'd suggest looking at the source for the plugins "index-more" and
> "query-more". those two plugins add support for "type:" and "date:"
> queries.
>
> --matt
>
> On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:
>
>> Hi Matt,
>>
>> thank you for the reply. I will play with what you suggested, as it
>> sounds pretty close to what I am after. The only caveat is that all
>> the websites that I will be hosting are running from one source so I
>> am after a way that I can dynamically inject a filter based on what
>> site is calling the searcher. I plan to have nutch running on one
>> web server and access it through its rss REST and my thought was to
>> provide an additional REST parameter like ...&callerID=xxx and have
>> nutch load the appropriate class that would filter the results to a
>> list of domains (stored in a file or database or something) based on
>> the callerID value as the key.
>>
>> I'm pretty new to nutch (only working with it for a few days), so
>> any guidance you can give me will be greatly appreciated.
>>
>> Cheers,
>>
>> Kristan Uccello
>>
>> Matt Kangas wrote:
>>
>>> (note: this is probably more relevant to nutch-user. please send
>>> replies there.)
>>>
>>> This question seems to come up periodically.
>>>
>>> Personally, I accomplish this via a custom URLFilter that uses a
>>> MapFile of regex pattern-lists, e.g. one set of regexes per
>>> website. You can find the code in
>>> http://issues.apache.org/jira/browse/NUTCH-87
>>>
>>> All this does is allow you to keep track of a large set of
>>> regexes, partitioned by site. It's useful if you want an
>>> extremely-focused crawl, possibly burrowing through CGIs.
>>>
>>> If instead you want to crawl entire sites, but ignoring CGIs is
>>> OK, then PrefixURLFilter is the easiest answer. Create a
>>> newline-delimited text file with the site URLs and use this as both seed
>>> urls (nutch inject -urlfile) and as the prefixurlfilter config
>>> file (set "urlfilter.prefix.file" in nutch-site.xml).
>>>
>>> HTH,
>>> --Matt
>>>
>>> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am attempting to modify the RegexUrlFilter and/or the
>>>> NutchConfig so that I may dynamically apply a set of domain names
>>>> to the fetcher.
>>>>
>>>> In the FAQ:
>>>>
>>>>
>>>> >>Is it possible to fetch only pages from some specific
>>>> domains?
>>>>
>>>> >>Please have a look on PrefixURLFilter. Adding some regular
>>>> expressions to the urlfilter.regex.file might work, but adding a
>>>> list with thousands of regular expressions would slow down your
>>>> system excessively.
>>>>
>>>>
>>>> I wish to be able to provide a list of urls that I want to have
>>>> fetched, and I want the fetcher to fetch only from those sites (not
>>>> follow any links out of them). I would like to be able to
>>>> keep adding to this list without having to modify the
>>>> nutch-config.xml each time, but instead just add it to the config
>>>> (or other object) in memory. All I am after is a pointer in the
>>>> right direction. If I am looking in the wrong
>>>> files (or off my rocker!) please let me know where I could/should go.
>>>>
>>>> The reason I am asking this is that I am working on a "roll your
>>>> own search". I want to be able to crawl specific sites only, and
>>>> then, in the search results, get search results pertaining only
>>>> to some subset of those crawled sites.
>>>>
>>>> Best regards,
>>>>
>>>> Kristan Uccello
>>>>
>>>
>>> --
>>> Matt Kangas / kangas@gmail.com
>>>
>>>
>>>
>>
>
> --
> Matt Kangas / kangas@gmail.com
>
>
>
Re: How to hack the config?
Posted by Bud Witney <wi...@osu.edu>.
How would one go about this ----->>>>> "During indexing, your
indexing filter could add a field named "sitecluster""?
Could I create a field called "region" and apply it to sites based on
their location? If so, how? Can this be tweaked in the config file?
-Bud
On Nov 30, 2005, at 12:29 PM, Andy Lee wrote:
> On Nov 30, 2005, at 1:20 AM, Matt Kangas wrote:
>> - if you only want to match one site at a time, you can just add
>> "site:xxx" to the query. the "site" field exists in the index by
>> default
>
> Note that the index-basic indexing filter does not tokenize the
> "site" field, so if you do "site:salami.com" you will only match
> URLs whose host component exactly matches the value you give --
> http://salami.com/etc and ftp://salami.com/etc but NOT
> http://www.salami.com. This may or may not be what you want.
>
>> - if you want to assign ids to clusters of sites, you can do the
>> site->id lookup at index time and add a custom field to the index
>
> This is one way to address the above issue. During indexing, your
> indexing filter could add a field named "sitecluster" (or
> whatever), and for all the above URLs (and anything else you want
> to cluster with them) you would set "salami.com" as the value of
> that field. Then your search would be "sitecluster:salami.com".
>
> Another approach would be to search not on the "site" field but the
> "url" field, which *is* tokenized at indexing time. So
> "url:salami" would find all the salami URLs above, as well as
> http://www2.salami.com and http://www.salami.org and
> http://salami.lunch.com -- which again may or may not be what you want.
>
> --Andy
>
Re: How to hack the config?
Posted by Andy Lee <ag...@earthlink.net>.
On Nov 30, 2005, at 1:20 AM, Matt Kangas wrote:
> - if you only want to match one site at a time, you can just add
> "site:xxx" to the query. the "site" field exists in the index by
> default
Note that the index-basic indexing filter does not tokenize the
"site" field, so if you do "site:salami.com" you will only match URLs
whose host component exactly matches the value you give --
http://salami.com/etc and ftp://salami.com/etc but NOT
http://www.salami.com. This may or may not be what you want.
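The exact-match behaviour follows from the field holding the raw host string. A small illustration in plain Java of the host comparison the untokenized field implies (a sketch, not Nutch code):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SiteFieldDemo {
    // The "site" field stores the URL's host component verbatim, so a
    // "site:" query behaves like an exact string comparison against it.
    static String siteField(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null; // not a parseable URL, no site field
        }
    }
}
```

So "site:salami.com" matches only documents whose stored host is literally "salami.com", never "www.salami.com".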
> - if you want to assign ids to clusters of sites, you can do the
> site->id lookup at index time and add a custom field to the index
This is one way to address the above issue. During indexing, your
indexing filter could add a field named "sitecluster" (or whatever),
and for all the above URLs (and anything else you want to cluster
with them) you would set "salami.com" as the value of that field.
Then your search would be "sitecluster:salami.com".
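A minimal sketch of that index-time site->id lookup (the table and all names are hypothetical; in a real plugin this would live inside an indexing filter and its result would be written into the custom "sitecluster" field):

```java
import java.util.HashMap;
import java.util.Map;

public class SiteCluster {
    // host -> cluster name; in practice this table would be loaded
    // from a config file rather than hard-coded
    static final Map<String, String> CLUSTERS = new HashMap<>();
    static {
        CLUSTERS.put("salami.com", "salami.com");
        CLUSTERS.put("www.salami.com", "salami.com");
        CLUSTERS.put("www2.salami.com", "salami.com");
    }

    // unknown hosts simply cluster by themselves
    static String clusterFor(String host) {
        return CLUSTERS.getOrDefault(host, host);
    }
}
```

At index time each document's host is mapped through clusterFor and the result stored untokenized, so "sitecluster:salami.com" finds every alias at once.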
Another approach would be to search not on the "site" field but the
"url" field, which *is* tokenized at indexing time. So "url:salami"
would find all the salami URLs above, as well as
http://www2.salami.com and http://www.salami.org and
http://salami.lunch.com -- which again may or may not be what you want.
--Andy
Re: How to hack the config?
Posted by Matt Kangas <ka...@gmail.com>.
(i'm moving this to nutch-user, so we don't piss off the nutch-dev
folks.)
a few ideas:
- if you only want to match one site at a time, you can just add
"site:xxx" to the query. the "site" field exists in the index by default
- if you want to assign ids to clusters of sites, you can do the
site->id lookup at index time and add a custom field to the index
if you want to do the latter, you need to write custom indexer +
query plugins. it's not hard, but the documentation is somewhat poor.
i'd suggest looking at the source for the plugins "index-more" and
"query-more". those two plugins add support for "type:" and "date:"
queries.
--matt
On Nov 29, 2005, at 7:16 PM, Kristan Uccello wrote:
> Hi Matt,
>
> thank you for the reply. I will play with what you suggested, as it
> sounds pretty close to what I am after. The only caveat is that all
> the websites that I will be hosting are running from one source so
> I am after a way that I can dynamically inject a filter based on
> what site is calling the searcher. I plan to have nutch running on
> one web server and access it through its rss REST and my thought
> was to provide an additional REST parameter like ...&callerID=xxx
> and have nutch load the appropriate class that would filter the
> results to a list of domains (stored in a file or database or
> something) based on the callerID value as the key.
>
> I'm pretty new to nutch (only working with it for a few days), so
> any guidance you can give me will be greatly appreciated.
>
> Cheers,
>
> Kristan Uccello
>
> Matt Kangas wrote:
>
>> (note: this is probably more relevant to nutch-user. please send
>> replies there.)
>>
>> This question seems to come up periodically.
>>
>> Personally, I accomplish this via a custom URLFilter that uses a
>> MapFile of regex pattern-lists, e.g. one set of regexes per
>> website. You can find the code in
>> http://issues.apache.org/jira/browse/NUTCH-87
>>
>> All this does is allow you to keep track of a large set of
>> regexes, partitioned by site. It's useful if you want an
>> extremely-focused crawl, possibly burrowing through CGIs.
>>
>> If instead you want to crawl entire sites, but ignoring CGIs is
>> OK, then PrefixURLFilter is the easiest answer. Create a
>> newline-delimited text file with the site URLs and use this as both seed
>> urls (nutch inject -urlfile) and as the prefixurlfilter config
>> file (set "urlfilter.prefix.file" in nutch-site.xml).
>>
>> HTH,
>> --Matt
>>
>> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>>
>>> Hello all,
>>>
>>> I am attempting to modify the RegexUrlFilter and/or the
>>> NutchConfig so that I may dynamically apply a set of domain
>>> names to the fetcher.
>>>
>>> In the FAQ:
>>>
>>>
>>> >>Is it possible to fetch only pages from some
>>> specific domains?
>>>
>>> >>Please have a look on PrefixURLFilter. Adding some regular
>>> expressions to the urlfilter.regex.file might work, but adding a
>>> list with thousands of regular expressions would slow down your
>>> system excessively.
>>>
>>>
>>> I wish to be able to provide a list of urls that I want to have
>>> fetched, and I want the fetcher to fetch only from those sites
>>> (not follow any links out of them). I would like to be
>>> able to keep adding to this list without having to modify the
>>> nutch-config.xml each time, but instead just add it to the config
>>> (or other object) in memory. All I am after is a pointer in the
>>> right direction. If I am looking in the
>>> wrong files (or off my rocker!) please let me know where I
>>> could/should go.
>>>
>>> The reason I am asking this is that I am working on a "roll your
>>> own search". I want to be able to crawl specific sites only, and
>>> then, in the search results, get search results pertaining only
>>> to some subset of those crawled sites.
>>>
>>> Best regards,
>>>
>>> Kristan Uccello
>>>
>>
>> --
>> Matt Kangas / kangas@gmail.com
>>
>>
>>
>
--
Matt Kangas / kangas@gmail.com
Re: How to hack the config?
Posted by Kristan Uccello <ku...@gmail.com>.
Hi Matt,
thank you for the reply. I will play with what you suggested, as it
sounds pretty close to what I am after. The only caveat is that all the
websites that I will be hosting are running from one source, so I am
after a way to dynamically inject a filter based on which site is
calling the searcher. I plan to have nutch running on one web server and
access it through its rss REST interface, and my thought was to provide
an additional REST parameter like ...&callerID=xxx and have nutch load
the appropriate class that would filter the results to a list of domains
(stored in a file or database or something) keyed on the callerID value.
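A rough sketch of that callerID-keyed result filter (all names and the in-memory table are hypothetical; the table would really be loaded from the file or database mentioned above):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.*;

public class CallerFilter {
    // callerID -> hosts whose results that caller may see; in practice
    // loaded from a file or database keyed on the callerID value
    static final Map<String, Set<String>> CALLER_DOMAINS = new HashMap<>();
    static {
        CALLER_DOMAINS.put("foo", new HashSet<>(Arrays.asList("a.com", "b.com")));
    }

    // drop every hit whose host is outside the caller's domain list
    static List<String> filter(String callerId, List<String> hitUrls) {
        Set<String> allowed = CALLER_DOMAINS.getOrDefault(callerId, Collections.emptySet());
        List<String> kept = new ArrayList<>();
        for (String u : hitUrls) {
            try {
                if (allowed.contains(new URL(u).getHost())) kept.add(u);
            } catch (MalformedURLException e) {
                // unparseable result URL: treat as not visible
            }
        }
        return kept;
    }
}
```

The filter would run server-side over each page of search hits before the RSS response is rendered, so an unknown callerID simply sees no results.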
I'm pretty new to nutch (only working with it for a few days), so any
guidance you can give me will be greatly appreciated.
Cheers,
Kristan Uccello
Matt Kangas wrote:
> (note: this is probably more relevant to nutch-user. please send
> replies there.)
>
> This question seems to come up periodically.
>
> Personally, I accomplish this via a custom URLFilter that uses a
> MapFile of regex pattern-lists, e.g. one set of regexes per website.
> You can find the code in http://issues.apache.org/jira/browse/NUTCH-87
>
> All this does is allow you to keep track of a large set of regexes,
> partitioned by site. It's useful if you want an extremely-focused
> crawl, possibly burrowing through CGIs.
>
> If instead you want to crawl entire sites, but ignoring CGIs is OK,
> then PrefixURLFilter is the easiest answer. Create a
> newline-delimited text file with the site URLs and use this as both seed urls
> (nutch inject -urlfile) and as the prefixurlfilter config file (set
> "urlfilter.prefix.file" in nutch-site.xml).
>
> HTH,
> --Matt
>
> On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
>
>> Hello all,
>>
>> I am attempting to modify the RegexUrlFilter and/or the NutchConfig
>> so that I may dynamically apply a set of domain names to the fetcher.
>>
>> In the FAQ:
>>
>>
>> >>Is it possible to fetch only pages from some specific
>> domains?
>>
>> >>Please have a look on PrefixURLFilter. Adding some regular
>> expressions to the urlfilter.regex.file might work, but adding a
>> list with thousands of regular expressions would slow down your
>> system excessively.
>>
>>
>> I wish to be able to provide a list of urls that I want to have
>> fetched, and I want the fetcher to fetch only from those sites (not
>> follow any links out of them). I would like to be able to keep
>> adding to this list without having to modify the nutch-config.xml
>> each time, but instead just add it to the config (or other object) in
>> memory. All I am after is a pointer in the right direction. If I am
>> looking in the wrong files (or off my rocker!)
>> please let me know where I could/should go.
>>
>> The reason I am asking this is that I am working on a "roll your own
>> search". I want to be able to crawl specific sites only, and then,
>> in the search results, get search results pertaining only to some
>> subset of those crawled sites.
>>
>> Best regards,
>>
>> Kristan Uccello
>>
>
> --
> Matt Kangas / kangas@gmail.com
>
>
>
Re: How to hack the config?
Posted by Matt Kangas <ka...@gmail.com>.
(note: this is probably more relevant to nutch-user. please send
replies there.)
This question seems to come up periodically.
Personally, I accomplish this via a custom URLFilter that uses a
MapFile of regex pattern-lists, e.g. one set of regexes per website.
You can find the code in http://issues.apache.org/jira/browse/NUTCH-87
All this does is allow you to keep track of a large set of regexes,
partitioned by site. It's useful if you want an extremely-focused
crawl, possibly burrowing through CGIs.
If instead you want to crawl entire sites, but ignoring CGIs is OK,
then PrefixURLFilter is the easiest answer. Create a
newline-delimited text file with the site URLs and use this as both seed urls
(nutch inject -urlfile) and as the prefixurlfilter config file (set
"urlfilter.prefix.file" in nutch-site.xml).
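Spelled out, this recipe needs only the plain text file plus one property override (the file path below is just an example):

```xml
<!-- nutch-site.xml: point PrefixURLFilter at the same newline-delimited
     file of site URLs that was fed to "nutch inject -urlfile".
     The file contains one URL prefix per line, e.g.
       http://www.example.com/
     (example.com is a placeholder). -->
<property>
  <name>urlfilter.prefix.file</name>
  <value>/path/to/urls.txt</value>
</property>
```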
HTH,
--Matt
On Nov 29, 2005, at 4:57 PM, Kristan Uccello wrote:
> Hello all,
>
> I am attempting to modify the RegexUrlFilter and/or the NutchConfig
> so that I may dynamically apply a set of domain names to the fetcher.
>
> In the FAQ:
>
>
> >>Is it possible to fetch only pages from some specific
> domains?
>
> >>Please have a look on PrefixURLFilter. Adding some regular
> expressions to the urlfilter.regex.file might work, but adding a
> list with thousands of regular expressions would slow down your
> system excessively.
>
>
> I wish to be able to provide a list of urls that I want to have
> fetched, and I want the fetcher to fetch only from those sites (not
> follow any links out of them). I would like to be able to
> keep adding to this list without having to modify the
> nutch-config.xml each time, but instead just add it to the config (or
> other object) in memory. All I am after is a pointer in the right
> direction. If I am looking in the wrong
> files (or off my rocker!) please let me know where I could/should go.
>
> The reason I am asking this is that I am working on a "roll your
> own search". I want to be able to crawl specific sites only, and
> then, in the search results, get search results pertaining only to
> some subset of those crawled sites.
>
> Best regards,
>
> Kristan Uccello
>
--
Matt Kangas / kangas@gmail.com