Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/01/05 14:58:44 UTC
Per-page crawling policy
Hi,
I've been toying with the following idea, which is an extension of the
existing URLFilter mechanism and the concept of a "crawl frontier".
Let's suppose we have several initial seed urls, each with a different
subjective quality. We would like to crawl these, and expand the
"crawling frontier" using outlinks. However, we don't want to do it
uniformly for every initial url, but rather propagate certain "crawling
policy" through the expanding trees of linked pages. This "crawling
policy" could consist of url filters, scoring methods, etc - basically
anything configurable in Nutch could be included in this "policy".
Perhaps it could even be the new version of non-static NutchConf ;-)
Then, if a given initial url is a known high-quality source, we would
like to apply a "favor" policy, where we e.g. add pages linked from that
url, and in doing so we give them a higher score. Recursively, we could
apply the same policy for the next generation pages, or perhaps only for
pages belonging to the same domain. So, in a sense the original notion
of high quality would cascade down to other linked pages. The important
aspect to note is that all newly discovered pages would be
subject to the same policy - unless we have compelling reasons to switch
the policy (from "favor" to "default" or to "distrust"), which at that
point would essentially change the shape of the expanding frontier.
If a given initial url is a known spammer, we would like to apply a
"distrust" policy for adding pages linked from that url (e.g. adding or
not adding, if adding then lowering their score, or applying different
score calculation). And recursively we could apply a similar policy of
"distrust" to any pages discovered this way. We could also change the
policy on the way, if there are compelling reasons to do so. This means
that we could follow some high-quality links from low-quality pages,
without drilling down the sites which are known to be of low quality.
Special care needs to be taken if the same page is discovered from pages
with different policies, I haven't thought about this aspect yet... ;-)
What would be the benefits of such an approach?
* the initial page + policy would both control the expanding crawling
frontier, and it could be differently defined for different starting
pages. I.e. in a single web database we could keep different
"collections" or "areas of interest" with differently specified
policies. But still we could reap the benefits of a single web db,
namely the link information.
* URLFilters could be grouped into several policies, and it would be
easy to switch between them, or edit them.
* if the crawl process realizes it ended up on a spam page, it can
switch the page policy to "distrust", or the other way around, and stop
crawling unwanted content. From now on the pages linked from that page
will follow the new policy. In other words, if a crawling frontier
reaches pages with known quality problems, it would be easy to change
the policy on-the-fly to avoid them or pages linked from them, without
resorting to modifications of URLFilters.
Some of the above you can do even now with URLFilters, but any change
you make now has global consequences. You may also end up with awfully
complicated rules if you try to cover all cases in one rule set.
How to implement it? Surprisingly, I think that it's very simple - just
adding a CrawlDatum.policyId field would suffice, assuming we have a
means to store and retrieve these policies by ID; and then instantiate
it and call appropriate methods whenever we use today the URLFilters and
do the score calculations.
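To make this concrete, here is a rough sketch of the kind of interface
such a policy could expose and how it could be looked up by the proposed
policyId (all of the names below - CrawlPolicy, PolicyRegistry - are
hypothetical; only CrawlDatum and URLFilter exist in Nutch today):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-page crawling policy: filtering, scoring and
 *  propagation rules bundled under one id. */
interface CrawlPolicy {
  /** Like URLFilter.filter(): return the (possibly rewritten) URL,
   *  or null to reject it. */
  String filter(String url);
  /** Score assigned to an outlink discovered on a page governed by this
   *  policy. A "favor" policy would boost it, "distrust" would lower it. */
  float scoreOutlink(String fromUrl, String toUrl, float parentScore);
  /** Policy id assigned to the outlink. Usually this policy's own id,
   *  so "favor"/"distrust" cascade down, but it may switch, e.g. when
   *  the outlink points at a known spam host. */
  String policyIdForOutlink(String fromUrl, String toUrl);
}

/** Hypothetical store for policies by id, so that a CrawlDatum.policyId
 *  field is enough to recover the full policy wherever URLFilters and
 *  score calculations run today. */
class PolicyRegistry {
  private final Map<String, CrawlPolicy> policies =
      new HashMap<String, CrawlPolicy>();
  void register(String id, CrawlPolicy policy) { policies.put(id, policy); }
  CrawlPolicy get(String id) { return policies.get(id); }
}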
Any comments?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Per-page crawling policy
Posted by Neal Whitley <ne...@e-travelmedia.com>.
For others working in a vertical search scenario, I am having good
luck with the following steps.
It starts with a bit of a manual process to obtain a good set of seed
URLs. For my current business I already had a basic seed list of about
7,500 unique links to the home pages of companies in my industry. But
then I wanted to increase the scope to include "any sites" in my
business category, not just a list of companies.
So now what:
1.) I found a nice desktop spider application to automate the
process. Visual Web Spider (http://www.newprosoft.com) has turned
out to be a very good tool for setting up very focused crawls. This
spider app has some "fantastic" filters (that I wish Nutch had) that
let me configure a crawl: crawl depth, a limit on the number of pages
per domain, URL stop words, URL include words, and more...
a.) Custom List Crawling: First I crawled the sites in my current
list of URLs to a depth of x. This increased the breadth of my seed
list; I told the spider to stay within the current domain and not
spider external sites.
b.) Search Engine Crawling: Visual Web Spider also has an easy-to-use
function that allows me to crawl Google, Yahoo, AllTheWeb and
AltaVista. So I created some URL fetch queries to pull
industry-specific pages from these search engines. Again, I could
configure depth, max pages, etc. from these starting points.
Example:
http://www.google.com/search?q=ExampleSearchTerm&hl=en&num=40&start=0
(The crawler will grab the first 40 results for my SearchTerm). Then
I could tell the spider how deep to crawl each page and how many
pages to grab etc.
c.) Site Crawling: There are a few dozen other "hub sites" in my
industry that offer excellent content, and I wanted to index the
majority of their content. So I set up a new task and told the
spider to grab all the pages in these domains only - but ran it
against a filter to exclude forums and some other content areas
that I did not want to spider.
2.) In my nutch-site.xml conf file I then set
"db.max.outlinks.per.page" to 10 (the default was 100). This keeps
the Nutch fetch lists smaller but a bit more focused (see the
property snippet below).
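For reference, that override goes inside the <configuration> element
of nutch-site.xml; a minimal snippet would look like this (the
description text is just illustrative):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>10</value>
  <description>Maximum number of outlinks kept per page; the stock
  default is 100.</description>
</property>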
In a short period of time I had tens of thousands of focused seed
URLs to inject into Nutch.
These steps have allowed me to start a vertical search engine without
having Nutch go too far in its own fetches/crawls, thus limiting the
number of URLs that get in. I still get a fair number of off-topic
sites in the database, but for the most part it's not a problem
because Nutch indexes the database so well. I love Nutch!
Neal
Re: Per-page crawling policy
Posted by Neal Whitley <ne...@e-travelmedia.com>.
Andrzej,
This sounds like another great way to create more of a vertical
search application. By defining trusted seed sources, we can
limit the scope of the crawl to a more suitable set of links.
Also, being able to apply additional rules by domain/host or by
trusted source would be great as well. E.g., if "trusted", allow
crawling of dynamic content and allow up to N pages of "?" URLs. Or
even have a trusted URL list where specific hosts would be crawled
for dynamic content. This may be similar to the "hub" concept at
Google, where certain sites carry a heavier weight - though perhaps
being able to apply this manually would be suitable in a Nutch
vertical implementation.
A "quality" score could also be calculated using a set of "core
keywords" that apply to a vertical. So a list of several hundred
core words could try to match words on the page that is being
crawled. When the crawler finds sites with these words it gives it
a bump in it's quality score - hence allowing for example a deeper
crawl of that site and extended crawling of outlinks.
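A minimal sketch of that keyword-based quality bump (purely
illustrative - the class and the boost values are made up, and how the
boost would feed back into crawl depth is left open):

import java.util.HashSet;
import java.util.Set;

/** Illustrative only: bump a page's quality score by how many of the
 *  vertical's "core keywords" appear in its extracted text. */
class CoreKeywordScorer {
  private final Set<String> coreWords = new HashSet<String>();

  CoreKeywordScorer(String[] words) {
    for (String w : words) coreWords.add(w.toLowerCase());
  }

  /** Returns the adjusted score: a small boost per matched keyword, capped. */
  float adjust(float baseScore, String pageText) {
    int hits = 0;
    for (String token : pageText.toLowerCase().split("\\W+")) {
      if (coreWords.contains(token)) hits++;
    }
    return baseScore + Math.min(hits * 0.05f, 1.0f);
  }
}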
I imagine extensive rule lists/filters like this might cause a strain
on a full web crawl. But for those of us who are only going to be
crawling a certain segment of the net (say 500,000 URLs or so), this
would not slow things down too badly.
Neal Whitley
Re: Per-page crawling policy
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
>Hi Andrzej
>
>The idea brings vertical search into Nutch, and it is definitely great :)
>I think Nutch should add an information retrieval layer into the whole
>architecture and export some abstract interfaces, say
>UrlBasedInformationRetrieve (you could implement your URL grouping idea
>here?), TextBasedInformationRetrieve, and DomBasedInformationRetrieve.
>Users could then implement these on their own for their vertical search.
>
>
We sort of reached an agreement to add Properties to CrawlDatum. Users
will be able to put arbitrary metadata in there, so that each page
record can be processed differently if need be.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Per-page crawling policy
Posted by Jack Tang <hi...@gmail.com>.
BTW: if Nutch is going to support vertical search, I think page
URLs should be grouped into three types: fetchable URLs (just fetch
them), extractable URLs (fetch them and extract information from the
page), and pagination URLs.
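A tiny sketch of what such a grouping could look like (hypothetical -
nothing like this exists in Nutch):

/** Hypothetical grouping of URLs for vertical search, as described above. */
enum UrlType {
  FETCHABLE,    // just fetch the page
  EXTRACTABLE,  // fetch the page and extract structured information from it
  PAGINATION    // a "next page" link, followed only to reach more records
}

/** A classifier that plugins could implement to assign one of the types. */
interface UrlTypeClassifier {
  UrlType classify(String url);
}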
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Per-page crawling policy
Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej
The idea brings vertical search into Nutch, and it is definitely great :)
I think Nutch should add an information retrieval layer into the whole
architecture and export some abstract interfaces, say
UrlBasedInformationRetrieve (you could implement your URL grouping idea
here?), TextBasedInformationRetrieve, and DomBasedInformationRetrieve.
Users could then implement these on their own for their vertical search.
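To illustrate, the abstract interfaces named above might look roughly
like this (a sketch only; the names come from the proposal, the method
signatures are invented):

import org.w3c.dom.Document;

/** Decide, from the URL alone, whether and how to handle a page. */
interface UrlBasedInformationRetrieve {
  boolean accepts(String url);
}

/** Extract information from the plain text of a fetched page. */
interface TextBasedInformationRetrieve {
  String extract(String url, String pageText);
}

/** Extract information from the parsed DOM of a fetched page. */
interface DomBasedInformationRetrieve {
  String extract(String url, Document dom);
}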
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: Per-page crawling policy
Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> In this vein, should modifiedTime be moved to metadata, once metadata
> is added?
I'm of a split mind on this, because I hope that the detection of
unmodified content will be the default mode of operation... OTOH,
perhaps it's a premature micro-optimization. We can move it to metadata
for now, but I see it as a strong candidate to be moved back...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Per-page crawling policy
Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Before we start adding metadata and more metadata, why not add general
> metadata support to the crawlDatum once? Then we can have any kind of
> plugin that adds and processes metadata that belongs to a URL.
+1
This feature strikes me as something that might prove very useful, but
might also prove unworkable, or at least not useful to everyone. Thus
it would be best if it doesn't require changes to a core class like
CrawlDatum. If it does eventually prove generally useful, as something
that everyone will use and that should be enabled by default, then we
could promote its data from metadata to a field for efficiency.
In this vein, should modifiedTime be moved to metadata, once metadata is
added?
Cheers,
Doug
Re: Per-page crawling policy
Posted by Stefan Groschupf <sg...@media-style.com>.
> Hehe... That was what I advocated from the beginning. There is a
> cost associated with this, though, i.e. any change in CrawlDatum
> size has a significant impact on most operations' performance.
Sure. If you ever had a look at the 0.7 metadata patch, I implemented
things there so that the metadata and its key were only written to the
file when metadata was actually present.
So no metadata means the same file size as before. In general we need
to accept that metadata pumps up the file size - and with it the
processing and I/O load - but people doing a complete web index can
work without metadata, and people who need this functionality have to
accept that nothing is free.
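A rough sketch of that write-only-when-present idea (illustrative
only, not the actual 0.7 patch code):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Records without metadata cost a single marker byte, so the file size
 *  stays essentially unchanged when no metadata is used. */
class MetadataIO {
  static void write(DataOutput out, Map<String, String> meta)
      throws IOException {
    if (meta == null || meta.isEmpty()) {
      out.writeByte(0);                       // no metadata follows
      return;
    }
    out.writeByte(1);                         // metadata follows
    out.writeInt(meta.size());
    for (Map.Entry<String, String> e : meta.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  static Map<String, String> read(DataInput in) throws IOException {
    Map<String, String> meta = new HashMap<String, String>();
    if (in.readByte() == 0) return meta;      // nothing was written
    int n = in.readInt();
    for (int i = 0; i < n; i++) meta.put(in.readUTF(), in.readUTF());
    return meta;
  }
}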
>
>> All the solutions I have seen until today load this kind of metadata
>> from a third-party data source (a database) at indexing time and add
>> it to the index. This works, but it is very slow.
>
>
> Well, maybe it makes sense to store the CrawlDatum and its
> "metadata" separately in two MapFiles, so that you can perform some
> operations using only the lightweight CrawlDatum, and for other
> operations you will need to load the properties too...
Yes, I like this idea, and I remember that Doug had suggested such a
solution as well.
However, first I will focus on the NutchConf issue.
Stefan
Re: Per-page crawling policy
Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> I like the idea, and it is another step in the direction of vertical
> search, where I personally see the biggest opportunity for Nutch.
>
>> How to implement it? Surprisingly, I think that it's very simple -
>> just adding a CrawlDatum.policyId field would suffice, assuming we
>> have a means to store and retrieve these policies by ID; and then
>> instantiate it and call appropriate methods whenever we use today
>> the URLFilters and do the score calculations.
>
>
> Before we start adding metadata and more metadata, why not add general
> metadata support to the crawlDatum once? Then we can have any kind of
> plugin that adds and processes metadata that belongs to a URL.
> Besides policyId, I see many more candidates for crawl metadata:
> last crawl date, URL category, collection key (similar to Google
> Appliance collections), etc.
>
Hehe... That was what I advocated from the beginning. There is a cost
associated with this, though, i.e. any change in CrawlDatum size has a
significant impact on most operations' performance.
> All the solutions I have seen until today load this kind of metadata
> from a third-party data source (a database) at indexing time and add it
> into the index. This works, but it is very slow.
Well, maybe it makes sense to store the CrawlDatum and its "metadata"
separately in two MapFiles, so that you can perform some operations
using only the lightweight CrawlDatum, and for other operations you will
need to load the properties too...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Per-page crawling policy
Posted by Stefan Groschupf <sg...@media-style.com>.
I like the idea, and it is another step in the direction of vertical
search, where I personally see the biggest opportunity for Nutch.
> How to implement it? Surprisingly, I think that it's very simple -
> just adding a CrawlDatum.policyId field would suffice, assuming we
> have a means to store and retrieve these policies by ID; and then
> instantiate it and call appropriate methods whenever we use today
> the URLFilters and do the score calculations.
Before we start adding metadata and more metadata, why not add general
metadata support to the crawlDatum once? Then we can have any kind of
plugin that adds and processes metadata that belongs to a URL.
Besides policyId, I see many more candidates for crawl metadata:
last crawl date, URL category, collection key (similar to Google
Appliance collections), etc.
All the solutions I have seen until today load this kind of metadata
from a third-party data source (a database) at indexing time and add it
into the index. This works, but it is very slow.
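To make the suggestion concrete, the kind of API this could give a
crawl record might look roughly like the sketch below (illustrative
only; these accessors are not the actual CrawlDatum API):

import java.util.HashMap;
import java.util.Map;

/** Sketch of a crawl record carrying generic per-URL metadata that
 *  plugins (policy id, URL category, collection key, last crawl date,
 *  ...) can read and write. */
class CrawlDatumWithMetadata {
  private final Map<String, String> metaData = new HashMap<String, String>();

  String getMeta(String key) { return metaData.get(key); }
  void putMeta(String key, String value) { metaData.put(key, value); }
}

// A plugin could then do, for example:
//   datum.putMeta("policyId", "favor");
//   datum.putMeta("collection", "travel");
//   String category = datum.getMeta("urlCategory");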
Stefan
Re: Per-page crawling policy
Posted by Byron Miller <by...@yahoo.com>.
Excellent ideas. That is what I'm hoping to do with some of the
social bookmarking type ideas: build the starter sites and link
maps from them.
I hope to work with Simpy or other bookmarking projects to build
something of a popularity map (human-edited authority) to merge and
calculate against a computer-generated map (via standard link
processing, anchor results and such).
My only continuing question is how to manage the merge/index process
of staging and processing crawl/fetch jobs such as this. It seems all
of our theories assume a single crawl and a publish of that index,
rather than a living, breathing corpus.
Unless we map/bucket the segments to serve some purpose, it's
difficult to manage how we process, sort, or analyze them to define
or extract more meaning from them.
Brain is exploding :)
-byron