Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/01/05 14:58:44 UTC

Per-page crawling policy

Hi,

I've been toying with the following idea, which is an extension of the 
existing URLFilter mechanism and the concept of a "crawl frontier".

Let's suppose we have several initial seed urls, each with a different 
subjective quality. We would like to crawl these, and expand the 
"crawling frontier" using outlinks. However, we don't want to do this 
uniformly for every initial url, but rather propagate a certain "crawling 
policy" through the expanding trees of linked pages. This "crawling 
policy" could consist of url filters, scoring methods, etc. - basically 
anything configurable in Nutch could be included in this "policy". 
Perhaps it could even be the new version of the non-static NutchConf ;-)

Then, if a given initial url is a known high-quality source, we would 
like to apply a "favor" policy, where we e.g. add pages linked from that 
url and, in doing so, give them a higher score. Recursively, we could 
apply the same policy to the next generation of pages, or perhaps only to 
pages belonging to the same domain. So, in a sense, the original notion 
of high quality would cascade down to other linked pages. The important 
thing to note is that all newly discovered pages would be 
subject to the same policy - unless we have compelling reasons to switch 
the policy (from "favor" to "default" or to "distrust"), which at that 
point would essentially change the shape of the expanding frontier.

If a given initial url is a known spammer, we would like to apply a 
"distrust" policy when adding pages linked from that url (e.g. adding or 
not adding them; if adding, then lowering their score or applying a 
different score calculation). And recursively we could apply a similar 
policy of "distrust" to any pages discovered this way. We could also 
change the policy along the way, if there are compelling reasons to do 
so. This means that we could follow some high-quality links from 
low-quality pages, without drilling down into sites which are known to 
be of low quality.
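
To make the cascading concrete, here is a minimal, hypothetical sketch 
of how an outlink could inherit or switch its parent's policy. The 
policy names, the host lists and the switching rule are my own 
assumptions for illustration; none of this is existing Nutch code.

import java.net.URI;
import java.util.Set;

// Hypothetical illustration only: how a crawl policy could cascade along outlinks.
public class PolicyCascade {

  enum Policy { FAVOR, DEFAULT, DISTRUST }

  /**
   * Decide which policy a newly discovered outlink should carry.
   * By default it inherits the parent's policy; a "compelling reason"
   * (here simply membership in a spam or good host list) switches it.
   */
  static Policy policyForOutlink(Policy parentPolicy, String toUrl,
                                 Set<String> knownSpamHosts,
                                 Set<String> knownGoodHosts) {
    String host = URI.create(toUrl).getHost();
    if (knownSpamHosts.contains(host)) {
      return Policy.DISTRUST;                 // stop favoring a known spammer
    }
    if (parentPolicy == Policy.DISTRUST && knownGoodHosts.contains(host)) {
      return Policy.DEFAULT;                  // follow a high-quality link out of a bad area
    }
    return parentPolicy;                      // the usual case: the policy cascades unchanged
  }
}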

Special care needs to be taken if the same page is discovered from pages 
with different policies; I haven't thought about this aspect yet... ;-)

What would be the benefits of such an approach?

* the initial page + policy would both control the expanding crawling 
frontier, and it could be differently defined for different starting 
pages. I.e. in a single web database we could keep different 
"collections" or "areas of interest" with differently specified 
policies. But still we could reap the benefits of a single web db, 
namely the link information.

* URLFilters could be grouped into several policies, and it would be 
easy to switch between them, or edit them.

* if the crawl process realizes it ended up on a spam page, it can 
switch the page policy to "distrust", or the other way around, and stop 
crawling unwanted content. From now on the pages linked from that page 
will follow the new policy. In other words, if a crawling frontier 
reaches pages with known quality problems, it would be easy to change 
the policy on-the-fly to avoid them or pages linked from them, without 
resorting to modifications of URLFilters.

Some of the above you can do even now with URLFilters, but any change 
you make has global consequences. You may also end up with awfully 
complicated rules if you try to cover all cases in one rule set.

How to implement it? Surprisingly, I think that it's very simple - just 
adding a CrawlDatum.policyId field would suffice, assuming we have a 
means to store and retrieve these policies by ID; we would then 
instantiate the policy and call the appropriate methods wherever we use 
the URLFilters and do the score calculations today.
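
As a rough sketch of what that could look like (the PolicyRegistry and 
CrawlingPolicy names are hypothetical, standing in for whatever 
mechanism would resolve a policyId to concrete filters and scoring):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: per-page policies resolved by the proposed policyId,
// taking over the role played today by the global URLFilters and score calculation.
public class PolicyRegistry {

  public interface CrawlingPolicy {
    boolean accept(String url);                          // per-policy URL filtering
    float initialScore(String url, float parentScore);   // per-policy score calculation
  }

  private final Map<String, CrawlingPolicy> policies = new HashMap<String, CrawlingPolicy>();

  public void register(String policyId, CrawlingPolicy policy) {
    policies.put(policyId, policy);
  }

  /** Resolve the policy stored with a page record, falling back to "default". */
  public CrawlingPolicy get(String policyId) {
    CrawlingPolicy p = policies.get(policyId);
    return p != null ? p : policies.get("default");
  }
}

Wherever the crawler currently applies the URLFilters and computes a 
page's initial score, it would first look up the record's policy by its 
policyId and delegate to that instead.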

Any comments?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-page crawling policy

Posted by Neal Whitley <ne...@e-travelmedia.com>.
For others working in a vertical search scenario, I am having some 
good luck with the following steps.

For starters, it begins with a bit of a manual process to obtain a 
good seed starting point.  For my current business I already had a 
basic seed list of about 7,500 unique links to home pages of 
companies in my industry.  But then I wanted to increase the scope to 
include "any sites" in my business category, not just a list of companies.

So now what:

1.)  I found a nice desktop spider application to automate the 
process.  Visual Web Spider (http://www.newprosoft.com) has turned 
out to be a very good tool for setting up some very focused crawls.  This 
spider app has some "fantastic" filters (that I wish Nutch had) that 
allow me to configure a crawl:  depth of crawl, limiting the number of 
pages per domain, url stop words, url include words and more...

a.)  Custom List Crawling:  First I crawled the sites in my current 
list of urls to a depth of x.  This increased the width of my current 
seed list, and I told the spider to stay in the current domain and not 
spider external sites.

b.)  Search Engine Crawling:  Visual Web Spider also has an easy-to-use 
function that allows me to crawl Google, Yahoo, All The Web and Alta 
Vista.  So now I created some URL fetch queries to query industry-specific 
pages from these search engines.  Again, I could configure 
depth, max pages etc. from these starting points.

Example: 
http://www.google.com/search?q=ExampleSearchTerm&hl=en&num=40&start=0 
(The crawler will grab the first 40 results for my SearchTerm).  Then 
I could tell the spider how deep to crawl each page and how many 
pages to grab etc.

c.)  Site Crawling:  There are a few dozen other "Hub Sites" in my 
industry that offer excellent content, where I wanted to index the 
majority of their content.  So I set up a new task and told the 
spider to grab all the pages in these domains only - but ran it 
against a filter to exclude forums and some other content areas 
that I did not want to spider.


2.)  In my nutch-site.xml conf file I then set 
"db.max.outlinks.per.page" to 10 (the default was 100).  This is keeping 
nutch fetch lists smaller but a bit more focused.
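
For reference, the override would look something like this in 
nutch-site.xml (just the property element; it goes inside the file's 
existing root configuration element, whose name depends on the Nutch 
version):

<property>
  <name>db.max.outlinks.per.page</name>
  <value>10</value>
  <description>Limit outlinks taken per page to keep fetch lists focused.</description>
</property>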

In a short period of time I had tens of thousands of focused seed 
urls to inject into Nutch.

These steps have allowed me to start a vertical search engine and not 
have Nutch go too far in its own fetches/crawls, thus limiting the 
number of urls that get in.  I still get a fair number of off-topic 
sites in the database, but for the most part it's not a problem 
because Nutch indexes the database so well.  I love Nutch!


Neal



Re: Per-page crawling policy

Posted by Neal Whitley <ne...@e-travelmedia.com>.
Andrzej,

This also sounds like a great way to create more of a vertical 
search application.  By defining trusted seed sources we can 
limit the scope of the crawl to a more suitable set of links.

Also, being able to apply additional rules by domain/host or by 
trusted source would be great as well.  E.g., if "trusted", allow 
crawling of dynamic content and allow up to N pages of "?" urls.  Or 
even having a trusted URL list where specific hosts would be crawled 
for dynamic content.  This may be similar to the "Hub" concept of 
Google, where certain sites carry a heavier weight - though perhaps 
being able to apply this manually would be suitable in a Nutch 
vertical implementation.

A "quality" score could also be calculated using a set of "core 
keywords" that apply to a vertical.   So a list of several hundred 
core words could try to match words on the page that is being 
crawled.   When the crawler finds sites with these words it gives it 
a bump in it's quality score - hence allowing for example a deeper 
crawl of that site and extended crawling of outlinks.
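
A minimal sketch of that kind of keyword bump might look like the 
following; the class, the boost formula and the sample keywords are 
made up for illustration and are not part of Nutch.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a core-keyword quality boost; not an existing Nutch API.
public class CoreKeywordScorer {

  private final Set<String> coreKeywords;

  public CoreKeywordScorer(Set<String> coreKeywords) {
    this.coreKeywords = coreKeywords;
  }

  /** Count distinct core keywords appearing in the page text. */
  public int matches(String pageText) {
    Set<String> seen = new HashSet<String>();
    for (String token : pageText.toLowerCase().split("\\W+")) {
      if (coreKeywords.contains(token)) {
        seen.add(token);
      }
    }
    return seen.size();
  }

  /** Bump the score a little per matched keyword, capped at a 2x boost. */
  public float adjustScore(float baseScore, String pageText) {
    float boost = 1.0f + Math.min(1.0f, matches(pageText) * 0.05f);
    return baseScore * boost;
  }

  public static void main(String[] args) {
    Set<String> core = new HashSet<String>(Arrays.asList("travel", "hotel", "flight"));
    CoreKeywordScorer scorer = new CoreKeywordScorer(core);
    System.out.println(scorer.adjustScore(1.0f, "Cheap flight and hotel deals for Europe"));
  }
}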

I imagine extensive rule lists/filters like this might cause a strain 
on a full web crawl.  But for those of us who are only going to be 
crawling a certain segment of the net, this would not slow things down 
too badly (say 500,000 urls or so).

Neal Whitley



Re: Per-page crawling policy

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:

>Hi Andrzej
>
>The idea brings vertical search into nutch and it is definitely great :)
>I think nutch should add an information retrieval layer to the whole
>architecture, and export some abstract interfaces, say
>UrlBasedInformationRetrieve (you could implement your url grouping idea
>here?), TextBasedInformationRetrieve, DomBasedInformationRetrieve. Users
>could then implement these on their own in their vertical search.
>  
>

We sort of reached an agreement to add Properties to CrawlDatum. Users 
will be able to put arbitrary metadata in there, so that each page 
record can be processed differently if need be.
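
As a toy illustration of the idea (CrawlRecord and its methods are 
invented here; the actual CrawlDatum API may end up looking different):

import java.util.Properties;

// Hypothetical sketch of keyed, arbitrary per-page metadata on a crawl record.
public class CrawlRecord {

  private final Properties metadata = new Properties();

  public void setMeta(String key, String value) {
    metadata.setProperty(key, value);
  }

  public String getMeta(String key, String defaultValue) {
    return metadata.getProperty(key, defaultValue);
  }

  public static void main(String[] args) {
    CrawlRecord page = new CrawlRecord();
    page.setMeta("policyId", "favor");      // the per-page policy discussed in this thread
    page.setMeta("collection", "travel");   // another metadata candidate from the thread
    System.out.println(page.getMeta("policyId", "default"));
  }
}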

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-page crawling policy

Posted by Jack Tang <hi...@gmail.com>.
BTW: if nutch is going to support vertical search, I think page
urls should be grouped into three types: fetchable urls (just fetch them),
extractable urls (fetch them and extract information from the page) and
pagination urls.
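
A hypothetical sketch of such a grouping (the URL patterns and the 
class itself are invented for illustration; real sites would need 
their own rules):

import java.util.regex.Pattern;

// Hypothetical sketch of grouping urls into fetchable / extractable / pagination types.
public class UrlGrouper {

  enum UrlType { FETCHABLE, EXTRACTABLE, PAGINATION }

  private static final Pattern PAGINATION = Pattern.compile("[?&](page|start|offset)=\\d+");
  private static final Pattern DETAIL_PAGE = Pattern.compile("/(product|article|listing)/");

  static UrlType classify(String url) {
    if (PAGINATION.matcher(url).find()) return UrlType.PAGINATION;   // crawl only to find more links
    if (DETAIL_PAGE.matcher(url).find()) return UrlType.EXTRACTABLE; // fetch and extract fields
    return UrlType.FETCHABLE;                                        // just fetch and index it
  }

  public static void main(String[] args) {
    System.out.println(classify("http://example.com/listing/123"));   // EXTRACTABLE
    System.out.println(classify("http://example.com/search?page=2")); // PAGINATION
  }
}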

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Per-page crawling policy

Posted by Jack Tang <hi...@gmail.com>.
Hi Andrzej

The idea brings vertical search into nutch and it is definitely great :)
I think nutch should add an information retrieval layer to the whole
architecture, and export some abstract interfaces, say
UrlBasedInformationRetrieve (you could implement your url grouping idea
here?), TextBasedInformationRetrieve, DomBasedInformationRetrieve. Users
could then implement these on their own in their vertical search.

/Jack


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: Per-page crawling policy

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Stefan Groschupf wrote:
>
>> Before we start adding metadata and more metadata, why not add general 
>> metadata support to the crawlDatum once; then we can have any kind of 
>> plugin that adds and processes metadata that belongs to a url.
>
>
> +1
>
> This feature strikes me as something that might prove very useful, but 
> might also prove unworkable, or at least not useful to everyone.  Thus 
> it would be best if it doesn't require changes to a core class like 
> CrawlDatum.  If it does eventually prove generally useful, as 
> something that everyone will use and that should be enabled by 
> default, then we could promote its data from metadata to a field for 
> efficiency.
>
> In this vein, should modifiedTime be moved to metadata, once metadata 
> is added?


I'm of a split mind on this, because I hope that the detection of 
unmodified content will be the default mode of operation... OTOH, 
perhaps it's a premature micro-optimization. We can move it to metadata 
for now, but I see it as a strong candidate to be moved back...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-page crawling policy

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> Before we start adding metadata and more metadata, why not add general 
> metadata support to the crawlDatum once; then we can have any kind of 
> plugin that adds and processes metadata that belongs to a url.

+1

This feature strikes me as something that might prove very useful, but 
might also prove unworkable, or at least not useful to everyone.  Thus 
it would be best if it doesn't require changes to a core class like 
CrawlDatum.  If it does eventually prove generally useful, as something 
that everyone will use and that should be enabled by default, then we 
could promote its data from metadata to a field for efficiency.

In this vein, should modifiedTime be moved to metadata, once metadata is 
added?

Cheers,

Doug

Re: Per-page crawling policy

Posted by Stefan Groschupf <sg...@media-style.com>.
> Hehe... That was what I advocated from the beginning. There is a  
> cost associated with this, though, i.e. any change in CrawlDatum  
> size has a significant impact on most operations' performance.
Sure. If you ever had a look at the 0.7 metadata patch, I implemented 
things there in such a way that the metadata and its key were only 
written to the file when metadata was actually present.
So no metadata means the same file size as before. In general we 
need to accept that metadata pumps up the file size and thus the 
processing and IO load, but people doing a complete web index can work 
without metadata, and people who need this functionality need to accept 
that nothing is for free.
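
A rough sketch of that pay-only-if-you-use-it layout, assuming a 
Writable-style write/readFields pair (this is not the actual 0.7 patch; 
an empty map here costs only a single size byte per record):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: metadata is serialized just when it is present,
// so records without metadata stay (almost) as small as before.
// Assumes fewer than 128 metadata entries per record.
public class OptionalMetadata {

  private final Map<String, String> meta = new LinkedHashMap<String, String>();

  public void put(String key, String value) {
    meta.put(key, value);
  }

  public void write(DataOutput out) throws IOException {
    out.writeByte(meta.size());               // 0 when there is no metadata at all
    for (Map.Entry<String, String> e : meta.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    meta.clear();
    int n = in.readByte();
    for (int i = 0; i < n; i++) {
      meta.put(in.readUTF(), in.readUTF());
    }
  }
}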
>
>> All solutions I have seen until today load this kind of metadata at 
>> indexing time from a third-party data source (a database) and add 
>> it to the index. This works but is very slow.
>
>
> Well, maybe it makes sense to store the CrawlDatum and its  
> "metadata" separately in two MapFiles, so that you can perform some  
> operations using only the lightweight CrawlDatum, and for other  
> operations you will need to load the properties too...

Yes, I like this idea, and I remember that Doug had suggested such a 
solution as well.
However, first I will focus on the NutchConf issue.

Stefan


Re: Per-page crawling policy

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:

> I like the idea, and it is another step in the direction of vertical 
> search, where I personally see the biggest opportunity for nutch.
>
>> How to implement it? Surprisingly, I think that it's very simple - 
>> just adding a CrawlDatum.policyId field would suffice, assuming we 
>> have a means to store and retrieve these policies by ID; we would then 
>> instantiate the policy and call the appropriate methods wherever we use 
>> the URLFilters and do the score calculations today.
>
>
> Before we start adding metadata and more metadata, why not add general 
> metadata support to the crawlDatum once; then we can have any kind of 
> plugin that adds and processes metadata that belongs to a url.
> Besides policyId, I see many more candidates for crawl metadata:
> last crawl date, url category, collection key (similar to google 
> appliance collections), etc.
>

Hehe... That was what I advocated from the beginning. There is a cost 
associated with this, though, i.e. any change in CrawlDatum size has a 
significant impact on most operations' performance.

> All solutions I have seen until today load this kind of metadata at 
> indexing time from a third-party data source (a database) and add it 
> to the index. This works but is very slow.


Well, maybe it makes sense to store the CrawlDatum and its "metadata" 
separately in two MapFiles, so that you can perform some operations 
using only the lightweight CrawlDatum, and for other operations you will 
need to load the properties too...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-page crawling policy

Posted by Stefan Groschupf <sg...@media-style.com>.
I like the idea, and it is another step in the direction of vertical 
search, where I personally see the biggest opportunity for nutch.

> How to implement it? Surprisingly, I think that it's very simple - 
> just adding a CrawlDatum.policyId field would suffice, assuming we 
> have a means to store and retrieve these policies by ID; we would then 
> instantiate the policy and call the appropriate methods wherever we use 
> the URLFilters and do the score calculations today.

Before we start adding metadata and more metadata, why not add general 
metadata support to the crawlDatum once; then we can have any kind of 
plugin that adds and processes metadata that belongs to a url.
Besides policyId, I see many more candidates for crawl metadata:
last crawl date, url category, collection key (similar to google 
appliance collections), etc.

All solutions I have seen until today load this kind of metadata at 
indexing time from a third-party data source (a database) and add it 
to the index. This works but is very slow.

Stefan


Re: Per-page crawling policy

Posted by Byron Miller <by...@yahoo.com>.
Excellent ideas, and that is what I'm hoping to do: use
some of the social bookmarking type ideas to build the
starter sites and link maps from.

I hope to work with Simpy or other bookmarking
projects to build somewhat of a popularity map (human
edited authority) to merge and calculate against a
computer generated map (via standard link processing,
anchor results and such).

My only continuing question is how to manage the
merge/index process of staging and processing your
crawl/fetch jobs such as this.  It seems all of our
theories assume a single crawl and a publish of that
index rather than a living, breathing corpus.

Unless we map/bucket the segments to have some purpose,
it's difficult to manage how we process them, sort
them or analyze them to define or extract more meaning
from them.

Brain is exploding :)

-byron
