Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2007/09/17 21:38:36 UTC

Host-level stats, ranking and recrawl

Hi,

I was recently re-reading some scoring-related papers, and found some 
interesting data in a paper by Baeza-Yates et al., "Crawling a Country: 
Better Strategies than Breadth-First for Web Page Ordering" 
(http://citeseer.ist.psu.edu/730674.html).

This paper compares various strategies for prioritizing the crawl of 
unfetched pages. Among others, it compares OPIC scoring with a simple 
strategy called "large sites first", which prioritizes pages from large 
sites and deprioritizes pages from small and medium sites. To measure 
effectiveness, the authors plot accumulated PageRank against the 
percentage of crawled pages - the strategy that ramps up aggregate 
PageRank the fastest wins.
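
To make the idea concrete: "large sites first" essentially boils down to 
sorting candidate URLs by how many pages we already know about on their 
host. A minimal sketch (nothing like this exists in Nutch today; the map 
of per-host counts is assumed to come from host-level statistics we don't 
collect yet):

  import java.net.MalformedURLException;
  import java.net.URL;
  import java.util.Comparator;
  import java.util.Map;

  /** Illustration only: order candidate URLs so that pages from hosts
   *  with many known pages come first. */
  public class LargeSitesFirstComparator implements Comparator<String> {

    private final Map<String, Long> hostPageCount;

    public LargeSitesFirstComparator(Map<String, Long> hostPageCount) {
      this.hostPageCount = hostPageCount;
    }

    private long knownPages(String url) {
      try {
        Long count = hostPageCount.get(new URL(url).getHost());
        return count == null ? 0L : count.longValue();
      } catch (MalformedURLException e) {
        return 0L;
      }
    }

    public int compare(String a, String b) {
      // descending by host size: larger sites first
      return Long.compare(knownPages(b), knownPages(a));
    }
  }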

A bit surprisingly, they found that large-sites-first wins over OPIC:

"Breadth-first is close to the best strategies for the first 20-30% of 
pages, but after that it becomes less efficient.
  The strategies batch-pagerank, larger-sites-first and OPIC have better 
performance than the other strategies, with an advantage towards 
larger-sites-first when the desired coverage is high. These strategies 
can retrieve about half of the Pagerank value of their domains 
downloading only around 20-30% of the pages."

Nutch currently uses OPIC-like scoring for this, so most likely it 
suffers from the same symptoms (the authors also note OPIC's relatively 
poor performance at the beginning of a crawl).

Nutch doesn't currently collect any host-level statistics, so we 
couldn't use the other strategy even if we wanted to.

What if we added a host-level DB to Nutch? Arguments against: it's an 
additional data structure to maintain, which adds complexity to the 
system, and it's an additional step in the workflow (-> it takes longer 
to complete one crawl cycle). Arguments for: we could implement the 
above scoring method ;), plus host-level statistics are useful for 
detecting spam sites, limiting the crawl by site size, etc.
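
Just to illustrate what such a DB could look like, here is a rough 
sketch of a per-host record as a Hadoop Writable - the class name and 
fields are made up for this example, not an existing Nutch API:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  /** Hypothetical value class for a host-level DB keyed by host name;
   *  the fields are just examples of what could be tracked. */
  public class HostDatum implements Writable {

    private long fetched;    // pages successfully fetched from this host
    private long unfetched;  // pages discovered but not yet fetched
    private long errors;     // fetch attempts that failed
    private float score;     // aggregate OPIC-like score of the host's pages

    public void write(DataOutput out) throws IOException {
      out.writeLong(fetched);
      out.writeLong(unfetched);
      out.writeLong(errors);
      out.writeFloat(score);
    }

    public void readFields(DataInput in) throws IOException {
      fetched = in.readLong();
      unfetched = in.readLong();
      errors = in.readLong();
      score = in.readFloat();
    }

    // getters/setters left out to keep the sketch short
  }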

We could start by implementing a tool to collect such statistics from 
CrawlDb - this should be a trivial map-reduce job, so if anyone wants to 
take a crack at this it would be a good exercise ... ;)
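
For the curious, a bare-bones sketch of that job, using plain Hadoop 
types rather than Nutch's own classes (CrawlDb entries are keyed by URL, 
so the mapper only has to re-key each entry by host and the reducer sums 
the counts); treat it as a starting point, not a finished tool:

  import java.io.IOException;
  import java.net.MalformedURLException;
  import java.net.URL;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  /** Counts pages per host: re-key each CrawlDb entry by host, then sum. */
  public class HostStatsJob {

    public static class HostMapper
        extends Mapper<Text, Writable, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);
      private final Text host = new Text();

      @Override
      protected void map(Text url, Writable datum, Context context)
          throws IOException, InterruptedException {
        String h;
        try {
          h = new URL(url.toString()).getHost();
        } catch (MalformedURLException e) {
          return; // skip malformed URLs
        }
        host.set(h);
        context.write(host, ONE);
      }
    }

    public static class SumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

      @Override
      protected void reduce(Text host, Iterable<LongWritable> counts,
          Context context) throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable c : counts) {
          total += c.get();
        }
        context.write(host, new LongWritable(total));
      }
    }
  }

The same shape works for any per-host aggregate (bytes fetched, error 
counts, etc.) by emitting a richer value than a plain count.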

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Host-level stats, ranking and recrawl

Posted by Chris Schneider <Sc...@TransPac.com>.
Andrzej, et. al.,

At 9:38 PM +0200 9/17/07, Andrzej Bialecki wrote:
>[...]
>We could start by implementing a tool to collect such statistics from CrawlDb - this should be a trivial map-reduce job, so if anyone wants to take a crack at this it would be a good exercise ... ;)

Stefan Groschupf developed a tool (with a little help from me) called DomainStats that collects such domain-level data from the crawl results (both crawldb and segment data). We use it to count both pages crawled in each domain and pages crawled that meet a "technical" threshold, since the tool can be used to select for various field and metadata conditions when counting pages. We use the results to create a "white list" of the most technical domains in which to focus our next crawl. Domains and sub-domains are counted separately, so you get separate counts for www.apache.org, apache.org, and org.
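
For readers wondering how a single pass yields counts at all three 
levels: the trick is to expand each host name into its suffixes and 
update every one of them. This is not the actual DomainStats code, just 
an illustration of the idea:

  import java.util.ArrayList;
  import java.util.List;

  /** Expands "www.apache.org" into ["www.apache.org", "apache.org", "org"],
   *  so one pass can update the count at every level. */
  public class DomainLevels {
    public static List<String> levels(String host) {
      List<String> result = new ArrayList<String>();
      String[] parts = host.split("\\.");
      for (int i = 0; i < parts.length; i++) {
        StringBuilder sb = new StringBuilder();
        for (int j = i; j < parts.length; j++) {
          if (j > i) sb.append('.');
          sb.append(parts[j]);
        }
        result.add(sb.toString());
      }
      return result;
    }
  }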

Is there a Jira ticket open for this? If not, I could create one and submit a patch. We're currently using a Nutch code base that originated around revision 417928, but I think this is pretty self-contained.

Let me know,

- Schmed
-- 
----------------------------
Chris Schneider
Krugle, Inc.
http://www.krugle.com
CSchneider@Krugle.com
----------------------------

Re: Host-level stats, ranking and recrawl

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 9/17/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> [...]
> What if we added a host-level DB to Nutch? Arguments against: it's an
> additional data structure to maintain, which adds complexity to the
> system, and it's an additional step in the workflow (-> it takes longer
> to complete one crawl cycle). Arguments for: we could implement the
> above scoring method ;), plus host-level statistics are useful for
> detecting spam sites, limiting the crawl by site size, etc.

Another +1. We definitely need domain-level statistics anyway, so
being able to implement large-sites-first is a nice bonus, I think :)



-- 
Doğacan Güney

RE: Host-level stats, ranking and recrawl

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Andrzej,

  This sounds like a good addition to the current system IMO. It would be
especially helpful for building a generic web search, or a domain-specific
search where you have an algorithm to prioritize which sites to crawl for
your domain.

  I would go one step further and say that we should consider storing
domain-level stats, and even IP-level stats if possible: e.g., how many
pages we have from each host/domain/IP (H/D/I), the avg. error rate while
crawling pages from an H/D/I, the number of dynamic pages from an H/D/I,
the avg. page size, the avg. response time from the H/D/I, etc.
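
To make that concrete, the record kept per host/domain/IP might look
roughly like the following - the field names are invented for this sketch:

  /** Illustrative aggregate record kept per host, domain or IP. */
  public class AggregateStats {
    long pages;             // pages we have from this host/domain/IP
    long fetchErrors;       // failed fetch attempts
    long dynamicPages;      // pages with query strings / session ids
    long totalBytes;        // sum of page sizes, for the average size
    long totalResponseMs;   // sum of response times, for the average

    double errorRate() {
      long attempts = pages + fetchErrors;
      return attempts == 0 ? 0.0 : (double) fetchErrors / attempts;
    }
    double avgPageBytes()  { return pages == 0 ? 0.0 : (double) totalBytes / pages; }
    double avgResponseMs() { return pages == 0 ? 0.0 : (double) totalResponseMs / pages; }
  }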

  These stats would be very useful for improving crawler efficiency as
well. For example, if we know that a host/domain's error rate is very high,
the scoring plugin can penalize URLs from that host/domain so they are
deprioritized while crawling.
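
One possible shape for such a penalty (purely illustrative - there is no
such hook in Nutch today):

  public class ErrorRatePenalty {
    /** Damp a URL's generator sort value by its host's observed error rate,
     *  so URLs from unreliable hosts sink toward the bottom of the fetchlist. */
    public static float penalize(float sortValue, double hostErrorRate) {
      // errorRate 0.0 -> unchanged; errorRate 0.9 -> sort value cut to 10%
      return (float) (sortValue * (1.0 - hostErrorRate));
    }
  }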

Also, based on the avg. response time from a host/domain, we can mix an
appropriate number of pages from various sites in a fetchlist so that the
fetch completes within a certain time. Currently we have a global property,
max.pages.per.host (or something like that). Instead, let's say we specify
the amount of time we want to spend on one fetch. Then, using the estimated
response time of each site, we can mix in more pages from fast sites and
fewer from slow ones.
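
The per-host budget then falls out of simple arithmetic, something like:

  /** Toy calculation, nothing Nutch-specific. */
  public class FetchBudget {
    /** How many pages can one host contribute if the whole fetch should
     *  finish within targetSeconds, given the host's average response time
     *  and the crawl delay honoured between requests to the same host? */
    public static long pagesForHost(long targetSeconds,
                                    double avgResponseSeconds,
                                    double crawlDelaySeconds) {
      double secondsPerPage = avgResponseSeconds + crawlDelaySeconds;
      return (long) Math.floor(targetSeconds / secondsPerPage);
    }
    // e.g. pagesForHost(3600, 0.5, 1.0) == 2400: a one-hour fetch, 0.5 s
    // average response time and a 1 s crawl delay allow at most 2400 pages
    // from that host in this fetchlist.
  }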

Lastly, as Andrzej said, aggregated stats are useful for spam detection.
Say you have identified a host as spam: there is a high probability that
other hosts from the same domain are spam as well (except for portal sites
like geocities.com, of course).

Basically, what I am trying to say is that this is definitely something we
should seriously consider integrating into Nutch - a big thumbs up from me
:)

Regards,

-vishal.

