Posted to dev@nutch.apache.org by "Dalton, Jeffery" <jd...@globalspec.com> on 2005/11/29 20:09:46 UTC

(Re-Formatted) RE: Nutch WebDb storage alternatives: Revisited

 
Here is my post with corrected indentation for better readability ;-).
Apparently the HTML font colors didn't make the transition to this
mailing list.

Sorry for any inconvenience.

- Jeff

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@nutch.org] 
> Sent: Monday, November 28, 2005 5:59 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Nutch WebDb storage alternatives: Revisited
> 
> Dalton, Jeffery wrote:
> > I would propose that, even in crawling large web collections, the
> > updates may not always be proportional to the total size of the
> > database if you want to keep your index fresh.  One of the goals of
> > a web search engine is to be an accurate representation of what is
> > found on the web, to maximize freshness.  Several studies have
> > shown that the web is very dynamic, with a subset of pages changing
> > constantly -- weekly (15%), daily or hourly (~23%), or even more
> > often ("The Evolution of the Web and the Implications for an
> > Incremental Crawler" -- http://citeseer.ist.psu.edu/419993.html and
> > "What's new on the web? The evolution of the Web from a Search
> > Engine Perspective" -- http://www.www2004.org/proceedings/docs/1p1.pdf).
> > In order to keep dynamic pages fresh they must be crawled and
> > indexed at a high frequency.  In short, you have to be able to
> > update your database often, say on a daily basis, to keep it fresh
> > with important dynamic pages.
> 
> I don't follow your argument.  If X% of pages change every 
> day, that is a rate proportional to the size of the entire 
> collection, no?

My point above is merely that the web is very dynamic and that it is
important to be able to update the database very frequently.  In other
words, I am arguing for a system capable of performing real-time (or
near real-time) updates with an acceptable level of performance.  I am
not arguing above that you should crawl all of those urls as they
change, because there are scarce resources that need to be balanced.

If you want a more cogent argument to challenge the assumption about
proportional update size, let me present the following.  Economics may
dictate that updates do not grow proportionally with the index size.
Consider, for example, a scenario where a resource, say bandwidth (it
could equally be crawler hardware), is fixed in the medium term (6-12
months) because of budget allocations, etc.  The index database size
will grow because new crawling is being performed, but the update size
will remain the same because of the resource constraint, in this case
bandwidth.  In this scenario the updates do not remain proportional to
the index size.  I made a point in the previous post that it is
important to keep "important" dynamic pages fresh.  In this scenario,
instead of increasing the update size, you have to become smarter
about how you allocate bandwidth to crawling and recrawling.  The
problem remains that bandwidth is fixed, but the index continues to
grow.  Perhaps in the long term, say next year, you can buy more
crawler machines or bandwidth, but in the meantime update performance
will degrade linearly: if each update pass has to re-stream the entire
database, the cost per updated page grows with the index size while
the number of pages updated stays fixed.
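
To make that last point concrete, here is a rough back-of-the-envelope
sketch (the 100 bytes/url and 10 MB/s per drive figures echo the
example Doug gives below; the 5M pages/day crawl budget is just an
illustrative guess):

// Rough illustration: with a fixed crawl budget, a full-copy merge
// makes the cost *per updated page* grow linearly with the index size.
public class FixedBandwidthSketch {
  public static void main(String[] args) {
    double pagesPerDay = 5_000_000;   // fixed by bandwidth budget
    double bytesPerUrl = 100;         // assumed db record size
    double mbPerSecPerDrive = 10;     // assumed transfer rate
    int drives = 10;

    for (long dbSize : new long[] {1_000_000_000L, 2_000_000_000L,
                                   4_000_000_000L}) {
      // Full re-stream merge: read and write the whole db once.
      double mergeSeconds = 2 * dbSize * bytesPerUrl
          / (mbPerSecPerDrive * 1e6 * drives);
      double msPerUpdatedPage = 1000 * mergeSeconds / pagesPerDay;
      System.out.printf(
          "db=%,d urls  merge=%.0f min  cost/updated page=%.2f ms%n",
          dbSize, mergeSeconds / 60, msPerUpdatedPage);
    }
  }
}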

> 
> It comes down to accessing at seek rate or transfer rate.  
> Assume a drive can seek in 10ms and can transfer at 10MB/s, 
> and assume you've got a 1B url collection stored on 10 
> drives.  To update just 1% of the urls at seek rate requires 
> 10M seeks, or 1M seeks/drive, requiring around 3 hours.  To 
> update an arbitrary percentage at transfer rate (copying the 
> database, merging in changes), assuming 100 bytes per url, 
> requires 100GB of transfer, or 10G/drive, requiring around 17 
> minutes.  The breakeven point is around 0.01%: only for less 
> than 100k urls updated out of 1B would seek-based updates be 
> faster than transfer-based.
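
For reference, here is a small sketch that recomputes those seek-rate
vs. transfer-rate figures from the stated assumptions (10 ms seeks,
10 MB/s per drive, 100 bytes/url, 10 drives, one seek per updated
url); the exact breakeven shifts with how many seeks each in-place
update really costs:

// Recompute the seek-rate vs. transfer-rate figures above.
public class SeekVsTransfer {
  public static void main(String[] args) {
    long urls = 1_000_000_000L;   // 1B url collection
    int drives = 10;
    double seekSec = 0.010;       // 10 ms per seek
    double mbPerSec = 10;         // 10 MB/s per drive
    double bytesPerUrl = 100;

    // Seek-based update of 1% of the urls, one seek per url.
    double seekSeconds = (0.01 * urls / drives) * seekSec;

    // Transfer-based update: copy the whole db, merging in changes.
    double copySeconds = urls * bytesPerUrl / (mbPerSec * 1e6 * drives);

    // Fraction at which one seek per url costs the same as a full
    // copy (more seeks per update pushes this fraction lower).
    double breakeven = copySeconds / ((double) urls / drives * seekSec);

    System.out.printf("seek-based 1%% update: %.1f hours%n",
        seekSeconds / 3600);
    System.out.printf("full-copy update: %.0f minutes%n",
        copySeconds / 60);
    System.out.printf("breakeven fraction: %.2f%%%n", breakeven * 100);
  }
}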

I would like to advocate a middle-of-the-road architecture where
updates don't require one seek per document and also don't require
re-streaming the entire database.  With the right architecture, you
should be able to group updates into smaller batches, say thousands of
documents, which more closely resembles a continuous transactional
system while still providing fast performance.  With smaller batches,
you might be able to achieve something on the order of O(1) seeks per
batch update instead of O(n) seeks (one per document) with no
batching.  For example, imagine that you had a database organized by
host, or by host and some portion of the url, as described in the
Internet Archive Wiki.  If you sorted your updates by host, then you
could read the portion of the database for a given host into an
in-memory cache with a single seek.  You could then stream all of the
updates for that host out to disk in a second sequential transfer.
This operation would require two disk seeks per batch: one to read the
appropriate block of database leaves into memory and another to write
the updated block back to disk.  However, the amount of data read and
written would be on the order of megabytes rather than gigabytes.
This model would avoid the O(n) seeks and would not require the entire
database to be rewritten.
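
Here is a minimal sketch of that host-batched update idea.  The
host-block storage below is just an in-memory map standing in for a
block of db leaves on disk, and none of the names are existing Nutch
APIs; it is only meant to show the grouping logic:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of host-batched updates: group pending updates by host so
// each host's block of the db is read once and written back once,
// instead of one seek per document.
public class HostBatchedUpdate {

  // Stand-in for the on-disk db, keyed by host.  In a real system
  // each value would be a contiguous block of leaves on disk.
  static final Map<String, Map<String, String>> db = new HashMap<>();

  record Update(String host, String url, String newRecord) {}

  static void applyBatch(List<Update> pending) {
    // Sort/group updates by host (TreeMap keeps hosts in the same
    // order a host-sorted db would be laid out on disk).
    Map<String, List<Update>> byHost = new TreeMap<>();
    for (Update u : pending) {
      byHost.computeIfAbsent(u.host(), h -> new ArrayList<>()).add(u);
    }
    for (Map.Entry<String, List<Update>> e : byHost.entrySet()) {
      // Seek #1: read this host's block into memory.
      Map<String, String> block =
          db.computeIfAbsent(e.getKey(), h -> new HashMap<>());
      // Apply all updates for the host in memory.
      for (Update u : e.getValue()) {
        block.put(u.url(), u.newRecord());
      }
      // Seek #2 (in a real store): stream the updated block back out
      // in one sequential transfer; with this in-memory stand-in the
      // put below is effectively a no-op.
      db.put(e.getKey(), block);
    }
  }

  public static void main(String[] args) {
    applyBatch(List.of(
        new Update("example.com", "http://example.com/a", "fetched"),
        new Update("example.com", "http://example.com/b", "fetched"),
        new Update("example.org", "http://example.org/", "fetched")));
    System.out.println(db);
  }
}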

My point is that there is a middle ground between a large batch update
approach and per-document atomic operations that could provide
adequate performance for both large and small updates.  One possible
way to do this is the hashed hybrid approach described in the Webbase
paper, which was one of the ideas I proposed.  The underlying storage
of such a system could easily be swapped out, much as MySQL supports
multiple storage engines: MapFile, log-structured (perhaps BDB),
Lucene-like, etc., depending on the needs of the deployment.
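
To gesture at what such a pluggable layer might look like (the
interface and method names below are hypothetical, not anything that
exists in Nutch today):

import java.io.IOException;
import java.util.List;

// Hypothetical storage abstraction: the update/merge logic codes
// against this interface, and MapFile-, log/BDB-, or Lucene-style
// backends provide implementations suited to the deployment.
public interface WebDbStore {

  /** Read the record for a single url (may cost a seek). */
  byte[] get(String url) throws IOException;

  /**
   * Apply a host-sorted batch of updates.  A MapFile-style backend
   * might re-stream whole segments; a log/BDB backend might apply
   * them in place; a hybrid backend might rewrite one host block.
   */
  void applyBatch(String host, List<byte[]> updates) throws IOException;

  /** Flush any buffered updates so readers see a consistent db. */
  void sync() throws IOException;
}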
 
> Doug
>