Posted to dev@nutch.apache.org by "Dalton, Jeffery" <jd...@globalspec.com> on 2005/11/15 23:15:59 UTC

Nutch WebDb storage alternatives: Revisited

I wanted to revive some discussion about the webdb storage
possibilities.  I ran across this thread from awhile back:
http://marc.theaimsgroup.com/?l=nutch-developers&m=109958192318128&w=2
where there was some discussion about design alternatives for the WebDb.

 
Let me lay out a scenario to illustrate the problem as I see it.  Let's
say that we have a 10 billion page database (quite reasonable by today's
standards).  Assuming roughly 1 KB of database entry per page (in line
with Doug's 100 bytes per link and 10 links per page below), the PageDb
should require roughly 10 terabytes of data.  Using Doug's update
numbers below, executing an update on a single PageDb using the MapFile
approach should require roughly 900 hours of disk time.  Yes?  The
problem I observe is that as you scale up the size of the database, the
size of the batch updates needs to scale proportionally as well in order
to achieve the same update performance.  If you don't scale the size of
your batch updates, then your performance will degrade because the
entire database still must be re-written, but you are updating less of
it.

 

To illustrate, say we want to update a database of 100 million pages,
roughly 100 GB of data in the PageDb, with a new set of 20 million pages
(20 GB).  This should take approximately 9-10 hours (again using Doug's
numbers below), so you have 20 million updates divided by 9 hours
(9 * 60 * 60 = 32,400 seconds) = roughly 617 updates / second.  Very
good performance.  Now let's try the same update on a larger database --
our aforementioned database of 10 billion pages (10 terabytes).
Re-writing this database file takes approximately 900 hours, so you have
20 million updates in 900 hours = 6.17 updates / second.  This hopefully
shows that as your database scales and your batch size remains the same,
your performance degrades linearly in relation to the size of the
database.  In order to achieve the same performance as the 100 million
page database, the update batch must be 2 billion pages (20% of the
database).
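
To make the scaling explicit, here is a rough back-of-the-envelope
sketch in Java.  The transfer rate, the ~1 KB per page entry, and the
read / re-write split are just the assumptions borrowed from Doug's
numbers below, not measurements:

  // Rough sketch: update throughput for a fixed 20M-page batch as the
  // database grows, assuming the whole db is read and re-written
  // sequentially.  Assumed figures: 10 MB/s transfer, ~1 KB of db entry
  // per page, and a re-write taking roughly twice as long as the read
  // (per Doug's "three hours to read and six hours to re-write").
  public class UpdateRateEstimate {
      public static void main(String[] args) {
          double bytesPerSecond = 10e6;    // 10 MB/s sequential transfer
          double bytesPerPage = 1000.0;    // 100 bytes/link * 10 links/page
          long batchPages = 20000000L;     // fixed batch of 20 million pages

          long[] dbSizes = { 100000000L, 10000000000L }; // 100M and 10B pages
          for (long dbPages : dbSizes) {
              double readSeconds = dbPages * bytesPerPage / bytesPerSecond;
              double totalSeconds = 3 * readSeconds;  // read + slower re-write
              System.out.printf("%d pages: ~%.0f hours, ~%.0f updates/sec%n",
                      dbPages, totalSeconds / 3600.0, batchPages / totalSeconds);
          }
      }
  }

It prints roughly 8 hours and 667 updates/sec for the 100 million page
case versus roughly 833 hours and 7 updates/sec for the 10 billion page
case -- the same ballpark as the 617 and 6.17 figures above, and the
hundred-fold gap is the point.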

 

One solution is to use ever-increasing batch sizes by merging large
segments into even larger batches before merging into the database;
however, at this scale I believe that process becomes unwieldy when we
are talking about hundreds of gigabytes of data.  Now of course, you can
argue that the answer is to buy more machines (disks), because the hours
are I/O hours, and that you can parallelize the process into an
operation that takes days instead of months.  However, I would argue
that this hasn't helped per-update performance: you still have 6.17
updates/second.  Do people see the problem?  Am I wrong about this?  I
believe that this problem points us in the direction of creating an
architecture that, while it might not be as fast at updating smaller
databases, will hopefully scale more gracefully.  In other words, I
would like to be able to scale the size of the index without having to
increase the size of my batch updates, and without taking a major
performance hit.

 

I am arguing that it is not feasible to wait until we have crawled a
batch that is a sizable fraction of the WebDb before updating the
link / page databases.
I believe that in order for Nutch to scale and mature, it will have to
adopt a more incremental approach that better balances smaller updates
with space and time performance constraints.   

 

If I am right, I see this as an amazing opportunity to improve
scalability and performance.  I have some ideas about how to implement a
more incremental approach that I would love to share if people are
interested, but I thought it best to try to identify the problem before
spouting off possible solutions when the premise may not be correct.

 

Lastly, Doug, I was also hoping that you could clarify one point in
your previous post that I don't quite understand:

	With 100 bytes per link and 10 links per page, a 100M page database
	requires 100GB.  At 10MB/second transfer rate this takes on the
	order of three hours to read and six hours to re-write, even with
	tens of millions of updates.

I believe the three hours is an approximation of 100GB / (10 MB/sec),
converted to minutes and then hours.  I guess I am not understanding
how you got the six hours to re-write the data.  You doubled the read
time?  Why?

 

Thanks, 

 

- Jeff Dalton

 

Doug's original post on the WebDb thread, quoted for convenience:

> 2) use something like Berkeley DB, which will increase space usage by
> I'd guess about 100-150%, but will allow for fast
> inserts/updates/deletes.  Sounds better to me than the current
> approach, but for large installations we may run into hardware limits
> without compressing the data.  I've heard of Berkeley DB being used to
> store 100 GB databases.  I guess a large Nutch installation may push
> or break that size.

We started out using Berkeley DB and it became very slow when the
database was large.  The problem is that B-Trees get fragmented as they
grow.  Each update eventually requires a random access, a disk seek,
which takes around 10 milliseconds.

Consider this: if each B-tree page holds, say, 100 pages or links, and
we're updating at least 1% of all entries in the B-tree, then in the
course of a db update we'll visit every page in the B-tree, each with a
random access.  It is much faster to pre-sort the updates and then merge
them with the database.  All disk operations are then sequential and
hence operate at the transfer rate, typically around 10MB/second, nearly
100 times faster than random seeks.
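
As a rough illustration of this sort-then-merge idea, here is a minimal
sketch in Java (illustrative only, not Nutch's actual update code; the
line-oriented record format and the class name are made up).  Both the
existing db and the batch of updates are sorted by URL, so one
sequential pass over each produces the new db without any seeks:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.PrintWriter;

  // Illustrative merge of a sorted update batch into a sorted db.
  // Records are lines of the form "url<TAB>data", sorted by url.
  public class SortedMerge {
      public static void merge(BufferedReader db, BufferedReader updates,
                               PrintWriter out) throws IOException {
          String d = db.readLine();
          String u = updates.readLine();
          while (d != null && u != null) {
              int cmp = key(d).compareTo(key(u));
              if (cmp < 0) {            // existing entry, no update: copy through
                  out.println(d);
                  d = db.readLine();
              } else if (cmp > 0) {     // brand-new entry from the update batch
                  out.println(u);
                  u = updates.readLine();
              } else {                  // updated entry replaces the old one
                  out.println(u);
                  d = db.readLine();
                  u = updates.readLine();
              }
          }
          for (; d != null; d = db.readLine()) out.println(d);      // rest of db
          for (; u != null; u = updates.readLine()) out.println(u); // rest of updates
      }

      private static String key(String line) {   // url is the first field
          int tab = line.indexOf('\t');
          return tab < 0 ? line : line.substring(0, tab);
      }
  }

Every read and write here is sequential, so the cost is proportional to
the size of the two inputs rather than to the number of entries touched
at random.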

The last time I benchmarked the db sorting and merging code on large
collections it was disk i/o bound.  Is this no longer the case?  When
performing an update on a large (>10M page) db, what are the CPU and
disk utilizations?

In short, maintaining a link graph is a very data intensive operation.
An RDBMS will always use a B-tree, and will always degenerate to random
accesses per link update when the database is large.  Fetching at 100
pages per second with an average of 10 links per page requires 1000
link updates per second in order for the database to keep up with
fetching.  A typical hard drive can only perform 100 seeks per second.
So any approach which requires a random access per link will fail to
keep up, unless 10 hard drives are allocated per fetcher!
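
The same arithmetic as a tiny sketch (the fetch rate, links per page,
and seek rate are the assumed figures above):

  // Seek budget: link updates needed per second vs. seeks one disk can do.
  public class SeekBudget {
      public static void main(String[] args) {
          double pagesPerSecond = 100.0;        // fetch rate per fetcher
          double linksPerPage = 10.0;
          double seeksPerSecondPerDisk = 100.0; // ~10 ms per random seek

          double linkUpdatesPerSecond = pagesPerSecond * linksPerPage;
          double disksNeeded = linkUpdatesPerSecond / seeksPerSecondPerDisk;
          System.out.printf("%.0f link updates/sec -> %.0f disks per fetcher%n",
                  linkUpdatesPerSecond, disksNeeded);
      }
  }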

With 100 bytes per link and 10 links per page, a 100M page database
requires 100GB.  At 10MB/second transfer rate this takes on the order
of three hours to read and six hours to re-write, even with tens of
millions of updates.  With two 10ms seeks required per update, only
around 1M links could be updated in six hours.

So, yes, the implementation Nutch uses does use a lot of space, but it 
is very scalable.

Doug


Re: Nutch WebDb storage alternatives: Revisited

Posted by Andrzej Bialecki <ab...@getopt.org>.
Russell Mayor wrote:

>One approach that I looked at a little while ago used the notion that, when
>sorted, each url in a list is likely to have a character sequence at its
>start that is common with the previous url.  This is especially true when
>nutch is being used for focussed / deep crawling and when the page database
>is large. The idea was then to encode the url of each page in the web
>database in two parts:
>- the number of characters that the url has in common with the previous page
>in the database
>- the remaining characters from the url once the common part has been
>removed.
>
>ie, the url list:
>- http://www.awebsite.com/section1/page1.html
>- http://www.awebsite.com/section1/page2.html
>- http://www.awebsite.com/section2/page3.html
>- ...
>
>would be encoded as
>- http://www.awebsite.com/section1/page1.html
>- 34 page2.html
>- 23 2/page3.html
>- ...
>
>  
>

I was working on this and implemented it as a StringListWritable.  And
then after testing I decided not to use it.  The space savings are worse
than you think, because for every string you need to save an int with
its length.  This means 4 bytes wasted per entry.  Overall I couldn't go
lower than a 25-28% saving, using Deflater with the default level (5?).

>I got part way through implementing this for nutch 0.6, but didn't get it
>completed. I also heard on the list that the mapred branch of nutch was
>going to have a substantially re-worked web database, so I would have been
>trying to hit a moving target with my optimisation. Does it sound suitable
>for the new web database (I'm not familiar with the mapred branch of
>nutch)?
>  
>
You will find the mapred version much much more responsive.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch WebDb storage alternatives: Revisited

Posted by Russell Mayor <ru...@gmail.com>.
Out of interest, has anyone ever looked at compressing the data that is
stored in the web database?  There is a lot of text in there that could
be stored in much less space than it currently occupies.

I realize that this would increase serialization / de-serialization time,
but would reduce corresponding disk transfer times, so it would be important
to use a fast encoding.

One approach that I looked at a little while ago used the notion that, when
sorted, each url in a list is likely to have a character sequence at its
start that is common with the previous url.  This is especially true when
nutch is being used for focussed / deep crawling and when the page database
is large. The idea was then to encode the url of each page in the web
database in two parts:
- the number of characters that the url has in common with the previous page
in the database
- the remaining characters from the url once the common part has been
removed.

ie, the url list:
- http://www.awebsite.com/section1/page1.html
- http://www.awebsite.com/section1/page2.html
- http://www.awebsite.com/section2/page3.html
- ...

would be encoded as
- http://www.awebsite.com/section1/page1.html
- 34 page2.html
- 23 2/page3.html
- ...
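
A minimal sketch of that encoding in Java (a made-up helper class, not
the nutch 0.6 patch; it always takes the longest common prefix, so its
counts may differ slightly from the illustrative numbers above):

  import java.util.ArrayList;
  import java.util.List;

  // Front-coding of a sorted url list: each url is stored as the number
  // of leading characters it shares with the previous url, plus the
  // remaining suffix.
  public class FrontCoding {
      public static List<String> encode(List<String> sortedUrls) {
          List<String> out = new ArrayList<String>();
          String prev = "";
          for (String url : sortedUrls) {
              int common = 0;
              int max = Math.min(prev.length(), url.length());
              while (common < max && prev.charAt(common) == url.charAt(common)) {
                  common++;
              }
              out.add(common + " " + url.substring(common));
              prev = url;
          }
          return out;
      }

      public static List<String> decode(List<String> encoded) {
          List<String> out = new ArrayList<String>();
          String prev = "";
          for (String entry : encoded) {
              int space = entry.indexOf(' ');
              int common = Integer.parseInt(entry.substring(0, space));
              String url = prev.substring(0, common) + entry.substring(space + 1);
              out.add(url);
              prev = url;
          }
          return out;
      }
  }

Decoding only needs the previously decoded url, so a sequential scan
from a seek point can reconstruct the urls as it goes.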

The random access of the page database is performed by a coarse seek
into the sorted list of pages, followed by a sequential fine scan.  So,
for the above approach to work it's necessary to ensure that the pages
at the coarse seek points aren't compressed.

I got part way through implementing this for nutch 0.6, but didn't get it
completed. I also heard on the list that the mapred branch of nutch was
going to have a substantially re-worked web database, so I would have been
trying to hit a moving target with my optimisation. Does it sound suitable
for the new web database (I'm not familiar with the mapred branch of
nutch)?

Russell

Re: Nutch WebDb storage alternatives: Revisited

Posted by Doug Cutting <cu...@nutch.org>.
Dalton, Jeffery wrote:
> To illustrate, say we want to update a database of 100 million pages,
> roughly 100 GB of data in the PageDb, with a new set of 20 million
> pages (20 GB).  This should take approximately 9-10 hours (again using
> Doug's numbers below), so you have 20 million updates divided by 9
> hours (9 * 60 * 60 = 32,400 seconds) = roughly 617 updates / second.
> Very good performance.  Now let's try the same update on a larger
> database -- our aforementioned database of 10 billion pages (10
> terabytes).  Re-writing this database file takes approximately 900
> hours, so you have 20 million updates in 900 hours = 6.17 updates /
> second.  This hopefully shows that as your database scales and your
> batch size remains the same, your performance degrades linearly in
> relation to the size of the database.  In order to achieve the same
> performance as the 100 million page database, the update batch must be
> 2 billion pages (20% of the database).

I assume that update batch sizes are proportional to the database size. 
  If you mostly have small constant-sized updates, then a different 
datastructure might be more appropriate, like a B-tree (as used in 
relational databases).  In Nutch, the prototypical use case is crawling 
large web collections, which generally has update batches proportional 
to the total size of the database.  In such cases this approach is 
vastly superior to B-trees.

> I believe that in order for Nutch to scale and mature, it will have to
> adopt a more incremental approach that better balances smaller updates
> with space and time performance constraints.   
> 
> If I am right, I see this as an amazing opportunity to improve
> scalability and performance.  I have some ideas about how to implement
> a more incremental approach that I would love to share if people are
> interested, but I thought it best to try to identify the problem before
> spouting off possible solutions when the premise may not be correct.

I'd love to hear your ideas.  The approach that occurs to me would be to
have multiple databases that are merged softly on access.  This is akin
to Lucene's index segments.  A small update could write a new small db.
When one, e.g., asks for the set of incoming links to a page, one could
first ask the full db, then the small db, and combine the results.  In
the general case there could be a stack of databases of logarithmically
increasing size.  When more than N dbs of the same size are on the
stack, they can be popped, merged, and the result pushed back on.
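
A rough sketch of that merge policy (hypothetical names, with in-memory
sorted lists standing in for on-disk dbs):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;

  // Hypothetical sketch of the stacked-db idea: sub-dbs are grouped into
  // size levels; once more than N accumulate at one level they are merged
  // into a single db at the next level, so only O(log) dbs ever exist.
  public class TieredDbStack {
      private static final int N = 10;   // max dbs per level before merging

      // level -> dbs at that level (a "db" here is just a sorted list)
      private final Map<Integer, List<List<String>>> levels =
              new TreeMap<Integer, List<List<String>>>();

      public void addSmallDb(List<String> sortedEntries) {
          insert(0, sortedEntries);
      }

      private void insert(int level, List<String> db) {
          List<List<String>> bucket = levels.get(level);
          if (bucket == null) {
              bucket = new ArrayList<List<String>>();
              levels.put(level, bucket);
          }
          bucket.add(db);
          if (bucket.size() > N) {            // pop, merge, push one level up
              List<String> merged = new ArrayList<String>();
              for (List<String> d : bucket) {
                  merged.addAll(d);
              }
              Collections.sort(merged);       // stands in for a sequential merge
              bucket.clear();
              insert(level + 1, merged);
          }
      }

      // A lookup consults every sub-db and combines the results.
      public List<String> lookup(String urlPrefix) {
          List<String> results = new ArrayList<String>();
          for (List<List<String>> bucket : levels.values()) {
              for (List<String> db : bucket) {
                  for (String entry : db) {
                      if (entry.startsWith(urlPrefix)) {
                          results.add(entry);
                      }
                  }
              }
          }
          return results;
      }
  }

An update then only re-writes dbs whose combined size is comparable to
the batch, at the cost of consulting several dbs at read time, much as
Lucene does with its index segments.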

Also, before you develop a lot of new code, please note that in the 
mapred branch the webdb has been broken in two, into a crawldb and a 
linkdb.  These are batch-only and, as yet, without an abstract API: 
they're simply accessed with MapFile.

> Lastly, Doug, I was also hoping that you could clarify one point in
> your previous post that I don't quite understand:
> 
> 	With 100 bytes per link and 10 links per page, a 100M page database
> 	requires 100GB.  At 10MB/second transfer rate this takes on the
> 	order of three hours to read and six hours to re-write, even with
> 	tens of millions of updates.
> 
> I believe the three hours is an approximation of 100GB / (10 MB/sec),
> converted to minutes and then hours.  I guess I am not understanding
> how you got the six hours to re-write the data.  You doubled the read
> time?  Why?

I think I just meant the total time to read and write it would be six hours.

Doug