You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@couchdb.apache.org by Apache Wiki <wi...@apache.org> on 2008/03/29 22:22:55 UTC

[Couchdb Wiki] Update of "Why" by NoahSlater

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Couchdb Wiki" for change notification.

The following page has been changed by NoahSlater:
http://wiki.apache.org/couchdb/Why

The comment on the change is:
Copied from original CouchDB wiki

New page:
''This is a work in progress, I hesitate to even call it a rough draft yet. It's put on the wiki to make collaborative edits easier.''

With enough work, you can find a way to [http://en.wikipedia.org/wiki/Database_normalization express anything relationally]. This is the "brilliant" insight into relational theory that [http://en.wikipedia.org/wiki/Edgar_F._Codd Codd] had and every relational database zealot has harped on forever since. The worst of these zealot will tell you SQL is an abomination because isn't really relational, it adheres only to "some" of the relational principles. [http://www.dbdebunk.com/index.html Some people seem to have made entire careers out of zealotry.] Yet SQL has been very successful for a very long time, despite the obvious deficiency of not being "fully relational". Why? Why is it successful if it doesn't and can't adhere strictly to these relational principles?

Because no one really gives a shit.

Why should a normal person care in the least about relational purity? What they want is a database that stores their data, tells them interesting things about it and gives it back to them reliably.

''The reason SQL is successful is not because of how well adheres to relational principles, but because it deviates from those principles in ways that are convenient in real use.''

On the other end of the spectrum, the opposite of the pure relational model, you have a world of pure objects, where data is treated as unstructured, indivisible units and more closely represents how data is actually represented and passed around in the real world. The relatively recent object-oriented programming paradigm is an extension of this model and in storage we've had the concept of files since just about forever.

This self contained, isolated object model, as it relates to the web, is what's also becoming popularly known as the [http://en.wikipedia.org/wiki/REST REST] model. And as so many of its proponents love to claim, it scales like crazy. In fact, REST, representing and manipulating data as address chunks, is a realization and a formalization of how most of the internet works and scales.

But just like very little of our data is completely relational in nature, very little of our data is completely distinct or unstructured. Most of our data is has some common structure and is completely self contained, but refers to external data and concepts.

Examples of data that has characteristics of both are easy to find, for example a restaurant receipt. A receipt is both a self contained document and relational piece of data, its purpose is to record data and meta-data about a financial transaction. It records definitive facts, things that are always true, like the method of payment and the amount paid, but it also records more conceptual things, like the name of the restaurant involved and the dishes served. Recipes and ingredients change, and restaurants can be renamed, sold and moved. The link between receipt and these external concepts, like the restaurant, can be broken in a number of ways. Even the value of money changes over time, not just with inflation, but also units of currency can be completely revalued by governments. The information on that receipt, at any point in time, is of unknown accuracy and verifiability. And yet despite all these short comings, a receipt is still useful.

And so the receipt data, despite some relational characteristic, is problematic to represent "correctly" in a relational database, because there is no absolute answer to what the meaning of each piece of data is and how it relates to everything else. In paper form, the receipt preserves the original restaurant information regardless what happens in the real world, it’s not just a record of relations but of facts at a place in time. And even if the all the things that link the receipt back to the originating restaurant change, the receipt still has value. This is the "temporal" aspect of data and is something the relational model doesn't handle well.

But like pure relational modeling isn't very productive for many real world problems, modeling data as isolated objects isn't terribly useful either. Beyond acting as a fancy file server (like WebDAV), the REST model for data storage and sharing is too simple to build real applications on, objects alone are far too restrictive to use as a general storage model, it doesn't do enough of the things we want and expect from our applications. So instead SQL databases, with their abilities to slice and dice and tell us interesting things about our data, continue to dominate.

What is needed is a way to make these discrete objects be a little more relational. To aggregate, organize and report on data that hasn't already been decomposed into tables.

This was very much the idea behind WinFS, Microsoft's spectacularly brilliant and disastrous unifying vision for data modeling in the Windows world. The problem was Microsoft couldn't make it work efficiently (not to mention they seriously overreached, trying to make WinFS a solution to solve all problems, before it ever solved a single problem).

One of the reasons I think WinFS has failed is it is too rooted in relational technology, the reporting engine is based in large part on SQL Server. MS is far from the first to fail in this area, building object databases on top of relational technology. In Oracle-land, attempts to reproduce the similar models even has a name, [http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html Entity Attribute Value model], and is generally considered a very bad idea. To this day I've never seen anyone build it on a relational store, make something flexible, easy to use and that gives suitable performance.

Even IBM has been working for years to get DB2 to serve as a persistence layer for Notes and Domino, and they still can't get it stable, easy and performant enough to be released gold. IBM is working on it because customers keep saying how much they love the idea, but the reality is that it's damn hard to build on top of a relational database. Essentially they are try to emulate one general database and reporting model on top of another general purpose database and reporting model, but the models conflict.

The updates and usage model of CouchDb is a lockless, optimistic model that is pretty much the opposite of how the relational model works. To edit an object, you load it up, make your changes, then save them back to the database. If someone else edits at the same time and they save their changes before you, you'll get a conflict error on save. To resolved the conflict, the updated document can be opened and edits can be applied to it and try again.

This storage and access model is such that any number of hosts can be replicating while simultaneously any number of users can be reading, editing and deleting documents, performing lookups and reads in computed tables which are simultaneously being updated for other users. Document writes are serialized, except for file attachments and binary blobs, which are written concurrently. And the database never needs to “lock” to do this.

Relational databases, or more specifically SQL databases, simply don't make good document databases, the fundamental set of compromises made is all wrong. If you want something that performs well and is reliable, you need to start from scratch and think carefully about the whole design. Lotus Notes did this long ago, and over the years a lot of very talented people have been successfully adapting and expanding that design to keep it competitive. And Notes and Domino continue to be big money makers despite how many people are other annoyed by the software, because it makes it simple to do the things that a nightmares to deal with in SQL.

CouchDB addresses the data storage problem from the non-relational, RESTful point of view, and is designed and optimized in multiple, complimentary ways for that world view of semi-structured, document oriented data. And by providing a specific but flexible document storage, query and replication model, it hits the sweet spot storage and reporting for things that are more "document oriented" than relational, a huge class of problems in application development.

Just as SQL databases rarely are used for data that is entirely structured and relational in nature, Couch databases rarely will be used for data that has no structure, but where the interrelations and structure aren't strong and might be broken. The CouchDB computed table reporting engine adds back structure and organizes semi-structured, document data and while allowing it to stay in it's preferred state.

The CouchDB document and query model aggregates documents into a manageable tables. Unlike SQL database, where your data must be carefully decomposed into tables ahead of time, the tables on CouchDB are built on the fly and aggregates the already existing data.