You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@couchdb.apache.org by Steven Ringo <go...@stevenringo.com> on 2011/12/27 22:31:27 UTC

Database design question

Hi all,

(I originally posted this on the mobile-couchbase list, but was 
recommended to post it here instead, since this is more of a general 
couchdb question)

I am struggling with something that I know would be quite trivial to do 
in SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 
years of experience with SQL is clouding my judgement and so I am trying 
rather to break away from the SQL/relational mindset and understand how 
this would be better designed in couch.

I am developing an iOS art gallery app, and wish to display information 
about artworks, artists and galleries.

The menu screen requires a list of all artworks together with the 
gallery they are currently housed at and the artist that created the 
artwork:

  * artwork title
  * artwork medium
  * artwork dimensions
  * artist full_name
  * gallery name,
  * gallery city

for each artwork, respectively.

Imagine the following objects in my database (I am not including all the 
fields for sake of brevity. i.e. only the ones that are important).

|     Artist
     ------
     _id
     full_name
     birth_year
     birth_place

     Gallery
     -------
     _id
     name
     city

     Artwork
     -------
     _id
     title
     medium
     dimensions
     artist_id_ (foreign key)
     gallery_id_ (foreign key)
|

In SQL it would be as easy as one query with two joins:

|SELECT
     artwork.title,
     artwork.medium,
     artwork.dimensions,
     artist.full_name,
     gallery.name,
     gallery.city
FROM
     artwork INNER JOIN artist
     ON artwork.artist_id_ = artist._id INNER JOIN gallery
     ON artwork.gallery_id_ = gallery._id
|

 From this I would be able to get the data I need in one query (i.e. one 
call to the database).

|     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
     Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
     David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
|

Using couchdb I am storing the data in a very similar way to the way I 
would in a relational database, represented as something along the lines of:

|{
    "type"       : "Artwork",
    "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
    "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
    "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
    "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
    "title"      : "Guernica",
    "medium"     : "Oil on canvas",
    "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
}

{
    "type"       : "Artist",
    "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
    "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
    "full_name"  : "Pablo Picasso",
    "birth_year" : "1881"
}

{
    "type"       : "Gallery",
    "_id"        : "d665afaedde12b44c5159cf1782d44af",
    "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
    "name"       : "Museo Reina Sofia"
}
|

To get the same result in couch, I would have to do one query to get all 
the artworks, e.g.

|function(doc) {
   if(doc.type&&  doc.type == 'Artist') {
     emit(doc._id,doc);
   }
}
|

then I would need to do a |GET| for the _id of each |Gallery| and 
|Artist| returned per |Artwork| respectively to fill in the missing 
information. This is usually known as the |n+1| queries problem (in this 
case |2n+1|).

I have always worked on the assumption that the database should be 
leveraged to do as much of the heavy lifting as possible with as few 
network hits as possible (i.e. SQL is always faster than client 
manipulation). With a list of 120 artworks, this  means 241 HTTP GETs to 
the database.

I am aware of "linked documents" in views 
(http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents), 
however this seems to only work with one related row.

I can't really see how "view collation" 
(http://wiki.apache.org/couchdb/View_collation) would work here either. 
With view collation the correlated information ends up being on 
different rows, and I think that would be better suited for where say a 
single Artists or Gallery is displayed with associated Artworks. My case 
is the opposite, whereby Artworks plus corresponding information from 
Artists and Galleries is required.

I am also aware that one could embed each gallery or artist in each 
artwork or vice versa, but that seems to be a lot of duplication, and 
would make it difficult to change artist or gallery information if 
required. Maybe this is how it should be done?

For this small app, this is not going to be a performance issue. I am 
trying rather to break away from a SQL/relational mindset and understand 
how this would be better designed in a database like couch.

Thanks,

Steve


Re: Database design question

Posted by Sean Copenhaver <se...@gmail.com>.
Paul that was a great follow up that made my little couchdb lightbulb brighten.


- Sean Copenhaver

On Dec 28, 2011, at 6:42 AM, Steven Ringo <go...@stevenringo.com> wrote:

> Thanks for this comprehensive explanation. Have an even deeper understanding now. Good to see I am not the only one that had to start over with the mental models of how a system might work.
> 
> You make some very interesting points about the planner (or lack thereof in CouchDB).
> 
> And I see you have also experienced what happens when an ORM abstraction is misunderstood in terms of its underlying execution. I see this a lot with Rails. As an aside I have also noticed must CouchDB to Rails ORMs trying to make CouchDB perform like a SQL database. This is precisely what I was trying to avoid.
> 
> You also make some great points around compromises. I now am on the road to having a much better idea as to what types of data and their exchanges between one another are best suited to SQL and NoSQL respectively.
> 
> The app I am referring to would have worked very well with a conventional SQL store if it were server-only. SQL's design and normalisation would be its strength, and would require much less code. However as soon as you throw mobile replication and offline use into the mix, the SQL store becomes its weakness. Its here where NoSQL shines, and the minor tradeoffs of denormalisation pale in comparison to the benefits.
> 
> You make an analogy between static and dynamic languages. I think there's another great analogy: centralised vs decentralised version control systems (or Git vs Subversion if you like). I have often described CouchDB as the Git of the database world. Replication and peer-to-peer is ingrained into Git as it is to Couch. SQL and Subversion are both strong single server solutions. Git has tradeoffs including a steeper learning curve and some conceptual issues to get one's head around at the beginning. These are very nicely traded for cheap and easy branching, offline use and peer-replication. Git is terrible at managing large binary attachments, whereas Subversion handles this with aplomb. However try use Subversion at 30 000 feet in an aluminium tube over the Atlantic and you're stuck...
> 
> I am glad this discussion has created some food for thought. Thanks again for your contribution.
> 
> Paul Davis wrote:
>> I'll prefix this with the fact that Sean does an excellent point of
>> responding your core concerns. I've had a bit of this brewing as of
>> late for a various number of reasons. But your email provokes a bit of
>> a response so I think this is as good of a place as any to write down
>> some thoughts.
>> 
>> Also, to preface, my nascent background was to be taught SQL and the
>> relational model by a couple guys that are deep into it. Before I
>> moved to non-relational datastores I was deeply entrenched in a purist
>> relational model. Purist beyond even most DBA's because I was taught
>> from practical application of theory. Things like "null != null" and
>> "null values are a hack around Codd's rules" are relatively benign but
>> points that I find few people have completely groked in the real
>> world. I was lucky to have these two guys teach me from the "this is
>> proper and this is why we ignore that" point of view.
>> 
>> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo<go...@stevenringo.com>  wrote:
>>> Hi all,
>>> 
>>> (I originally posted this on the mobile-couchbase list, but was recommended
>>> to post it here instead, since this is more of a general couchdb question)
>>> 
>>> I am struggling with something that I know would be quite trivial to do in
>>> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
>>> of experience with SQL is clouding my judgement and so I am trying rather to
>>> break away from the SQL/relational mindset and understand how this would be
>>> better designed in couch.
>>> 
>> 
>> This is good. The first thing to realize in learning a new tool is
>> that its not the old tool. New tools behave differently and
>> differences should be embraced rather than rejected outright.
>> 
>>> I am developing an iOS art gallery app, and wish to display information
>>> about artworks, artists and galleries.
>>> 
>>> The menu screen requires a list of all artworks together with the gallery
>>> they are currently housed at and the artist that created the artwork:
>>> 
>>>  * artwork title
>>>  * artwork medium
>>>  * artwork dimensions
>>>  * artist full_name
>>>  * gallery name,
>>>  * gallery city
>>> 
>>> for each artwork, respectively.
>>> 
>>> Imagine the following objects in my database (I am not including all the
>>> fields for sake of brevity. i.e. only the ones that are important).
>>> 
>>> |     Artist
>>>    ------
>>>    _id
>>>    full_name
>>>    birth_year
>>>    birth_place
>>> 
>>>    Gallery
>>>    -------
>>>    _id
>>>    name
>>>    city
>>> 
>>>    Artwork
>>>    -------
>>>    _id
>>>    title
>>>    medium
>>>    dimensions
>>>    artist_id_ (foreign key)
>>>    gallery_id_ (foreign key)
>>> |
>>> 
>>> In SQL it would be as easy as one query with two joins:
>>> 
>>> |SELECT
>>>    artwork.title,
>>>    artwork.medium,
>>>    artwork.dimensions,
>>>    artist.full_name,
>>>    gallery.name,
>>>    gallery.city
>>> FROM
>>>    artwork INNER JOIN artist
>>>    ON artwork.artist_id_ = artist._id INNER JOIN gallery
>>>    ON artwork.gallery_id_ = gallery._id
>>> |
>>> 
>> 
>> The fact that you took the time to specify INNER JOIN is a good tip
>> you have a thorough understanding of SQL.
>> 
>>> From this I would be able to get the data I need in one query (i.e. one call
>>> to the database).
>>> 
>>> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>>    Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>>    David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
>>> |
>>> 
>>> Using couchdb I am storing the data in a very similar way to the way I would
>>> in a relational database, represented as something along the lines of:
>>> 
>> 
>> Roughly speaking, this is your first mistake.
>> 
>>> |{
>>>   "type"       : "Artwork",
>>>   "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>>   "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>>   "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>>   "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>>   "title"      : "Guernica",
>>>   "medium"     : "Oil on canvas",
>>>   "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>>> }
>>> 
>>> {
>>>   "type"       : "Artist",
>>>   "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>>   "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>>   "full_name"  : "Pablo Picasso",
>>>   "birth_year" : "1881"
>>> }
>>> 
>>> {
>>>   "type"       : "Gallery",
>>>   "_id"        : "d665afaedde12b44c5159cf1782d44af",
>>>   "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>>>   "name"       : "Museo Reina Sofia"
>>> }
>>> |
>>> 
>>> To get the same result in couch, I would have to do one query to get all the
>>> artworks, e.g.
>>> 
>>> |function(doc) {
>>>  if(doc.type&&    doc.type == 'Artist') {
>>>    emit(doc._id,doc);
>>>  }
>>> }
>>> |
>>> 
>>> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
>>> returned per |Artwork| respectively to fill in the missing information. This
>>> is usually known as the |n+1| queries problem (in this case |2n+1|).
>>> 
>>> I have always worked on the assumption that the database should be leveraged
>>> to do as much of the heavy lifting as possible with as few network hits as
>>> possible (i.e. SQL is always faster than client manipulation). With a list
>>> of 120 artworks, this  means 241 HTTP GETs to the database.
>>> 
>> 
>> This is your second mistake. And more to the point, this is the first
>> mistake I had to overcome myself when switching from relational to
>> non-relational data stores. Even more specifically, my stumbling block
>> was exactly the same as yours. The assumption that the number of
>> requests/queries is to be minimized.
>> 
>> I was taught quite specifically to minimize the number of queries per
>> page load. Current ORM layers are still abhorrent to me because of the
>> ease in which they translate directly to your worst case, each row in
>> the list ends up being three queries per row. Specifically, people
>> must understand both the relational model and their ORM layer to avoid
>> it from doing the Wrong Thing&trade;.
>> 
>> The conceptual realization that we need to make here is that SQL isn't
>> magical. When we issue a query, it goes through the planner and ends
>> up doing (seriously advanced) optimizations over the indexes and data
>> retrieval methods it has available. The "minimize number of queries"
>> mantra is a bit overly broad here. In practical terms it has the
>> effect of making sure that we're making the best use of indexes and
>> look up patterns. As an analogy, its akin to writing better C because
>> we might have some intuition on how the assembler translation might be
>> generated.
>> 
>>> I am aware of "linked documents" in views
>>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>>> however this seems to only work with one related row.
>>> 
>>> I can't really see how "view collation"
>>> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
>>> view collation the correlated information ends up being on different rows,
>>> and I think that would be better suited for where say a single Artists or
>>> Gallery is displayed with associated Artworks. My case is the opposite,
>>> whereby Artworks plus corresponding information from Artists and Galleries
>>> is required.
>>> 
>> 
>> These are definitely tools in the arsenal, but don't really speak to
>> the core issue. I'll come back to this in a bit.
>> 
>>> I am also aware that one could embed each gallery or artist in each artwork
>>> or vice versa, but that seems to be a lot of duplication, and would make it
>>> difficult to change artist or gallery information if required. Maybe this is
>>> how it should be done?
>>> 
>> 
>> As Sean pointed out, the second thing to overcome is data duplication.
>> Storage is cheaper than CPU and network costs. If you don't expect
>> data to change, you should try and exploit that.
>> 
>>> For this small app, this is not going to be a performance issue. I am trying
>>> rather to break away from a SQL/relational mindset and understand how this
>>> would be better designed in a database like couch.
>>> 
>> 
>> Your honest questions are quite refreshing because they remind me of
>> my own initial exploration. There's lots of people talking but I've
>> never found a really concise description of the difference between the
>> approaches. There's lots of anecdotal blog posts about "scaling" and
>> the such, but nothing I've seen really addresses the core issues.
>> 
>>> Thanks,
>>> 
>>> Steve
>>> 
>> 
>> So, the bottom line is that the difference between relational and
>> non-relational is two things. The first issue that people encounter
>> when moving from relational to non is that the query planner is gone.
>> This seems benign at fist, but affects a huge number of moving pieces
>> in anything but non-trivial applications. If the query planner is
>> gone, then all queries are less than optimal. Think as a DBA if
>> *every* query was a full table scan. Most DBA's I know would scream at
>> the thought. And also be highly confused why anyone would give that
>> up.
>> 
>> There are two answers here. One is about "horizontal scalability".
>> Basically, SQL query planner optimizations don't generally work well
>> when there's a network connection involved. Note that I'm not saying
>> that SQL won't work in a clustered environment. There are companies
>> that have proven otherwise. The issue is that the difference isn't
>> clear to most people. Sure you can make SQL work on multiple machines,
>> but the tradeoffs between scaling SQL horizontally vs vertically are
>> complex and generally not well understood.
>> 
>> The second issue is that non-relational stores have less information
>> to do query optimizations. If you're in the crowd that says "I don't
>> need scaling, I need to not be constrained by a schema" you're in this
>> group. When you don't define a schema you're not allowing for lots of
>> important information to be fed into some really awesome algorithms
>> for data retrieval.
>> 
>> There's a rough analogy here between statically and dynamically typed
>> programming languages. Anyone that says one is better than the other
>> is someone you shouldn't listen to. There are extremes where both are
>> obviously better, but in the middle it gets into a complicated mess
>> where its more important on the observer's point of view than any sort
>> of theoretical points. If you have a well defined data usage, then
>> static might be best. A rule of thumb is, if you can't print your data
>> model on paper and have it be relevant for more than a couple months,
>> then you're probably forcing things a bit.
>> 
>> Granted, there are some obvious optimizations that non-relational
>> stores could make by introspecting data, but its quite important to
>> note the difference here. Relational stores regardless of scaling have
>> the enormous benefit of having the data model defined up front.
>> 
>> The best way I've learned to express this is that relational stores
>> require you to invest in defining how you get data into the store.
>> Non-relational means you have to invest in how to get data out.
>> 
>> So, the end of the day decision is if the traditional relational
>> stores provide you the ability to meet your customer's demands. For a
>> lot of people over the last thirty or more years, this is true. It
>> might be less than the optimal, but it generally works. Some people
>> have recently realized that some of the assumptions in the relational
>> model are artificially constraining. NOSQL is basically a response to
>> this. A sudden realization of "If I ignore this one thing, my problem
>> is a  lot easier." This is the issue that drives people. "What
>> constraint am I willing to give up?" The single piece of advice I
>> would give any relational-to-non convert is, "Analyze you're
>> requirements and ask any new data store what they give up in return
>> for their promise."
>> 
>> CouchDB gives up isolation and consistency (in the traditional sense
>> of transactions), for distribution through replication. We prioritize
>> being consistent over being fast. And we use a funky programming
>> language to make it happen.
>> 
>> Other data stores have different reasons for existing. Redis, which
>> I'm quite fond of, is roughly our opposite in that its geared towards
>> RAM and being fast. There are problems I would solve with either, but
>> no single problem for which I would use both interchangeably.

Re: Database design question

Posted by Steven Ringo <go...@stevenringo.com>.
Thanks for this comprehensive explanation. Have an even deeper 
understanding now. Good to see I am not the only one that had to start 
over with the mental models of how a system might work.

You make some very interesting points about the planner (or lack thereof 
in CouchDB).

And I see you have also experienced what happens when an ORM abstraction 
is misunderstood in terms of its underlying execution. I see this a lot 
with Rails. As an aside I have also noticed must CouchDB to Rails ORMs 
trying to make CouchDB perform like a SQL database. This is precisely 
what I was trying to avoid.

You also make some great points around compromises. I now am on the road 
to having a much better idea as to what types of data and their 
exchanges between one another are best suited to SQL and NoSQL respectively.

The app I am referring to would have worked very well with a 
conventional SQL store if it were server-only. SQL's design and 
normalisation would be its strength, and would require much less code. 
However as soon as you throw mobile replication and offline use into the 
mix, the SQL store becomes its weakness. Its here where NoSQL shines, 
and the minor tradeoffs of denormalisation pale in comparison to the 
benefits.

You make an analogy between static and dynamic languages. I think 
there's another great analogy: centralised vs decentralised version 
control systems (or Git vs Subversion if you like). I have often 
described CouchDB as the Git of the database world. Replication and 
peer-to-peer is ingrained into Git as it is to Couch. SQL and Subversion 
are both strong single server solutions. Git has tradeoffs including a 
steeper learning curve and some conceptual issues to get one's head 
around at the beginning. These are very nicely traded for cheap and easy 
branching, offline use and peer-replication. Git is terrible at managing 
large binary attachments, whereas Subversion handles this with aplomb. 
However try use Subversion at 30 000 feet in an aluminium tube over the 
Atlantic and you're stuck...

I am glad this discussion has created some food for thought. Thanks 
again for your contribution.

Paul Davis wrote:
> I'll prefix this with the fact that Sean does an excellent point of
> responding your core concerns. I've had a bit of this brewing as of
> late for a various number of reasons. But your email provokes a bit of
> a response so I think this is as good of a place as any to write down
> some thoughts.
>
> Also, to preface, my nascent background was to be taught SQL and the
> relational model by a couple guys that are deep into it. Before I
> moved to non-relational datastores I was deeply entrenched in a purist
> relational model. Purist beyond even most DBA's because I was taught
> from practical application of theory. Things like "null != null" and
> "null values are a hack around Codd's rules" are relatively benign but
> points that I find few people have completely groked in the real
> world. I was lucky to have these two guys teach me from the "this is
> proper and this is why we ignore that" point of view.
>
> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo<go...@stevenringo.com>  wrote:
>> Hi all,
>>
>> (I originally posted this on the mobile-couchbase list, but was recommended
>> to post it here instead, since this is more of a general couchdb question)
>>
>> I am struggling with something that I know would be quite trivial to do in
>> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
>> of experience with SQL is clouding my judgement and so I am trying rather to
>> break away from the SQL/relational mindset and understand how this would be
>> better designed in couch.
>>
>
> This is good. The first thing to realize in learning a new tool is
> that its not the old tool. New tools behave differently and
> differences should be embraced rather than rejected outright.
>
>> I am developing an iOS art gallery app, and wish to display information
>> about artworks, artists and galleries.
>>
>> The menu screen requires a list of all artworks together with the gallery
>> they are currently housed at and the artist that created the artwork:
>>
>>   * artwork title
>>   * artwork medium
>>   * artwork dimensions
>>   * artist full_name
>>   * gallery name,
>>   * gallery city
>>
>> for each artwork, respectively.
>>
>> Imagine the following objects in my database (I am not including all the
>> fields for sake of brevity. i.e. only the ones that are important).
>>
>> |     Artist
>>     ------
>>     _id
>>     full_name
>>     birth_year
>>     birth_place
>>
>>     Gallery
>>     -------
>>     _id
>>     name
>>     city
>>
>>     Artwork
>>     -------
>>     _id
>>     title
>>     medium
>>     dimensions
>>     artist_id_ (foreign key)
>>     gallery_id_ (foreign key)
>> |
>>
>> In SQL it would be as easy as one query with two joins:
>>
>> |SELECT
>>     artwork.title,
>>     artwork.medium,
>>     artwork.dimensions,
>>     artist.full_name,
>>     gallery.name,
>>     gallery.city
>> FROM
>>     artwork INNER JOIN artist
>>     ON artwork.artist_id_ = artist._id INNER JOIN gallery
>>     ON artwork.gallery_id_ = gallery._id
>> |
>>
>
> The fact that you took the time to specify INNER JOIN is a good tip
> you have a thorough understanding of SQL.
>
>>  From this I would be able to get the data I need in one query (i.e. one call
>> to the database).
>>
>> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>     Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>     David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
>> |
>>
>> Using couchdb I am storing the data in a very similar way to the way I would
>> in a relational database, represented as something along the lines of:
>>
>
> Roughly speaking, this is your first mistake.
>
>> |{
>>    "type"       : "Artwork",
>>    "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>    "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>    "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>    "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>    "title"      : "Guernica",
>>    "medium"     : "Oil on canvas",
>>    "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>> }
>>
>> {
>>    "type"       : "Artist",
>>    "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>    "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>    "full_name"  : "Pablo Picasso",
>>    "birth_year" : "1881"
>> }
>>
>> {
>>    "type"       : "Gallery",
>>    "_id"        : "d665afaedde12b44c5159cf1782d44af",
>>    "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>>    "name"       : "Museo Reina Sofia"
>> }
>> |
>>
>> To get the same result in couch, I would have to do one query to get all the
>> artworks, e.g.
>>
>> |function(doc) {
>>   if(doc.type&&    doc.type == 'Artist') {
>>     emit(doc._id,doc);
>>   }
>> }
>> |
>>
>> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
>> returned per |Artwork| respectively to fill in the missing information. This
>> is usually known as the |n+1| queries problem (in this case |2n+1|).
>>
>> I have always worked on the assumption that the database should be leveraged
>> to do as much of the heavy lifting as possible with as few network hits as
>> possible (i.e. SQL is always faster than client manipulation). With a list
>> of 120 artworks, this  means 241 HTTP GETs to the database.
>>
>
> This is your second mistake. And more to the point, this is the first
> mistake I had to overcome myself when switching from relational to
> non-relational data stores. Even more specifically, my stumbling block
> was exactly the same as yours. The assumption that the number of
> requests/queries is to be minimized.
>
> I was taught quite specifically to minimize the number of queries per
> page load. Current ORM layers are still abhorrent to me because of the
> ease in which they translate directly to your worst case, each row in
> the list ends up being three queries per row. Specifically, people
> must understand both the relational model and their ORM layer to avoid
> it from doing the Wrong Thing&trade;.
>
> The conceptual realization that we need to make here is that SQL isn't
> magical. When we issue a query, it goes through the planner and ends
> up doing (seriously advanced) optimizations over the indexes and data
> retrieval methods it has available. The "minimize number of queries"
> mantra is a bit overly broad here. In practical terms it has the
> effect of making sure that we're making the best use of indexes and
> look up patterns. As an analogy, its akin to writing better C because
> we might have some intuition on how the assembler translation might be
> generated.
>
>> I am aware of "linked documents" in views
>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>> however this seems to only work with one related row.
>>
>> I can't really see how "view collation"
>> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
>> view collation the correlated information ends up being on different rows,
>> and I think that would be better suited for where say a single Artists or
>> Gallery is displayed with associated Artworks. My case is the opposite,
>> whereby Artworks plus corresponding information from Artists and Galleries
>> is required.
>>
>
> These are definitely tools in the arsenal, but don't really speak to
> the core issue. I'll come back to this in a bit.
>
>> I am also aware that one could embed each gallery or artist in each artwork
>> or vice versa, but that seems to be a lot of duplication, and would make it
>> difficult to change artist or gallery information if required. Maybe this is
>> how it should be done?
>>
>
> As Sean pointed out, the second thing to overcome is data duplication.
> Storage is cheaper than CPU and network costs. If you don't expect
> data to change, you should try and exploit that.
>
>> For this small app, this is not going to be a performance issue. I am trying
>> rather to break away from a SQL/relational mindset and understand how this
>> would be better designed in a database like couch.
>>
>
> Your honest questions are quite refreshing because they remind me of
> my own initial exploration. There's lots of people talking but I've
> never found a really concise description of the difference between the
> approaches. There's lots of anecdotal blog posts about "scaling" and
> the such, but nothing I've seen really addresses the core issues.
>
>> Thanks,
>>
>> Steve
>>
>
> So, the bottom line is that the difference between relational and
> non-relational is two things. The first issue that people encounter
> when moving from relational to non is that the query planner is gone.
> This seems benign at fist, but affects a huge number of moving pieces
> in anything but non-trivial applications. If the query planner is
> gone, then all queries are less than optimal. Think as a DBA if
> *every* query was a full table scan. Most DBA's I know would scream at
> the thought. And also be highly confused why anyone would give that
> up.
>
> There are two answers here. One is about "horizontal scalability".
> Basically, SQL query planner optimizations don't generally work well
> when there's a network connection involved. Note that I'm not saying
> that SQL won't work in a clustered environment. There are companies
> that have proven otherwise. The issue is that the difference isn't
> clear to most people. Sure you can make SQL work on multiple machines,
> but the tradeoffs between scaling SQL horizontally vs vertically are
> complex and generally not well understood.
>
> The second issue is that non-relational stores have less information
> to do query optimizations. If you're in the crowd that says "I don't
> need scaling, I need to not be constrained by a schema" you're in this
> group. When you don't define a schema you're not allowing for lots of
> important information to be fed into some really awesome algorithms
> for data retrieval.
>
> There's a rough analogy here between statically and dynamically typed
> programming languages. Anyone that says one is better than the other
> is someone you shouldn't listen to. There are extremes where both are
> obviously better, but in the middle it gets into a complicated mess
> where its more important on the observer's point of view than any sort
> of theoretical points. If you have a well defined data usage, then
> static might be best. A rule of thumb is, if you can't print your data
> model on paper and have it be relevant for more than a couple months,
> then you're probably forcing things a bit.
>
> Granted, there are some obvious optimizations that non-relational
> stores could make by introspecting data, but its quite important to
> note the difference here. Relational stores regardless of scaling have
> the enormous benefit of having the data model defined up front.
>
> The best way I've learned to express this is that relational stores
> require you to invest in defining how you get data into the store.
> Non-relational means you have to invest in how to get data out.
>
> So, the end of the day decision is if the traditional relational
> stores provide you the ability to meet your customer's demands. For a
> lot of people over the last thirty or more years, this is true. It
> might be less than the optimal, but it generally works. Some people
> have recently realized that some of the assumptions in the relational
> model are artificially constraining. NOSQL is basically a response to
> this. A sudden realization of "If I ignore this one thing, my problem
> is a  lot easier." This is the issue that drives people. "What
> constraint am I willing to give up?" The single piece of advice I
> would give any relational-to-non convert is, "Analyze you're
> requirements and ask any new data store what they give up in return
> for their promise."
>
> CouchDB gives up isolation and consistency (in the traditional sense
> of transactions), for distribution through replication. We prioritize
> being consistent over being fast. And we use a funky programming
> language to make it happen.
>
> Other data stores have different reasons for existing. Redis, which
> I'm quite fond of, is roughly our opposite in that its geared towards
> RAM and being fast. There are problems I would solve with either, but
> no single problem for which I would use both interchangeably.

Re: Database design question

Posted by Simon Metson <si...@googlemail.com>.
Excellent post Paul!



On 28 Dec 2011, at 07:38, Paul Davis <pa...@gmail.com> wrote:

> I'll prefix this with the fact that Sean does an excellent point of
> responding your core concerns. I've had a bit of this brewing as of
> late for a various number of reasons. But your email provokes a bit of
> a response so I think this is as good of a place as any to write down
> some thoughts.
> 
> Also, to preface, my nascent background was to be taught SQL and the
> relational model by a couple guys that are deep into it. Before I
> moved to non-relational datastores I was deeply entrenched in a purist
> relational model. Purist beyond even most DBA's because I was taught
> from practical application of theory. Things like "null != null" and
> "null values are a hack around Codd's rules" are relatively benign but
> points that I find few people have completely groked in the real
> world. I was lucky to have these two guys teach me from the "this is
> proper and this is why we ignore that" point of view.
> 
> On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo <go...@stevenringo.com> wrote:
>> Hi all,
>> 
>> (I originally posted this on the mobile-couchbase list, but was recommended
>> to post it here instead, since this is more of a general couchdb question)
>> 
>> I am struggling with something that I know would be quite trivial to do in
>> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
>> of experience with SQL is clouding my judgement and so I am trying rather to
>> break away from the SQL/relational mindset and understand how this would be
>> better designed in couch.
>> 
> 
> This is good. The first thing to realize in learning a new tool is
> that its not the old tool. New tools behave differently and
> differences should be embraced rather than rejected outright.
> 
>> I am developing an iOS art gallery app, and wish to display information
>> about artworks, artists and galleries.
>> 
>> The menu screen requires a list of all artworks together with the gallery
>> they are currently housed at and the artist that created the artwork:
>> 
>>  * artwork title
>>  * artwork medium
>>  * artwork dimensions
>>  * artist full_name
>>  * gallery name,
>>  * gallery city
>> 
>> for each artwork, respectively.
>> 
>> Imagine the following objects in my database (I am not including all the
>> fields for sake of brevity. i.e. only the ones that are important).
>> 
>> |     Artist
>>    ------
>>    _id
>>    full_name
>>    birth_year
>>    birth_place
>> 
>>    Gallery
>>    -------
>>    _id
>>    name
>>    city
>> 
>>    Artwork
>>    -------
>>    _id
>>    title
>>    medium
>>    dimensions
>>    artist_id_ (foreign key)
>>    gallery_id_ (foreign key)
>> |
>> 
>> In SQL it would be as easy as one query with two joins:
>> 
>> |SELECT
>>    artwork.title,
>>    artwork.medium,
>>    artwork.dimensions,
>>    artist.full_name,
>>    gallery.name,
>>    gallery.city
>> FROM
>>    artwork INNER JOIN artist
>>    ON artwork.artist_id_ = artist._id INNER JOIN gallery
>>    ON artwork.gallery_id_ = gallery._id
>> |
>> 
> 
> The fact that you took the time to specify INNER JOIN is a good tip
> you have a thorough understanding of SQL.
> 
>> From this I would be able to get the data I need in one query (i.e. one call
>> to the database).
>> 
>> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>>    Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>>    David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
>> |
>> 
>> Using couchdb I am storing the data in a very similar way to the way I would
>> in a relational database, represented as something along the lines of:
>> 
> 
> Roughly speaking, this is your first mistake.
> 
>> |{
>>   "type"       : "Artwork",
>>   "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>>   "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>>   "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>>   "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>>   "title"      : "Guernica",
>>   "medium"     : "Oil on canvas",
>>   "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>> }
>> 
>> {
>>   "type"       : "Artist",
>>   "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>>   "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>>   "full_name"  : "Pablo Picasso",
>>   "birth_year" : "1881"
>> }
>> 
>> {
>>   "type"       : "Gallery",
>>   "_id"        : "d665afaedde12b44c5159cf1782d44af",
>>   "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>>   "name"       : "Museo Reina Sofia"
>> }
>> |
>> 
>> To get the same result in couch, I would have to do one query to get all the
>> artworks, e.g.
>> 
>> |function(doc) {
>>  if(doc.type&&  doc.type == 'Artist') {
>>    emit(doc._id,doc);
>>  }
>> }
>> |
>> 
>> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
>> returned per |Artwork| respectively to fill in the missing information. This
>> is usually known as the |n+1| queries problem (in this case |2n+1|).
>> 
>> I have always worked on the assumption that the database should be leveraged
>> to do as much of the heavy lifting as possible with as few network hits as
>> possible (i.e. SQL is always faster than client manipulation). With a list
>> of 120 artworks, this  means 241 HTTP GETs to the database.
>> 
> 
> This is your second mistake. And more to the point, this is the first
> mistake I had to overcome myself when switching from relational to
> non-relational data stores. Even more specifically, my stumbling block
> was exactly the same as yours. The assumption that the number of
> requests/queries is to be minimized.
> 
> I was taught quite specifically to minimize the number of queries per
> page load. Current ORM layers are still abhorrent to me because of the
> ease in which they translate directly to your worst case, each row in
> the list ends up being three queries per row. Specifically, people
> must understand both the relational model and their ORM layer to avoid
> it from doing the Wrong Thing &trade;.
> 
> The conceptual realization that we need to make here is that SQL isn't
> magical. When we issue a query, it goes through the planner and ends
> up doing (seriously advanced) optimizations over the indexes and data
> retrieval methods it has available. The "minimize number of queries"
> mantra is a bit overly broad here. In practical terms it has the
> effect of making sure that we're making the best use of indexes and
> look up patterns. As an analogy, its akin to writing better C because
> we might have some intuition on how the assembler translation might be
> generated.
> 
>> I am aware of "linked documents" in views
>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>> however this seems to only work with one related row.
>> 
>> I can't really see how "view collation"
>> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
>> view collation the correlated information ends up being on different rows,
>> and I think that would be better suited for where say a single Artists or
>> Gallery is displayed with associated Artworks. My case is the opposite,
>> whereby Artworks plus corresponding information from Artists and Galleries
>> is required.
>> 
> 
> These are definitely tools in the arsenal, but don't really speak to
> the core issue. I'll come back to this in a bit.
> 
>> I am also aware that one could embed each gallery or artist in each artwork
>> or vice versa, but that seems to be a lot of duplication, and would make it
>> difficult to change artist or gallery information if required. Maybe this is
>> how it should be done?
>> 
> 
> As Sean pointed out, the second thing to overcome is data duplication.
> Storage is cheaper than CPU and network costs. If you don't expect
> data to change, you should try and exploit that.
> 
>> For this small app, this is not going to be a performance issue. I am trying
>> rather to break away from a SQL/relational mindset and understand how this
>> would be better designed in a database like couch.
>> 
> 
> Your honest questions are quite refreshing because they remind me of
> my own initial exploration. There's lots of people talking but I've
> never found a really concise description of the difference between the
> approaches. There's lots of anecdotal blog posts about "scaling" and
> the such, but nothing I've seen really addresses the core issues.
> 
>> Thanks,
>> 
>> Steve
>> 
> 
> So, the bottom line is that the difference between relational and
> non-relational is two things. The first issue that people encounter
> when moving from relational to non is that the query planner is gone.
> This seems benign at fist, but affects a huge number of moving pieces
> in anything but non-trivial applications. If the query planner is
> gone, then all queries are less than optimal. Think as a DBA if
> *every* query was a full table scan. Most DBA's I know would scream at
> the thought. And also be highly confused why anyone would give that
> up.
> 
> There are two answers here. One is about "horizontal scalability".
> Basically, SQL query planner optimizations don't generally work well
> when there's a network connection involved. Note that I'm not saying
> that SQL won't work in a clustered environment. There are companies
> that have proven otherwise. The issue is that the difference isn't
> clear to most people. Sure you can make SQL work on multiple machines,
> but the tradeoffs between scaling SQL horizontally vs vertically are
> complex and generally not well understood.
> 
> The second issue is that non-relational stores have less information
> to do query optimizations. If you're in the crowd that says "I don't
> need scaling, I need to not be constrained by a schema" you're in this
> group. When you don't define a schema you're not allowing for lots of
> important information to be fed into some really awesome algorithms
> for data retrieval.
> 
> There's a rough analogy here between statically and dynamically typed
> programming languages. Anyone that says one is better than the other
> is someone you shouldn't listen to. There are extremes where both are
> obviously better, but in the middle it gets into a complicated mess
> where its more important on the observer's point of view than any sort
> of theoretical points. If you have a well defined data usage, then
> static might be best. A rule of thumb is, if you can't print your data
> model on paper and have it be relevant for more than a couple months,
> then you're probably forcing things a bit.
> 
> Granted, there are some obvious optimizations that non-relational
> stores could make by introspecting data, but its quite important to
> note the difference here. Relational stores regardless of scaling have
> the enormous benefit of having the data model defined up front.
> 
> The best way I've learned to express this is that relational stores
> require you to invest in defining how you get data into the store.
> Non-relational means you have to invest in how to get data out.
> 
> So, the end of the day decision is if the traditional relational
> stores provide you the ability to meet your customer's demands. For a
> lot of people over the last thirty or more years, this is true. It
> might be less than the optimal, but it generally works. Some people
> have recently realized that some of the assumptions in the relational
> model are artificially constraining. NOSQL is basically a response to
> this. A sudden realization of "If I ignore this one thing, my problem
> is a  lot easier." This is the issue that drives people. "What
> constraint am I willing to give up?" The single piece of advice I
> would give any relational-to-non convert is, "Analyze you're
> requirements and ask any new data store what they give up in return
> for their promise."
> 
> CouchDB gives up isolation and consistency (in the traditional sense
> of transactions), for distribution through replication. We prioritize
> being consistent over being fast. And we use a funky programming
> language to make it happen.
> 
> Other data stores have different reasons for existing. Redis, which
> I'm quite fond of, is roughly our opposite in that its geared towards
> RAM and being fast. There are problems I would solve with either, but
> no single problem for which I would use both interchangeably.

Re: Database design question

Posted by Paul Davis <pa...@gmail.com>.
I'll prefix this with the fact that Sean does an excellent point of
responding your core concerns. I've had a bit of this brewing as of
late for a various number of reasons. But your email provokes a bit of
a response so I think this is as good of a place as any to write down
some thoughts.

Also, to preface, my nascent background was to be taught SQL and the
relational model by a couple guys that are deep into it. Before I
moved to non-relational datastores I was deeply entrenched in a purist
relational model. Purist beyond even most DBA's because I was taught
from practical application of theory. Things like "null != null" and
"null values are a hack around Codd's rules" are relatively benign but
points that I find few people have completely groked in the real
world. I was lucky to have these two guys teach me from the "this is
proper and this is why we ignore that" point of view.

On Tue, Dec 27, 2011 at 3:31 PM, Steven Ringo <go...@stevenringo.com> wrote:
> Hi all,
>
> (I originally posted this on the mobile-couchbase list, but was recommended
> to post it here instead, since this is more of a general couchdb question)
>
> I am struggling with something that I know would be quite trivial to do in
> SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20 years
> of experience with SQL is clouding my judgement and so I am trying rather to
> break away from the SQL/relational mindset and understand how this would be
> better designed in couch.
>

This is good. The first thing to realize in learning a new tool is
that its not the old tool. New tools behave differently and
differences should be embraced rather than rejected outright.

> I am developing an iOS art gallery app, and wish to display information
> about artworks, artists and galleries.
>
> The menu screen requires a list of all artworks together with the gallery
> they are currently housed at and the artist that created the artwork:
>
>  * artwork title
>  * artwork medium
>  * artwork dimensions
>  * artist full_name
>  * gallery name,
>  * gallery city
>
> for each artwork, respectively.
>
> Imagine the following objects in my database (I am not including all the
> fields for sake of brevity. i.e. only the ones that are important).
>
> |     Artist
>    ------
>    _id
>    full_name
>    birth_year
>    birth_place
>
>    Gallery
>    -------
>    _id
>    name
>    city
>
>    Artwork
>    -------
>    _id
>    title
>    medium
>    dimensions
>    artist_id_ (foreign key)
>    gallery_id_ (foreign key)
> |
>
> In SQL it would be as easy as one query with two joins:
>
> |SELECT
>    artwork.title,
>    artwork.medium,
>    artwork.dimensions,
>    artist.full_name,
>    gallery.name,
>    gallery.city
> FROM
>    artwork INNER JOIN artist
>    ON artwork.artist_id_ = artist._id INNER JOIN gallery
>    ON artwork.gallery_id_ = gallery._id
> |
>

The fact that you took the time to specify INNER JOIN is a good tip
you have a thorough understanding of SQL.

> From this I would be able to get the data I need in one query (i.e. one call
> to the database).
>
> |     Guernica  | Pablo Picasso     | Museo Reina Sofia, Madrid, Spain
>    Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>    David     | Michelangelo      | Uffizi Gallery, Florence, Italy.
> |
>
> Using couchdb I am storing the data in a very similar way to the way I would
> in a relational database, represented as something along the lines of:
>

Roughly speaking, this is your first mistake.

> |{
>   "type"       : "Artwork",
>   "_id"        : "d665afaedde12b44c5159cf1782c5c1e",
>   "_rev"       : "1-446201af8a1ddfa3759164a01be899de",
>   "artist_id"  : "d665afaedde12b44c5159cf1782c2d0b",
>   "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>   "title"      : "Guernica",
>   "medium"     : "Oil on canvas",
>   "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
> }
>
> {
>   "type"       : "Artist",
>   "_id"        : "d665afaedde12b44c5159cf1782c2d0b",
>   "_rev"       : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>   "full_name"  : "Pablo Picasso",
>   "birth_year" : "1881"
> }
>
> {
>   "type"       : "Gallery",
>   "_id"        : "d665afaedde12b44c5159cf1782d44af",
>   "_rev"       : "1-27a139e40a7f88d6465812eec3c73a0f",
>   "name"       : "Museo Reina Sofia"
> }
> |
>
> To get the same result in couch, I would have to do one query to get all the
> artworks, e.g.
>
> |function(doc) {
>  if(doc.type&&  doc.type == 'Artist') {
>    emit(doc._id,doc);
>  }
> }
> |
>
> then I would need to do a |GET| for the _id of each |Gallery| and |Artist|
> returned per |Artwork| respectively to fill in the missing information. This
> is usually known as the |n+1| queries problem (in this case |2n+1|).
>
> I have always worked on the assumption that the database should be leveraged
> to do as much of the heavy lifting as possible with as few network hits as
> possible (i.e. SQL is always faster than client manipulation). With a list
> of 120 artworks, this  means 241 HTTP GETs to the database.
>

This is your second mistake. And more to the point, this is the first
mistake I had to overcome myself when switching from relational to
non-relational data stores. Even more specifically, my stumbling block
was exactly the same as yours. The assumption that the number of
requests/queries is to be minimized.

I was taught quite specifically to minimize the number of queries per
page load. Current ORM layers are still abhorrent to me because of the
ease in which they translate directly to your worst case, each row in
the list ends up being three queries per row. Specifically, people
must understand both the relational model and their ORM layer to avoid
it from doing the Wrong Thing &trade;.

The conceptual realization that we need to make here is that SQL isn't
magical. When we issue a query, it goes through the planner and ends
up doing (seriously advanced) optimizations over the indexes and data
retrieval methods it has available. The "minimize number of queries"
mantra is a bit overly broad here. In practical terms it has the
effect of making sure that we're making the best use of indexes and
look up patterns. As an analogy, its akin to writing better C because
we might have some intuition on how the assembler translation might be
generated.

> I am aware of "linked documents" in views
> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
> however this seems to only work with one related row.
>
> I can't really see how "view collation"
> (http://wiki.apache.org/couchdb/View_collation) would work here either. With
> view collation the correlated information ends up being on different rows,
> and I think that would be better suited for where say a single Artists or
> Gallery is displayed with associated Artworks. My case is the opposite,
> whereby Artworks plus corresponding information from Artists and Galleries
> is required.
>

These are definitely tools in the arsenal, but don't really speak to
the core issue. I'll come back to this in a bit.

> I am also aware that one could embed each gallery or artist in each artwork
> or vice versa, but that seems to be a lot of duplication, and would make it
> difficult to change artist or gallery information if required. Maybe this is
> how it should be done?
>

As Sean pointed out, the second thing to overcome is data duplication.
Storage is cheaper than CPU and network costs. If you don't expect
data to change, you should try and exploit that.

> For this small app, this is not going to be a performance issue. I am trying
> rather to break away from a SQL/relational mindset and understand how this
> would be better designed in a database like couch.
>

Your honest questions are quite refreshing because they remind me of
my own initial exploration. There's lots of people talking but I've
never found a really concise description of the difference between the
approaches. There's lots of anecdotal blog posts about "scaling" and
the such, but nothing I've seen really addresses the core issues.

> Thanks,
>
> Steve
>

So, the bottom line is that the difference between relational and
non-relational is two things. The first issue that people encounter
when moving from relational to non is that the query planner is gone.
This seems benign at fist, but affects a huge number of moving pieces
in anything but non-trivial applications. If the query planner is
gone, then all queries are less than optimal. Think as a DBA if
*every* query was a full table scan. Most DBA's I know would scream at
the thought. And also be highly confused why anyone would give that
up.

There are two answers here. One is about "horizontal scalability".
Basically, SQL query planner optimizations don't generally work well
when there's a network connection involved. Note that I'm not saying
that SQL won't work in a clustered environment. There are companies
that have proven otherwise. The issue is that the difference isn't
clear to most people. Sure you can make SQL work on multiple machines,
but the tradeoffs between scaling SQL horizontally vs vertically are
complex and generally not well understood.

The second issue is that non-relational stores have less information
to do query optimizations. If you're in the crowd that says "I don't
need scaling, I need to not be constrained by a schema" you're in this
group. When you don't define a schema you're not allowing for lots of
important information to be fed into some really awesome algorithms
for data retrieval.

There's a rough analogy here between statically and dynamically typed
programming languages. Anyone that says one is better than the other
is someone you shouldn't listen to. There are extremes where both are
obviously better, but in the middle it gets into a complicated mess
where its more important on the observer's point of view than any sort
of theoretical points. If you have a well defined data usage, then
static might be best. A rule of thumb is, if you can't print your data
model on paper and have it be relevant for more than a couple months,
then you're probably forcing things a bit.

Granted, there are some obvious optimizations that non-relational
stores could make by introspecting data, but its quite important to
note the difference here. Relational stores regardless of scaling have
the enormous benefit of having the data model defined up front.

The best way I've learned to express this is that relational stores
require you to invest in defining how you get data into the store.
Non-relational means you have to invest in how to get data out.

So, the end of the day decision is if the traditional relational
stores provide you the ability to meet your customer's demands. For a
lot of people over the last thirty or more years, this is true. It
might be less than the optimal, but it generally works. Some people
have recently realized that some of the assumptions in the relational
model are artificially constraining. NOSQL is basically a response to
this. A sudden realization of "If I ignore this one thing, my problem
is a  lot easier." This is the issue that drives people. "What
constraint am I willing to give up?" The single piece of advice I
would give any relational-to-non convert is, "Analyze you're
requirements and ask any new data store what they give up in return
for their promise."

CouchDB gives up isolation and consistency (in the traditional sense
of transactions), for distribution through replication. We prioritize
being consistent over being fast. And we use a funky programming
language to make it happen.

Other data stores have different reasons for existing. Redis, which
I'm quite fond of, is roughly our opposite in that its geared towards
RAM and being fast. There are problems I would solve with either, but
no single problem for which I would use both interchangeably.

Re: Database design question

Posted by Steven Ringo <go...@stevenringo.com>.
Wow! A huge, HUGE thank you for explaining this! The a-ha moment has 
come and it all makes sense now.No book or article that I have read so 
far hasmanaged to explain this as well.

In other words, the idea is to denormalize the data that makes the most 
sense to be lumped together even if it ends up being duplicated. Usually 
this is guided by what might be most commonly retrieved together, such 
as the artist/artwork/gallery combo in my situation. Once into the 
detail for say an artwork or an artist, then use the detailed version of 
the same data.

Coming from a SQL and DRY-philosophy background, the idea of 
denormalizing seemed counterintuitive to me. Now that I think about it, 
it makes sense. Storage and computing costs are cheap. And in most cases 
(certainly for artwork/artist), its fine to save the artist data twice 
when editing/adding a new artwork, as this will never get changed. Only 
if there is a spelling mistake or something in the artist for example 
will that be the case. Sure, things may get out of sync, but they're not 
critical, and certainly not for an app like this. Worse things can happen!

Going to go implement now. Thanks again!

Steve


Sean Copenhaver wrote:
> Hmm.. I'll take a stab at giving you some guidelines and suggestions. I hope it helps but I'll warn you I haven't done anything with CouchDB in awhile and I'm not testing my map/reduce and queries. Keep that in mind.
>
> Something that raised an eyebrow is that you are trying to model your document (known for being non-relational) data similar to your tabular relational model data.
>
> So something that you do with relational databases is normalize your data to try and store it only once. You then have to join the data back together piecemeal to get what you want. Document databases are usually naturally denormalized in comparison and you have shallow and few relationships. The documents can tend to represent the core ideas for your application and may match more to how you work with the data.
>
> Naturally the trade off is storage costs, update costs for the duplicated data, but could provide simpler and easy to work with data model.
>
> Another thing to keep in mind is that defining a map function (as in the map part of a view) is defining the index you want to do range queries on. A map function is not a query, but a clustered index on the key/value pair which point back to the original document (you could say they act as a non-clustered index in that manner). From your e-mail I wasn't sure if that was understood.
>
> With that said, I would suggest storing the gallery information inside the artwork document. This duplicates the data but I'm guessing you would rarely modify the galleries information and more likely add new galleries or maybe switch out the gallery info for an artwork document. You could do something similar with the artist information considering that this might not change very often as well. Also keep in mind that you don't have to use GUIDs for your keys. If the artist's full name or something similar provides the uniqueness you could use that.
>
> Perhaps the artist and gallery documents stay the same but the artwork document looks like:
>
> {
>      "type": "artwork",
>      "title": "Guernica",
>      "medium": "Oil on canvas",
>      "dimensions": "349 cm x 776 cm (137.4 in x 305.5 in)",
>      "artist": "Pablo Picasso",
>      "birth_year": 1881,
>      "artist_id": "<some guid>",
>      "gallery_id": "<some guid>",
>      "gallery": "Museo Reina Sofia"
> }
>
> That's just a possibility. I wouldn't store all the info in the artwork document but just the major bits you know are less likely to change. Storing a document like this means you can retrieve all the relevant pieces of information in one query against a map function defined as:
>
> function (doc) {
>      if (doc.type === 'artwork') {
>          emit(doc._id, null);
>      }
> }
>
> Notice that I left out the document as a value for the emit() call. This is because the document's id is always stored with the key/value pair. Emitting the document for the value is a trade off of retrieve speed versus storage/update costs.
>
> Anyway you can then query this and ask CouchDB to retrieve the documents. Something like this with paging:
>
>      http://server/db/_design/art/_view/artwork?start_key="<guid>"&limit=101&include_docs=true
>
> You could even not have a gallery document if you didn't need any extra info with them. You could then do a map/reduce on the artwork documents to get the list of galleries your data represents.
>
> "galleries" : {
>      "map": function (doc) {
>          if (doc.type === 'artwork') {
>              emit(doc.gallery, 1);
>          }
>      },
>
>      "reduce": "_sum"
> }
>
> This example query would retrieve the galleries and the number of artworks for each:
>
>      http://server/db/_design/art/_view/galleries?group=true
>
> Assuming that the name is unique enough to specify the gallery that is. If you did need to add extra data you could give the gallery document an id of it's name to link the two.
>
> As I said before you want shallow and few relationships but you can retrieve them together if you are trying to go the other direction of the relationship. In this situation if you wanted to retrieve an artist with all their artworks.
>
> A map function such as:
>
> function (doc) {
>      if (doc.type === 'artist') {
>          emit([doc._id, 0], null);
>      }
>      else if (doc.type === 'artwork') {
>          emit([doc.artist_id, 1], null);
>      }
> }
>
> would allow queries to retrieve an artist and all their artworks. You just have to know that the 0 is the artist and the 1 is the artwork in the results. I use those numbers to simply order the artist first. A query on the view with the artist's guid would allow you to retrieve all the artworks as well.
>
>      http://server/db/_design/art/_view/artist_and_works?key="[<guid>]"&include_docs=true
>
> Ok, dang. I may have gone off on a tangent but I hope this helps with your CouchDB understanding and data modeling. Here are some links as well that may help.
>
> Relationships in CouchDB:
> https://wiki.apache.org/couchdb/EntityRelationship
>
> Views intro:
> https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
>
> View query options:
> https://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
>
> How your keys sort the data in a view:
> https://wiki.apache.org/couchdb/View_collation
>
> Remember you can retrieve multiple ids at once:
> https://wiki.apache.org/couchdb/HTTP_Bulk_Document_API#Fetch_Multiple_Documents_With_a_Single_Request
>
> The questions many don't think to ask, incomplete but a couple with good info:
> https://wiki.apache.org/couchdb/FUQ
>
> Also I swear MongoDB's doc's had a decent page on document database data modeling but I can't seem to dig it up right now.
>
> If there are any mistakes I hope others with chime in.
>
> --
> Sean Copenhaver
>
>
>
> On Tuesday, December 27, 2011 at 4:31 PM, Steven Ringo wrote:
>
>> Hi all,
>>
>> (I originally posted this on the mobile-couchbase list, but was
>> recommended to post it here instead, since this is more of a general
>> couchdb question)
>>
>> I am struggling with something that I know would be quite trivial to do
>> in SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20
>> years of experience with SQL is clouding my judgement and so I am trying
>> rather to break away from the SQL/relational mindset and understand how
>> this would be better designed in couch.
>>
>> I am developing an iOS art gallery app, and wish to display information
>> about artworks, artists and galleries.
>>
>> The menu screen requires a list of all artworks together with the
>> gallery they are currently housed at and the artist that created the
>> artwork:
>>
>> * artwork title
>> * artwork medium
>> * artwork dimensions
>> * artist full_name
>> * gallery name,
>> * gallery city
>>
>> for each artwork, respectively.
>>
>> Imagine the following objects in my database (I am not including all the
>> fields for sake of brevity. i.e. only the ones that are important).
>>
>> | Artist
>> ------
>> _id
>> full_name
>> birth_year
>> birth_place
>>
>> Gallery
>> -------
>> _id
>> name
>> city
>>
>> Artwork
>> -------
>> _id
>> title
>> medium
>> dimensions
>> artist_id_ (foreign key)
>> gallery_id_ (foreign key)
>> |
>>
>> In SQL it would be as easy as one query with two joins:
>>
>> |SELECT
>> artwork.title,
>> artwork.medium,
>> artwork.dimensions,
>> artist.full_name,
>> gallery.name (http://gallery.name),
>> gallery.city
>> FROM
>> artwork INNER JOIN artist
>> ON artwork.artist_id_ = artist._id INNER JOIN gallery
>> ON artwork.gallery_id_ = gallery._id
>> |
>>
>>  From this I would be able to get the data I need in one query (i.e. one
>> call to the database).
>>
>> | Guernica | Pablo Picasso | Museo Reina Sofia, Madrid, Spain
>> Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
>> David | Michelangelo | Uffizi Gallery, Florence, Italy.
>> |
>>
>> Using couchdb I am storing the data in a very similar way to the way I
>> would in a relational database, represented as something along the lines of:
>>
>> |{
>> "type" : "Artwork",
>> "_id" : "d665afaedde12b44c5159cf1782c5c1e",
>> "_rev" : "1-446201af8a1ddfa3759164a01be899de",
>> "artist_id" : "d665afaedde12b44c5159cf1782c2d0b",
>> "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
>> "title" : "Guernica",
>> "medium" : "Oil on canvas",
>> "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
>> }
>>
>> {
>> "type" : "Artist",
>> "_id" : "d665afaedde12b44c5159cf1782c2d0b",
>> "_rev" : "1-8fa8a6dd50d9d9072b08db1a840128b1",
>> "full_name" : "Pablo Picasso",
>> "birth_year" : "1881"
>> }
>>
>> {
>> "type" : "Gallery",
>> "_id" : "d665afaedde12b44c5159cf1782d44af",
>> "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f",
>> "name" : "Museo Reina Sofia"
>> }
>> |
>>
>> To get the same result in couch, I would have to do one query to get all
>> the artworks, e.g.
>>
>> |function(doc) {
>> if(doc.type&&  doc.type == 'Artist') {
>> emit(doc._id,doc);
>> }
>> }
>> |
>>
>> then I would need to do a |GET| for the _id of each |Gallery| and
>> |Artist| returned per |Artwork| respectively to fill in the missing
>> information. This is usually known as the |n+1| queries problem (in this
>> case |2n+1|).
>>
>> I have always worked on the assumption that the database should be
>> leveraged to do as much of the heavy lifting as possible with as few
>> network hits as possible (i.e. SQL is always faster than client
>> manipulation). With a list of 120 artworks, this means 241 HTTP GETs to
>> the database.
>>
>> I am aware of "linked documents" in views
>> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),
>> however this seems to only work with one related row.
>>
>> I can't really see how "view collation"
>> (http://wiki.apache.org/couchdb/View_collation) would work here either.
>> With view collation the correlated information ends up being on
>> different rows, and I think that would be better suited for where say a
>> single Artists or Gallery is displayed with associated Artworks. My case
>> is the opposite, whereby Artworks plus corresponding information from
>> Artists and Galleries is required.
>>
>> I am also aware that one could embed each gallery or artist in each
>> artwork or vice versa, but that seems to be a lot of duplication, and
>> would make it difficult to change artist or gallery information if
>> required. Maybe this is how it should be done?
>>
>> For this small app, this is not going to be a performance issue. I am
>> trying rather to break away from a SQL/relational mindset and understand
>> how this would be better designed in a database like couch.
>>
>> Thanks,
>>
>> Steve
>
>

Re: Database design question

Posted by Sean Copenhaver <se...@gmail.com>.
Hmm.. I'll take a stab at giving you some guidelines and suggestions. I hope it helps but I'll warn you I haven't done anything with CouchDB in awhile and I'm not testing my map/reduce and queries. Keep that in mind.  

Something that raised an eyebrow is that you are trying to model your document (known for being non-relational) data similar to your tabular relational model data.

So something that you do with relational databases is normalize your data to try and store it only once. You then have to join the data back together piecemeal to get what you want. Document databases are usually naturally denormalized in comparison and you have shallow and few relationships. The documents can tend to represent the core ideas for your application and may match more to how you work with the data.

Naturally the trade off is storage costs, update costs for the duplicated data, but could provide simpler and easy to work with data model.

Another thing to keep in mind is that defining a map function (as in the map part of a view) is defining the index you want to do range queries on. A map function is not a query, but a clustered index on the key/value pair which point back to the original document (you could say they act as a non-clustered index in that manner). From your e-mail I wasn't sure if that was understood.

With that said, I would suggest storing the gallery information inside the artwork document. This duplicates the data but I'm guessing you would rarely modify the galleries information and more likely add new galleries or maybe switch out the gallery info for an artwork document. You could do something similar with the artist information considering that this might not change very often as well. Also keep in mind that you don't have to use GUIDs for your keys. If the artist's full name or something similar provides the uniqueness you could use that.

Perhaps the artist and gallery documents stay the same but the artwork document looks like:

{
    "type": "artwork",
    "title": "Guernica",
    "medium": "Oil on canvas",
    "dimensions": "349 cm x 776 cm (137.4 in x 305.5 in)",
    "artist": "Pablo Picasso",
    "birth_year": 1881,
    "artist_id": "<some guid>",
    "gallery_id": "<some guid>",
    "gallery": "Museo Reina Sofia"
}

That's just a possibility. I wouldn't store all the info in the artwork document but just the major bits you know are less likely to change. Storing a document like this means you can retrieve all the relevant pieces of information in one query against a map function defined as:

function (doc) {
    if (doc.type === 'artwork') {
        emit(doc._id, null);
    }
}

Notice that I left out the document as a value for the emit() call. This is because the document's id is always stored with the key/value pair. Emitting the document for the value is a trade off of retrieve speed versus storage/update costs.  

Anyway you can then query this and ask CouchDB to retrieve the documents. Something like this with paging:

    http://server/db/_design/art/_view/artwork?start_key="<guid>"&limit=101&include_docs=true

You could even not have a gallery document if you didn't need any extra info with them. You could then do a map/reduce on the artwork documents to get the list of galleries your data represents.

"galleries" : {
    "map": function (doc) {
        if (doc.type === 'artwork') {
            emit(doc.gallery, 1);
        }
    },

    "reduce": "_sum"
}

This example query would retrieve the galleries and the number of artworks for each:

    http://server/db/_design/art/_view/galleries?group=true

Assuming that the name is unique enough to specify the gallery that is. If you did need to add extra data you could give the gallery document an id of it's name to link the two.

As I said before you want shallow and few relationships but you can retrieve them together if you are trying to go the other direction of the relationship. In this situation if you wanted to retrieve an artist with all their artworks.

A map function such as:

function (doc) {
    if (doc.type === 'artist') {
        emit([doc._id, 0], null);
    }
    else if (doc.type === 'artwork') {
        emit([doc.artist_id, 1], null);
    }
}

would allow queries to retrieve an artist and all their artworks. You just have to know that the 0 is the artist and the 1 is the artwork in the results. I use those numbers to simply order the artist first. A query on the view with the artist's guid would allow you to retrieve all the artworks as well.

    http://server/db/_design/art/_view/artist_and_works?key="[<guid>]"&include_docs=true

Ok, dang. I may have gone off on a tangent but I hope this helps with your CouchDB understanding and data modeling. Here are some links as well that may help.

Relationships in CouchDB:
https://wiki.apache.org/couchdb/EntityRelationship

Views intro:
https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views

View query options:
https://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options

How your keys sort the data in a view:
https://wiki.apache.org/couchdb/View_collation

Remember you can retrieve multiple ids at once:
https://wiki.apache.org/couchdb/HTTP_Bulk_Document_API#Fetch_Multiple_Documents_With_a_Single_Request

The questions many don't think to ask, incomplete but a couple with good info:
https://wiki.apache.org/couchdb/FUQ

Also I swear MongoDB's doc's had a decent page on document database data modeling but I can't seem to dig it up right now.

If there are any mistakes I hope others with chime in.  

--  
Sean Copenhaver



On Tuesday, December 27, 2011 at 4:31 PM, Steven Ringo wrote:

> Hi all,
>  
> (I originally posted this on the mobile-couchbase list, but was  
> recommended to post it here instead, since this is more of a general  
> couchdb question)
>  
> I am struggling with something that I know would be quite trivial to do  
> in SQL, but seems to be a lot more complex to do in couchdb. Perhaps 20  
> years of experience with SQL is clouding my judgement and so I am trying  
> rather to break away from the SQL/relational mindset and understand how  
> this would be better designed in couch.
>  
> I am developing an iOS art gallery app, and wish to display information  
> about artworks, artists and galleries.
>  
> The menu screen requires a list of all artworks together with the  
> gallery they are currently housed at and the artist that created the  
> artwork:
>  
> * artwork title
> * artwork medium
> * artwork dimensions
> * artist full_name
> * gallery name,
> * gallery city
>  
> for each artwork, respectively.
>  
> Imagine the following objects in my database (I am not including all the  
> fields for sake of brevity. i.e. only the ones that are important).
>  
> | Artist
> ------
> _id
> full_name
> birth_year
> birth_place
>  
> Gallery
> -------
> _id
> name
> city
>  
> Artwork
> -------
> _id
> title
> medium
> dimensions
> artist_id_ (foreign key)
> gallery_id_ (foreign key)
> |
>  
> In SQL it would be as easy as one query with two joins:
>  
> |SELECT
> artwork.title,
> artwork.medium,
> artwork.dimensions,
> artist.full_name,
> gallery.name (http://gallery.name),
> gallery.city
> FROM
> artwork INNER JOIN artist
> ON artwork.artist_id_ = artist._id INNER JOIN gallery
> ON artwork.gallery_id_ = gallery._id
> |
>  
> From this I would be able to get the data I need in one query (i.e. one  
> call to the database).
>  
> | Guernica | Pablo Picasso | Museo Reina Sofia, Madrid, Spain
> Mona Lisa | Leonardo da Vinci | Musée du Louvre, Paris, France
> David | Michelangelo | Uffizi Gallery, Florence, Italy.
> |
>  
> Using couchdb I am storing the data in a very similar way to the way I  
> would in a relational database, represented as something along the lines of:
>  
> |{
> "type" : "Artwork",
> "_id" : "d665afaedde12b44c5159cf1782c5c1e",
> "_rev" : "1-446201af8a1ddfa3759164a01be899de",
> "artist_id" : "d665afaedde12b44c5159cf1782c2d0b",
> "gallery_id" : "d665afaedde12b44c5159cf1782d44af",
> "title" : "Guernica",
> "medium" : "Oil on canvas",
> "dimensions" : "349 cm × 776 cm (137.4 in × 305.5 in)"
> }
>  
> {
> "type" : "Artist",
> "_id" : "d665afaedde12b44c5159cf1782c2d0b",
> "_rev" : "1-8fa8a6dd50d9d9072b08db1a840128b1",
> "full_name" : "Pablo Picasso",
> "birth_year" : "1881"
> }
>  
> {
> "type" : "Gallery",
> "_id" : "d665afaedde12b44c5159cf1782d44af",
> "_rev" : "1-27a139e40a7f88d6465812eec3c73a0f",
> "name" : "Museo Reina Sofia"
> }
> |
>  
> To get the same result in couch, I would have to do one query to get all  
> the artworks, e.g.
>  
> |function(doc) {
> if(doc.type&& doc.type == 'Artist') {
> emit(doc._id,doc);
> }
> }
> |
>  
> then I would need to do a |GET| for the _id of each |Gallery| and  
> |Artist| returned per |Artwork| respectively to fill in the missing  
> information. This is usually known as the |n+1| queries problem (in this  
> case |2n+1|).
>  
> I have always worked on the assumption that the database should be  
> leveraged to do as much of the heavy lifting as possible with as few  
> network hits as possible (i.e. SQL is always faster than client  
> manipulation). With a list of 120 artworks, this means 241 HTTP GETs to  
> the database.
>  
> I am aware of "linked documents" in views  
> (http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Linked_documents),  
> however this seems to only work with one related row.
>  
> I can't really see how "view collation"  
> (http://wiki.apache.org/couchdb/View_collation) would work here either.  
> With view collation the correlated information ends up being on  
> different rows, and I think that would be better suited for where say a  
> single Artists or Gallery is displayed with associated Artworks. My case  
> is the opposite, whereby Artworks plus corresponding information from  
> Artists and Galleries is required.
>  
> I am also aware that one could embed each gallery or artist in each  
> artwork or vice versa, but that seems to be a lot of duplication, and  
> would make it difficult to change artist or gallery information if  
> required. Maybe this is how it should be done?
>  
> For this small app, this is not going to be a performance issue. I am  
> trying rather to break away from a SQL/relational mindset and understand  
> how this would be better designed in a database like couch.
>  
> Thanks,
>  
> Steve