Posted to dev@couchdb.apache.org by Suraj Kumar <su...@inmobi.com> on 2014/04/10 14:54:18 UTC

Modeling Relationships and providing Transactional Integrity

[warning: cross-posted]

Hi,

We're attempting to build a model of a large-scale, complex infrastructure.
That means every machine and its supporting machines report to a mothership.
Since our problem is truly one of high concurrency, choosing a solid
database to keep the state of this model was our focus in the early days. We
zeroed in on CouchDB, mainly because Erlang powers it and because that lets
us build the things we need that Couch doesn't provide. One of those things
is the notion of Relationships.

What do I mean by "Relationships" really? Some "types" of Entities have
attributes which may be related to some other "types" of Entities
in specific known ways (1:1, 1:*, *:1).

The "Type" becomes the hazy part for schemaless systems like CouchDB.
However, let us now talk in Couch primitives.

Let us set aside the question of how this could still result in
inconsistency in a live distributed database... and imagine there could
be 'design' documents that describe how some attributes of some "types" of
documents are related to some other attributes of some other "types" of
documents. Imagine this being used by a new 'Relationships' engine to
automatically validate and maintain the relational integrity of the
database. In Couch terminology, it is a way to automatically modify
certain keys of related documents whenever certain keys of a given
'type' of document change.

I'm now attempting to formally describe two of the basic primitive elements
of every practical schemaless database system, specifically CouchDB:

1. Documents of classifiable 'types' or 'sets'.
2. Attributes (*keys of the JSON hash*) (and a way to address attributes
using a generic, intuitive and a standard "*convention*")

I am of the belief that defining these two formally is the first step to
approach implementing Relationships in CouchDB as a usable general purpose
optional feature (for those who are willing to compromise some things in
return :) ).

Some more thoughts:

   - "Types" in a schemaless JSON data structure can only be determined by
   a function that inspects the document. Hence, there should be
   type-determining functions, or classifiers.
   - Likewise, we have thus far been using a dotted-notation convention to
   address specific attributes. This convention, or a similar one, can be
   used by the relationship module (e.g. "os.version",
   "last_modified.by.user.id"), as long as the keys themselves don't
   contain a period ;)
   - Every relationship will be kept in memory, in much the same way as
   validate_doc_update functions are kept in memory and used for every
   write.
   - The regular document PUT/POST API will fail when an attribute that is
   involved in a relationship is changed on a document of a classifiable
   type.
   - To modify an attribute that is involved in a relationship, a
   "transactional update" API must be used. All the related documents for
   those change(s) must also be submitted through this "bulk_docs"-like
   API (perhaps bulk_docs itself?).
   - The idea is that a client initiating the transactional update will
   fetch all related documents through a helper API, which "denormalizes"
   them and returns them as one larger hash.
   - The transactional update will also reference the defined relationships
   and follow a 3PC (three-phase commit) protocol, where an extra metadata
   field in the document keeps the state of the ongoing "transaction", to
   tolerate failures during other concurrent transactional updates.
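
As a rough illustration of the dotted-notation convention and of the rule
that a regular PUT must fail when a relationship attribute changes, here is
a minimal sketch. The helper names getPath and changedPaths are invented for
this example and are not part of CouchDB; the comparison via JSON.stringify
is a simplifying assumption that is sensitive to key order.

```javascript
// Hypothetical helper: resolve a dotted path such as "os.version".
// Returns undefined when any segment along the path is missing.
function getPath(doc, path) {
  var parts = path.split(".");
  var cur = doc;
  for (var i = 0; i < parts.length; i++) {
    if (cur === null || typeof cur !== "object" || !(parts[i] in cur)) {
      return undefined;
    }
    cur = cur[parts[i]];
  }
  return cur;
}

// Hypothetical check: which relationship-bound paths differ between the
// stored revision and the incoming one? A non-empty result is the case in
// which the proposal says a plain PUT/POST should be rejected.
function changedPaths(oldDoc, newDoc, paths) {
  return paths.filter(function (p) {
    return JSON.stringify(getPath(oldDoc, p)) !==
           JSON.stringify(getPath(newDoc, p));
  });
}
```

Because the path is split on ".", a key that itself contains a period
cannot be addressed, which is the limitation noted in the list above.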

Thus, a design document that describes a relationship would look something
like:

{
  "ClassifiedTypePerson": {
    "classifier": function (doc) {
      // a document is of this type when both marker attributes are present
      if (doc.blah && doc.blah2) {
        return true;
      }
      return false;
    },
    "relationships": [
      { "from": "my.attribute.to.reference.daddy",
        "to": "ClassifiedTypeDaddy", "type": "1:1" },
      { "from": "my.other.attribute.to.reference.kids",
        "to": "ClassifiedTypeChildren", "type": "1:*" }
    ]
  }
}
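
To make the intended use of such a definition a little more concrete, here
is a minimal sketch of how a relationships engine might consult it. The
in-memory `definitions` object mirrors the example above; typesOf and
protectedPaths are hypothetical names, not existing CouchDB functions. Note
that a real design document is JSON, so the classifier would have to be
stored as a string and compiled first; that step is left out here.

```javascript
// Hypothetical in-memory form of the relationship definition shown above.
var definitions = {
  ClassifiedTypePerson: {
    classifier: function (doc) {
      return Boolean(doc.blah && doc.blah2);
    },
    relationships: [
      { from: "my.attribute.to.reference.daddy",
        to: "ClassifiedTypeDaddy", type: "1:1" },
      { from: "my.other.attribute.to.reference.kids",
        to: "ClassifiedTypeChildren", type: "1:*" }
    ]
  }
};

// Run every classifier and return the names of the types that match.
function typesOf(doc, defs) {
  return Object.keys(defs).filter(function (name) {
    return defs[name].classifier(doc) === true;
  });
}

// Collect the "from" paths of all relationships that apply to this document,
// i.e. the attributes a plain PUT/POST would not be allowed to change.
function protectedPaths(doc, defs) {
  var paths = [];
  typesOf(doc, defs).forEach(function (name) {
    defs[name].relationships.forEach(function (rel) {
      paths.push(rel.from);
    });
  });
  return paths;
}
```

A document may match several classifiers, so typesOf returns a list rather
than a single type.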

This is just a sugary way of defining some commonly recurring
auto-validation rules which invariably reference / depend on other
documents and it is not without compromises.

The compromises are:
- one-shard-forever compromise: since this is about infrastructure, the
size of the data-set will fit under 2-4 GB. So even if the entire DB has to
be read by Couch, we don't care. This way, whatever "related" documents
will all be found on the same disk. Unless, we formalize distributed
- unpredictable write times compromise: every write will involve a
predictable number of reads and a predictable failure for those attributes
which are defined under a 'relationship' (attributes with relationships can
be modified only through a separate 'special' API where all the related
documents

What do you think about this? Would people here find this useful in their
day-to-day work? Would the CouchDB devs merge this into mainline CouchDB
if such a patch were submitted?

Regards,

  -Suraj

-- 
An Onion is the Onion skin and the Onion under the skin until the Onion
Skin without any Onion underneath.

-- 
_____________________________________________________________
The information contained in this communication is intended solely for the 
use of the individual or entity to whom it is addressed and others 
authorized to receive it. It may contain confidential or legally privileged 
information. If you are not the intended recipient you are hereby notified 
that any disclosure, copying, distribution or taking any action in reliance 
on the contents of this information is strictly prohibited and may be 
unlawful. If you have received this communication in error, please notify 
us immediately by responding to this email and then delete it from your 
system. The firm is neither liable for the proper and complete transmission 
of the information contained in this communication nor for any delay in its 
receipt.

Re: Modeling Relationships and providing Transactional Integrity

Posted by Robert Samuel Newson <rn...@apache.org>.
_bulk_docs is not transactional at all. all_or_nothing:true used to be, but the semantics changed for the reason Jens notes: it doesn’t work in a sharded context. Now it simply forcibly introduces conflicts if it has to.

B.

On 16 Apr 2014, at 15:32, Jens Alfke <je...@couchbase.com> wrote:

> 
> On Apr 15, 2014, at 8:59 PM, Suraj Kumar <su...@inmobi.com> wrote:
> 
>>>  - To modify an attribute that is involved in a relationship, a
>>>  "transactional update" API must be used. All the related documents for
>>>  those change(s), must also be submitted through this API "bulk_doc"-like
>>>  API (perhaps bulk_docs itself?).
> 
> I don’t think _bulk_docs will provide a transactional update even on a single shard/node — the semantics will be the same as issuing a bunch of individual PUT requests one after the other, i.e. someone else can sneak in and update a doc in between two of your updates. There are warnings about this in the API docs (at least there were in the old docs; haven’t checked the new ones.) 
> 
> There used to be an “all or nothing” mode to _bulk_docs a long time ago that did provide a transactional update, but apparently it was removed because it couldn’t be supported in a clustered environment (BigCouch).
> 
> [The usual disclaimer: I’m not familiar with the code inside CouchDB, only its behavior.]
> 
> —Jens
> 
> PS: Removing the dev@ group from the reply because (a) I’m not subscribed to that, and (b) it’s off-topic.


Re: Modeling Relationships and providing Transactional Integrity

Posted by Jens Alfke <je...@couchbase.com>.
On Apr 15, 2014, at 8:59 PM, Suraj Kumar <su...@inmobi.com> wrote:

>>   - To modify an attribute that is involved in a relationship, a
>>   "transactional update" API must be used. All the related documents for
>>   those change(s), must also be submitted through this API "bulk_doc"-like
>>   API (perhaps bulk_docs itself?).

I don’t think _bulk_docs will provide a transactional update even on a single shard/node — the semantics will be the same as issuing a bunch of individual PUT requests one after the other, i.e. someone else can sneak in and update a doc in between two of your updates. There are warnings about this in the API docs (at least there were in the old docs; haven’t checked the new ones.) 

There used to be an “all or nothing” mode to _bulk_docs a long time ago that did provide a transactional update, but apparently it was removed because it couldn’t be supported in a clustered environment (BigCouch).
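
To illustrate the point about per-document semantics: a _bulk_docs response
is an array with one entry per submitted document, and each entry either
reports success or carries its own error, so one document can be rejected
while the others are written. The sketch below only separates the two
cases; splitBulkResult is a hypothetical helper name, and the ids and revs
in the usage note are made-up sample values.

```javascript
// Split a _bulk_docs response array into accepted and rejected entries.
// Entries with an "error" field (for example "conflict") were not written.
function splitBulkResult(rows) {
  var accepted = [];
  var rejected = [];
  rows.forEach(function (row) {
    if (row.error) {
      rejected.push(row);
    } else {
      accepted.push(row);
    }
  });
  return { accepted: accepted, rejected: rejected };
}
```

For a sample response such as
[{ ok: true, id: "a", rev: "2-abc" }, { id: "b", error: "conflict",
reason: "Document update conflict." }], document "a" has been updated even
though "b" was rejected, which is exactly what a transactional update would
need to prevent.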

[The usual disclaimer: I’m not familiar with the code inside CouchDB, only its behavior.]

—Jens

PS: Removing the dev@ group from the reply because (a) I’m not subscribed to that, and (b) it’s off-topic.

Re: Modeling Relationships and providing Transactional Integrity

Posted by Suraj Kumar <su...@inmobi.com>.
Hello,

On second reading, it appears I made a ton of typos and left some sentences
half-done in my original post. But leaving those aside, has anybody managed
to read it through and give it some thought? Any questions / clarifications?
We'd really like to get started in the right direction. So your advice will
be highly useful.

Thanks,

  -Suraj



