Posted to user@couchdb.apache.org by Ronan Jouchet <ro...@cadensimaging.com> on 2016/01/05 20:05:11 UTC

Storing immutable docs & accessing all docs as a reduction

Hi couchdb-users!

After learning about Datomic [DAT] and reading advice like [CLD], I am 
experimenting with considering my Couch documents immutable and would 
love your feedback.

I'm not talking about Couch's underlying data structures, I'm talking 
about never updating/deleting documents. That is, instead of modeling a 
2-user document modification scenario as:

- PUT doc123 by alice with:
   {text: "hello"}

- PUT doc123 by bob with:
   {text: "hello world", _rev: "rev_of_initial_doc"}

I would be modeling the same scenario as a series of "facts":

- PUT doc123-create by alice with:
   {text: "hello", created_at: "2016-01-01"}

- PUT doc123-modification-1fca291d by bob with:
   {text: "hello world", created_at: "2016-01-02"}

Similarly, operations like deletion, un-deletion, etc. can be covered 
through conventions (stored as fields or part of the document _id).
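
To make the convention concrete, here is a minimal sketch of how such fact documents could be built. The helper name `makeFact`, the `doc_id` and `kind` fields and the random suffix are assumptions for illustration only, not part of any CouchDB API.

```javascript
const { randomBytes } = require("crypto");

// Hypothetical helper: builds an insert-only "fact" document.
// `doc_id` and `kind` are illustrative field names.
function makeFact(docId, kind, fields, createdAt) {
  // "create" facts get a fixed _id; other kinds get a random suffix
  // so that two concurrent facts are very unlikely to share an _id.
  const suffix = kind === "create" ? "" : "-" + randomBytes(4).toString("hex");
  return Object.assign({}, fields, {
    _id: docId + "-" + kind + suffix,
    doc_id: docId,
    kind: kind,
    created_at: createdAt,
  });
}

const facts = [
  makeFact("doc123", "create", { text: "hello" }, "2016-01-01"),
  makeFact("doc123", "modification", { text: "hello world" }, "2016-01-02"),
];
console.log(facts.map((fact) => fact._id));
```

Because every fact gets a fresh _id, each PUT creates a new document and no existing revision has to be supplied.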

Then, to access document "doc123",

- In the vanilla revision-based world, it's simply:
   GET .../mydb/doc123
       or a call to _all_docs with key/startkey/endkey clauses as needed

- In the immutable world, a view does the reduction:
   GET .../mydb/_design/case/_view/immutable_docs_?group_level=1
       (with key/startkey/endkey clauses as needed)

   In my implementation, a map function emits composite keys:
     `emit([doc.doc_id, doc.created_at])`

   ... thus yielding (post-map, pre-reduce):
       key: ["doc123", "2016-01-01"], value: {text: "hello"}
       key: ["doc123", "2016-01-02"], value: {text: "hello world"}

   Then with "group_level=1" I am able to reduce these two "facts",
   using a reduce function that starts with an empty object and
   applies successive changes, ending up with the reduced object:
       key: ["doc123"], value: {text: "hello world"}
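
The map step described above can be sketched as follows. This is a local simulation, not the actual prototype: the `emit` stub stands in for the function CouchDB's view server provides, the `doc_id` field on each fact is an assumption, and the sort is only a rough approximation of view collation for string keys.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: one row per fact, keyed by [logical id, timestamp].
// `doc_id` is an assumed field; the value carries the fact's payload.
function map(doc) {
  if (doc.doc_id && doc.created_at) {
    emit([doc.doc_id, doc.created_at], { text: doc.text });
  }
}

// Facts are fed in "wrong" order on purpose.
[
  { _id: "doc123-modification-1fca291d", doc_id: "doc123",
    text: "hello world", created_at: "2016-01-02" },
  { _id: "doc123-create", doc_id: "doc123",
    text: "hello", created_at: "2016-01-01" },
].forEach(map);

// Rough approximation of view ordering for keys made of strings only.
rows.sort((a, b) => {
  for (let i = 0; i < a.key.length; i++) {
    if (a.key[i] !== b.key[i]) return a.key[i] < b.key[i] ? -1 : 1;
  }
  return 0;
});
console.log(JSON.stringify(rows));
```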

I have a prototype that does just that, and it seems to work.
Now, considering the following hypotheses:

- I understand it means more work to access data
   (these views are not going to build themselves)

- Volume-wise, I'm not expecting millions of documents, and let's
   assume at worst 100 "facts" per document.

- Unlike in [CLD], our motivation for trying immutability is
   not frequently-changing data: we anticipate slowly changing data
   (e.g. >10s between changes to the same document). We're more
   interested in traceability benefits for a regulated environment
   (no deletions, and an audit log / history becomes trivial:
    just don't reduce).

- The immutability model seems to have been tried in the Couch
   universe, as it sounds similar to Cloudant's advice [CLD] to
   "Consider Immutable Data" where, as they say, "data models based on
   immutable data require the use of views to summarize the documents
   which comprise the current state". Except, again, we'd be in it for
   the traceability, not conflict avoidance.
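
As an illustration of the "just don't reduce" point, here is a sketch of the two view queries for a single logical document. The base URL is a placeholder and `viewUrl` is a hypothetical helper; the design doc and view names are the ones used earlier in this post, and `reduce`, `group_level`, `startkey` and `endkey` are standard view query parameters.

```javascript
// Hypothetical base URL; design doc and view names come from the post.
const VIEW_URL =
  "http://localhost:5984/mydb/_design/case/_view/immutable_docs_";

// Builds a view URL restricted to one logical document.
function viewUrl(docId, options) {
  const params = new URLSearchParams();
  // startkey/endkey must be JSON-encoded. {} sorts after any string in
  // CouchDB's collation, so this range covers every [docId, *] key.
  params.set("startkey", JSON.stringify([docId]));
  params.set("endkey", JSON.stringify([docId, {}]));
  for (const [name, value] of Object.entries(options)) {
    params.set(name, String(value));
  }
  return VIEW_URL + "?" + params.toString();
}

const historyUrl = viewUrl("doc123", { reduce: false });  // audit log
const currentUrl = viewUrl("doc123", { group_level: 1 }); // reduced state
console.log(historyUrl);
console.log(currentUrl);
```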

My questions are:

1. Does it sound like a good idea, or is it perverting CouchDB's model,
    and should we stick with good ol' revisions?

2. *If* that doesn't sound abysmally perverted, I have a key ordering
    problem with my reduction function proof-of-concept:

    a. The results of my map emitting composite keys are correctly
       sorted at group_level=2:
       ["doc123", "2016-01-01"], ["doc123", "2016-01-02"].

    b. But as soon as I start reducing at group_level=1, the `values`
       passed to my reduce function seem to always be in the reverse
       order. That is, a `log(value.created_at);` in the body of my
       reduce function will print:
           2016-01-02
           2016-01-01
       I expected the opposite! And as mentioned in point a., the rows
       are sorted at group_level=2 (ungrouped)! Note that the order is
       not lexicographic: it seems to always be reverse-sorted by the
       group_level+1 key element.

       Now, if I pass `&descending=true` to my view, the view becomes
       reversed (I'll see doc456 before doc123, which I do *not* want),
       but now the same logging in my reduce function correctly prints:
           2016-01-01
           2016-01-02

    -> Any idea what's wrong and how I can work around this? That is,
       how can I ensure that:
       - level=1 order is *not* reversed
       - level=2 reduction is done with reduce `values` ordered
         by level=2 composite key (and not reversed), without me
         manually re-sorting `values`, which was already done by Couch!

Thanks for your help, and happy new year! :)

References ----

[DAT] http://www.datomic.com/

[CLD] 
https://cloudant.com/blog/my-top-5-tips-for-modelling-your-data-to-scale/

-- 
Ronan

Re: Storing immutable docs & accessing all docs as a reduction

Posted by Leontius Adhika Pradhana <le...@gmail.com>.
Although I haven't finished implementing it yet, what I'm thinking about is
to have the app listen to the _changes feed of the "n-fact" docs. Then, for
each "n-fact" doc that is created, the app applies the reduce logic to the
corresponding doc in another couchdb database. Using the seq id, this should
be doable atomically.

As the "n-fact" docs are insert-only (never modified), this ensures that
the "n-fact" docs are processed in the order of the *time they are
inserted* in the database (this should also hold if an "n-fact" arrives
through replication, if that matters).
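
A rough in-memory sketch of that idea, under several assumptions: `changes` mimics rows of a _changes feed fetched with include_docs=true, `target` stands in for the second database, seq values are plain integers (clustered setups may use opaque strings instead), and the checkpoint is kept in a variable rather than stored atomically with the target doc.

```javascript
// In-memory stand-ins for the feed and for the second database.
const changes = [
  { seq: 1, doc: { _id: "doc123-create", doc_id: "doc123",
                   text: "hello" } },
  { seq: 2, doc: { _id: "doc123-modification-1fca291d", doc_id: "doc123",
                   text: "hello world" } },
];
const target = new Map();
let checkpoint = 0; // last seq that was fully applied

// Applies every change newer than the checkpoint, in feed order.
function applyChanges(feed) {
  for (const change of feed) {
    if (change.seq <= checkpoint) continue; // already processed
    // Drop the fact's own _id and doc_id; the rest is the payload.
    const { _id, doc_id, ...fields } = change.doc;
    const current = target.get(doc_id) || { _id: doc_id };
    target.set(doc_id, Object.assign({}, current, fields));
    checkpoint = change.seq;
  }
}

applyChanges(changes);
applyChanges(changes); // replay is harmless: nothing is newer than checkpoint
console.log(target.get("doc123"), checkpoint);
```

A real version would need to persist the checkpoint together with the target doc update, which this sketch does not attempt.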



Regards,

Leontius Adhika Pradhana (Leon)

On Thu, Jan 7, 2016 at 4:45 AM, Ronan Jouchet <
ronan.jouchet@cadensimaging.com> wrote:

> Hi Leontius, everybody.
>
> On 01/06/2016 12:35 AM, Leontius Adhika Pradhana wrote:
>
>> Re #1. I have been thinking along the same lines lately. My biggest
>> gripe with this concept is that couchdb isn't able to perform further
>> mapreduce to the results of a reduce (i.e. the "latest state of the
>> doc"). Thus, we can't query the resulting docs at all except by key.
>> Of course we could copy the result of the reduce to another doc/db to
>> write views on, but then conflict issues would come up again like
>> normal docs.
>> What's your take on this issue?
>>
>
> Fully agreed, nothing to add.
>
> Re #2. Take a look at
>>
>> http://stackoverflow.com/questions/20303355/why-do-couchdb-reduce-functions-have-to-be-commutative
>> Basically, reduce functions are expected to be commutative
>> (f(a,b) == f(b,a)), so it is perfectly legal for couchdb to shuffle
>> the arguments to reduce functions.
>>
>
> Thanks, didn't know about that. To me that means to:
>
>  - Either (re-)sort the `values` array by date before the reduce body.
>
>  - Or maybe abandon altogether Couch-side reduction and instead get the
>    unreduced (group_level=2) data, then reduce it client-side.
>
>    * That means more data transferred, which is okay in my case but
>      might be overkill/unrealistic for most web apps. Or maybe it's
>      negligible with a moderate number of documents and given that
>      text compresses well.
>
>    * But having the whole historical facts behind each document might
>      be convenient anyway (and occasionally avoid network requests).
>
>  - Something else? Leontius, how did you work around that in your own
>    n-facts -> 1-pseudodoc reduce function?
>

Re: Storing immutable docs & accessing all docs as a reduction

Posted by Ronan Jouchet <ro...@cadensimaging.com>.
Hi Leontius, everybody.

On 01/06/2016 12:35 AM, Leontius Adhika Pradhana wrote:
> Re #1. I have been thinking along the same lines lately. My biggest
> gripe with this concept is that couchdb isn't able to perform further
> mapreduce to the results of a reduce (i.e. the "latest state of the
> doc"). Thus, we can't query the resulting docs at all except by key.
> Of course we could copy the result of the reduce to another doc/db to
> write views on, but then conflict issues would come up again like
> normal docs.
> What's your take on this issue?

Fully agreed, nothing to add.

> Re #2. Take a look at
> http://stackoverflow.com/questions/20303355/why-do-couchdb-reduce-functions-have-to-be-commutative
> Basically, reduce functions are expected to be commutative
> (f(a,b) == f(b,a)), so it is perfectly legal for couchdb to shuffle
> the arguments to reduce functions.

Thanks, didn't know about that. To me that means:

  - Either (re-)sort the `values` array by date before the reduce body.

  - Or maybe abandon Couch-side reduction altogether and instead get the
    unreduced (group_level=2) data, then reduce it client-side.

    * That means more data transferred, which is okay in my case but
      might be overkill/unrealistic for most web apps. Or maybe it's
      negligible with a moderate number of documents and given that
      text compresses well.

    * But having all the historical facts behind each document might
      be convenient anyway (and occasionally avoid network requests).

  - Something else? Leontius, how did you work around that in your own
    n-facts -> 1-pseudodoc reduce function?
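
For the client-side option, a minimal sketch could look like the following. The `response` object is made-up sample data shaped like a view response to ?reduce=false, `currentStates` is a hypothetical helper, and timestamps are assumed to be ISO-style strings that compare correctly as text.

```javascript
// Sample of what ?reduce=false might return, deliberately out of order.
const response = {
  rows: [
    { key: ["doc123", "2016-01-02"], value: { text: "hello world" } },
    { key: ["doc456", "2016-01-03"], value: { text: "other" } },
    { key: ["doc123", "2016-01-01"],
      value: { text: "hello", author: "alice" } },
  ],
};

// Folds rows into one object per logical document, oldest fact first,
// so later facts overwrite the fields set by earlier ones.
function currentStates(rows) {
  const sorted = rows.slice().sort((a, b) =>
    a.key[1] < b.key[1] ? -1 : a.key[1] > b.key[1] ? 1 : 0
  );
  const states = {};
  for (const row of sorted) {
    const docId = row.key[0];
    states[docId] = Object.assign(states[docId] || {}, row.value);
  }
  return states;
}

console.log(currentStates(response.rows));
```

Sorting explicitly on the client means the result no longer depends on the order in which rows arrive.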

Re: Storing immutable docs & accessing all docs as a reduction

Posted by Leontius Adhika Pradhana <le...@gmail.com>.
Hi Ronan,

Re #1.
I have been thinking along the same lines lately. My biggest gripe with
this concept is that couchdb isn't able to perform further mapreduce to the
results of a reduce (i.e. the "latest state of the doc"). Thus, we can't
query the resulting docs at all except by key. Of course we could copy the
result of the reduce to another doc/db to write views on, but then conflict
issues would come up again like normal docs.

What's your take on this issue?

Re #2.
Take a look at
http://stackoverflow.com/questions/20303355/why-do-couchdb-reduce-functions-have-to-be-commutative
Basically, reduce functions are expected to be commutative (f(a,b) ==
f(b,a)), so it is perfectly legal for couchdb to shuffle the arguments to
reduce functions.

Disclaimer: my experience with couchdb is severely limited, and I have not
done any production deployments with it (the first is probably in the near
future).