Posted to dev@couchdb.apache.org by Tony Sun <to...@gmail.com> on 2020/07/17 16:21:50 UTC

[DISCUSS] Active Tasks Response Change for FDB Layer Branch

Hi all,

   I recently started implementing _active_tasks for our fdb development
branch. At first, I thought it would be trivial, but technical limitations
have led me to modify our response as an interim solution. I'd like to get
more feedback on this solution and start a discussion on a more
accurate/correct solution going forward.

*Problem:*
Most active tasks rely upon a "Total" value to determine progress. This
relies on `count_changes_since/2` :
https://github.com/apache/couchdb/blob/master/src/couch/src/couch_db_engine.erl#L634-L652

I cannot think of an efficient way of implementing this on top of fdb.
Paul has probably thought about this more deeply during the initial layer
design phase, but I may have missed some of those discussions.

Since Couch 2.0, our update_seq string has had a snapshot of the number of
changes prepended. This also does not exist in the fdb-layer branch.

Ultimately, there is no way to calculate the total number of changes for
a given update_seq.

*Proposed Solution:*
We simply send out the versionstamp of the db sequence we are trying to
reach, and the current versionstamp. So the responses look something like:

[
    {
        "node": "node1@127.0.0.1",
        "pid": "<0.622.0>",
        "changes_done": 199,
        "current_version_stamp": "8131141649532-198",
        "database": "testdb",
        "db_version_stamp": "8131141649532-999",
        "design_document": "_design/example",
        "started_on": 1594703583,
        "type": "indexer",
        "updated_on": 1594703586
    }
]

[
    {
        "node": "node1@127.0.0.1",
        "pid": "<0.1030.0>",
        "changes_done": 1000,
        "current_version_stamp": "8131194932916-999",
        "database": "testdb",
        "db_version_stamp": "8131194932916-999",
        "design_document": "_design/example",
        "started_on": 1594703636,
        "type": "indexer",
        "updated_on": 1594703665
    }
]

The user can utilize the changes_done (this is just a running counter for
that task process), current_version_stamp, and db_version_stamp to gauge if
the task is making progress.
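
As a rough illustration of how a client might use these three fields, here is a minimal Python sketch. It assumes each stamp is a string of the form "<version>-<batch>" made of two decimal integers, which is inferred from the sample responses above and is not a documented contract. The helper names and the second poll are hypothetical.

```python
def parse_stamp(stamp):
    """Split '<version>-<batch>' into a tuple of ints (format assumed from the samples)."""
    version, batch = stamp.split("-")
    return int(version), int(batch)


def is_caught_up(task):
    """True once the task's current stamp has reached the target db stamp."""
    current = parse_stamp(task["current_version_stamp"])
    target = parse_stamp(task["db_version_stamp"])
    return current >= target


def made_progress(earlier, later):
    """Compare two polls of the same task: did the counter or the stamp advance?"""
    return (
        later["changes_done"] > earlier["changes_done"]
        or parse_stamp(later["current_version_stamp"])
        > parse_stamp(earlier["current_version_stamp"])
    )


# Fields copied from the first sample response above.
poll_1 = {
    "changes_done": 199,
    "current_version_stamp": "8131141649532-198",
    "db_version_stamp": "8131141649532-999",
}
# Hypothetical later poll of the same task, invented for illustration.
poll_2 = dict(poll_1, changes_done=450, current_version_stamp="8131141649532-449")

print(is_caught_up(poll_1))
print(made_progress(poll_1, poll_2))
```

Note that this only tells the client whether the task is moving and whether it has finished. It cannot give a percentage, since the number of changes between the two stamps is unknown.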

My concern is that this is a breaking change for users that rely on the
"total_changes" and "progress" fields.

I've opened a PR for this and have gotten good feedback on some
implementation details but would love to get consensus on the response
format: https://github.com/apache/couchdb/pull/3003

*Moving Forward:*
I've read a few foundationdb forum posts and the topic of "Can I get the
changes to the DB, given a versionstamp?" has been discussed a few times.
I'm not sure it will be done on the fdb end anytime soon. I briefly
considered adding another b-tree in memory, but that seems overkill just
for this Total feature.

Thanks,

Tony

Re: [DISCUSS] Active Tasks Response Change for FDB Layer Branch

Posted by Ilya Khlopotov <ii...@apache.org>.
I wanted to share a failed approach just to save some time for anyone who is thinking about the issue. 

I thought we could change the key format of a by_seq index at the expense of 
an extra LAST_LESS_THAN call on every doc write. 

Currently the keys in the by_seq index are `("changes", Sequence)`,
where Sequence is the sequence of the last transaction that modified the document.

We could change the key to be of the form `("changes", Sequence, IncNumber)`.
We would maintain the IncNumber as follows:
on each call to fabric2_db:write_doc we would also look up the key selected by erlfdb_key:last_less_than for `("changes", Sequence)`.
We would parse the returned key to extract the last IncNumber.
Then we would create the by_seq entry under `("changes", Sequence, IncNumber+1)`.

Currently we use the by_seq index only in the changes feed. There we fold over a range, which means we can just ignore the value of IncNumber.
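
The write path described above can be sketched with a toy in-memory model. This is Python rather than Erlang and does not call erlfdb. The class and method names are hypothetical, sequences are plain integers instead of versionstamps, and the model does not cover transactions or removal of older entries.

```python
import bisect


class BySeqIndex:
    """Toy stand-in for the by_seq keyspace: a sorted list of tuple keys."""

    def __init__(self):
        self.keys = []  # sorted ("changes", sequence, inc_number) tuples

    def last_less_than(self, key):
        """Return the greatest stored key strictly less than `key`, or None."""
        i = bisect.bisect_left(self.keys, key)
        return self.keys[i - 1] if i > 0 else None

    def write_doc(self, sequence):
        """Add the by_seq entry for a doc update committed at `sequence`."""
        prev = self.last_less_than(("changes", sequence))
        inc_number = prev[2] + 1 if prev else 1
        bisect.insort(self.keys, ("changes", sequence, inc_number))
        return inc_number

    def changes(self):
        """Changes-feed style fold: yield sequences and ignore inc_number."""
        for _, sequence, _ in self.keys:
            yield sequence


index = BySeqIndex()
for seq in (10, 20, 30):
    index.write_doc(seq)
print(index.keys)
print(list(index.changes()))
```

Each write performs one extra lookup of the current tail of the index before it can build its own key.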

The count_changes_since functionality would be implemented as follows:

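count_changes_since(Db, SinceSeq) ->
    % Sketch only: assumes Db carries tx and db_prefix, and SinceSeq is a versionstamp.
    #{tx := Tx, db_prefix := DbPrefix} = Db,
    LastSeq = fabric2_util:seq_max_vs(),
    LastSel = erlfdb_key:last_less_than(erlfdb_tuple:pack({?DB_CHANGES, LastSeq}, DbPrefix)),
    SinceSel = erlfdb_key:first_greater_than(erlfdb_tuple:pack({?DB_CHANGES, SinceSeq}, DbPrefix)),
    % Resolve both key selectors to concrete keys before unpacking them.
    LastKey = erlfdb:wait(erlfdb:get_key(Tx, LastSel)),
    SinceKey = erlfdb:wait(erlfdb:get_key(Tx, SinceSel)),
    {?DB_CHANGES, _, LastInc} = erlfdb_tuple:unpack(LastKey, DbPrefix),
    {?DB_CHANGES, _, SinceInc} = erlfdb_tuple:unpack(SinceKey, DbPrefix),
    LastInc - SinceInc.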
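
To make the arithmetic concrete, here is the same idea over a toy sorted list of keys in Python. It is a model of the approach, not erlfdb code: MAX_SEQ and the helper names are hypothetical, and sequences are plain integers. In this toy model the subtraction equals the number of entries strictly after since_seq when since_seq is itself a key in the index. Other cases would need extra handling.

```python
import bisect

MAX_SEQ = 2**64  # hypothetical upper bound standing in for the max versionstamp


def last_less_than(keys, probe):
    """Greatest key strictly less than `probe`, or None."""
    i = bisect.bisect_left(keys, probe)
    return keys[i - 1] if i > 0 else None


def first_greater_than(keys, probe):
    """Smallest key strictly greater than `probe`, or None."""
    i = bisect.bisect_right(keys, probe)
    return keys[i] if i < len(keys) else None


def count_changes_since(keys, since_seq):
    """keys: sorted list of ("changes", sequence, inc_number) tuples."""
    last_key = last_less_than(keys, ("changes", MAX_SEQ))
    since_key = first_greater_than(keys, ("changes", since_seq))
    if last_key is None or since_key is None:
        return 0
    # A 2-tuple probe sorts before any 3-tuple sharing its prefix, so
    # since_key is the entry at since_seq itself when one exists.
    return last_key[2] - since_key[2]


keys = [("changes", 10, 1), ("changes", 20, 2), ("changes", 30, 3)]
print(count_changes_since(keys, 10))
```

Only two point lookups are needed, no matter how many entries lie between the two sequences. That is what made the key format attractive.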

The downsides of this approach:

1. An extra range read for each doc write (on the critical path).
2. An increase in the number of read conflicts: we would fail every time docs are updated in parallel, even when they are different docs. I think this is a show stopper.
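
The second downside can be illustrated with a toy model of optimistic concurrency in Python. This is not FoundationDB's actual conflict detection. It only captures the idea that every writer reads the tail of by_seq and every writer also changes it. All names are hypothetical.

```python
class ConflictError(Exception):
    """Raised when a commit's reads were invalidated by another commit."""


class ToyBySeq:
    def __init__(self):
        self.last_inc = 0  # IncNumber stored in the newest by_seq key
        self.version = 0   # bumped on every successful commit

    def begin(self):
        """Start a doc write: read the tail of the index, as last_less_than would."""
        return {"read_version": self.version, "last_inc": self.last_inc}

    def commit(self, txn):
        """Commit a doc write; fail if the tail changed since it was read."""
        if txn["read_version"] != self.version:
            raise ConflictError("tail of by_seq changed since it was read")
        self.last_inc = txn["last_inc"] + 1
        self.version += 1
        return self.last_inc


def write_two_docs_in_parallel():
    store = ToyBySeq()
    txn_a = store.begin()  # update to doc A
    txn_b = store.begin()  # update to unrelated doc B, started concurrently
    outcomes = [("A", store.commit(txn_a))]
    try:
        store.commit(txn_b)
    except ConflictError:
        outcomes.append(("B", "conflict"))
        # The retry re-reads the tail and then succeeds.
        outcomes.append(("B retry", store.commit(store.begin())))
    return outcomes


print(write_two_docs_in_parallel())
```

In this model the two writes touch different docs, yet the second one still has to retry, because both depend on the same last key.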

Best regards,
iilyak

