Posted to dev@couchdb.apache.org by Steven Le Roux <le...@gmail.com> on 2019/03/18 17:32:13 UTC

FoundationDB & Multi tenancy model

Hi everyone.

I'm new here and just discovered the ongoing proposition for CouchDB to
rely upon FDB.

With my team, we were considering providing an HTTP API over FDB in the
form of the CouchDB API definition, so I'm very pleased to see there is
already an ongoing effort for this (even if still a proposition). I've
tried to catch up with all the good discussions on how you could make this
work by mapping to the K/V model, but apologies if I have missed a point.

I'm curious how you plan to manage multi-tenancy while ensuring good
scalability and avoiding hotspotting.

I've read an idea from Mickael with CryptoHash to map the model this way :

{bucket_id}/{cryptohash}  : value

We currently use this CryptoHash mechanism to manage some data in a
multi-tenancy context applied to Time Series.

Here is a simple diagram that summarizes it:

{raw_data} -> ingress component -> {hashed_metadata+data} -> HBase
                                -> {crypted_metadata}     -> HBase
                                -> {crypted_metadata}     -> Directory service

Query -> egress component -> HBase

raw_data is in the metric{tags} format, in the style of
Prometheus/OpenTSDB/Warp10.
Hashed metadata is a pair of 64-bit or 128-bit hashes: hash(metric) +
hash(tags).
The default is 64 bits, but this can lead to collisions in the keyspace
above 1B unique series, where 128-bit hashes are safer.
The egress component queries the Directory service to get the list of
series to read from the store.
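
To put rough numbers on the collision remark, here is a small sketch using the
standard birthday-bound approximation. It assumes the series identifier behaves
like a single uniformly distributed hash of the given width, which simplifies
the hash(metric) + hash(tags) scheme described above; the function name is
made up for illustration.

```python
import math


def collision_probability(n_series: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_series hashes.

    Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the exponent is tiny (the 128-bit case).
    return -math.expm1(-n_series * (n_series - 1) / (2.0 * space))


p64 = collision_probability(1_000_000_000, 64)
p128 = collision_probability(1_000_000_000, 128)
print(f"64-bit hashes, 1B series:  {p64:.4f}")
print(f"128-bit hashes, 1B series: {p128:.3e}")
```

Under that assumption, 1B series with 64-bit hashes gives a collision chance of
a few percent, while with 128-bit hashes it is negligible.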

During authentication, a custom "application" label is embedded into the
labels that end up in the data model and is then hashed, which avoids
conflicts between users. Hashed metadata is suffixed with a timestamp
because that is convenient for Time Series data.
What makes it very useful is:
 - it can still use scans per series (metrics+tags)
 - it avoids hotspotting the cluster and ensures a very good distribution
among nodes
 - it provides authentication through a directory service that acts as an
indirection
 - keys have a consistent length even though metrics or tags can be very long
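
The key layout described above could be sketched roughly as follows. This is
only an illustration: the hash function (BLAKE2b truncated to 64 bits), the
canonical tag encoding, and the names h64 and series_key are assumptions, not
what our ingress component actually does.

```python
import hashlib
import struct
from typing import Mapping


def h64(data: bytes) -> bytes:
    """64-bit hash; BLAKE2b is an arbitrary choice for this sketch."""
    return hashlib.blake2b(data, digest_size=8).digest()


def series_key(app: str, metric: str, tags: Mapping[str, str], ts_ms: int) -> bytes:
    """hash(metric) + hash(tags incl. application label) + big-endian timestamp."""
    labels = dict(tags, application=app)  # tenant label joins the tags
    canonical = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return h64(metric.encode()) + h64(canonical.encode()) + struct.pack(">Q", ts_ms)


k1 = series_key("tenant-a", "cpu.load", {"host": "h1"}, 1000)
k2 = series_key("tenant-a", "cpu.load", {"host": "h1"}, 2000)
k3 = series_key("tenant-b", "cpu.load", {"host": "h1"}, 1000)

# Same series: same 16-byte prefix, ordered by time, so a prefix scan works.
print(k1[:16] == k2[:16], k1 < k2)
# Different tenant: different prefix, even for identical metric and tags.
print(k1[:16] != k3[:16])
```

Every key is 24 bytes regardless of how long the metric name or tags are, and
the hashed prefix spreads series across the keyspace.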

I think this kind of model can perfectly apply to FDB for documents given
that Namespace would be a user application/bucket/...  :

hash ( {NS} + {...} + {DOC_ID} ) / fields / ...
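
As a rough sketch of what that could look like for documents: the components
between the namespace and the document id are left open above, so the function
below simply accepts any sequence of parts. SHA-256 truncated to 16 bytes, the
length-prefixing, and the names doc_prefix and field_key are assumptions for
illustration only.

```python
import hashlib
from typing import Sequence


def doc_prefix(parts: Sequence[str]) -> bytes:
    """Fixed-length prefix from (namespace, ..., doc_id)."""
    h = hashlib.sha256()
    for part in parts:
        raw = part.encode()
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        h.update(len(raw).to_bytes(4, "big"))
        h.update(raw)
    return h.digest()[:16]


def field_key(parts: Sequence[str], field: str) -> bytes:
    """Key for one field of a document: <hashed prefix>/<field>."""
    return doc_prefix(parts) + b"/" + field.encode()


doc = ("tenant-a", "db1", "doc1")
print(field_key(doc, "foo").hex())
print(field_key(doc, "bar").hex())
```

All fields of one document share the same 16-byte prefix, so they stay
adjacent and can be read with a single range scan, while different documents
land in unrelated parts of the keyspace.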

The drawback is that it may require a bit more storage for keys, but the
hashing could be adjusted to the use case. Moreover, managing rights at the
document level would also require additional fields or a few bytes, along
with a directory index (which could be held in memory inside CouchDB, kept
outside in something like Elastic, or stored directly inside FDB).

I realize that just using FDB as a backend is a considerable amount of work,
and that pushing multi-tenancy adds even more, possibly inside CouchDB itself.
For example, tokens could embed rights and bucket ids, which CouchDB would
use to authorize requests and to build the underlying data model with
scalability and optimizations in mind. Also, has anyone considered reaching
out to the FDB guys to try to align the CouchDB document representation with
the Document Layer (
https://foundationdb.github.io/fdb-document-layer/data-modeling.html )?
This would also make CouchDB MongoDB API compatible.

I don't know where the discussions stand, but maybe we could help :)

Re: FoundationDB & Multi tenancy model

Posted by Adam Kocoloski <ko...@apache.org>.

> On Mar 18, 2019, at 10:25 PM, Alex Miller <al...@apple.com.INVALID> wrote:
> 
> 
>> On Mar 18, 2019, at 10:32 AM, Steven Le Roux <le...@gmail.com> wrote:
>> 
>> Also, has anyone considered reaching
>> out to the FDB guys to try to align the CouchDB document representation with
>> the Document Layer ( https://foundationdb.github.io/fdb-document-layer/data-modeling.html )?
>> This would also make CouchDB MongoDB API compatible.
> 
> 
> In reading through the proposals, I’ve been left with the impression that although the FDB document layer and proposed CouchDB layer would potentially overlap in how to persist a JSON object to FDB, the higher level goals are sufficiently different to make this level of sharing seem difficult to achieve.  The document layer would know nothing of revisions or change feeds, and the CouchDB layer would know nothing of indexes or extensions like GridFS.
> 
> Overall, I’ve seen a lot of excitement in being able to have data that’s useable via multiple different APIs, and I agree it would be cool, but from the implementation side I haven’t yet managed to sketch out how to make it work in a modular, extendable way.
> 
>> I'm curious how you plan to manage multi-tenancy while ensuring good
>> scalability and avoiding hotspotting.
> 
> I’ll issue a similar note of caution that FoundationDB isn’t a natively multi-tenant database either.  The transaction rate limiting done in FoundationDB applies globally, so a small set of clients focusing a read hotspot on a single key, or trying to bulk load data in faster than FoundationDB can keep up will impact the latencies seen by other clients connected to the same database.

In fact indexes are next on the docket here, and the DocumentLayer code provided some nice patterns on how to add indexes to existing collections :)

But I completely agree with your larger point — I feel a lot more confident defining a compact data model to specifically support each API and query language than to come up with something sufficiently flexible and extensible to switch back and forth.

Also, Steven - welcome! I know there’s been some discussion of possibly using a hash-based level of indirection in the core data storage to minimize the possibility of write hotspots and keep a predictable key length, and as Bob points out we haven’t quite bottomed out into an RFC on that topic yet. My gut instinct is that this is not the right tradeoff for CouchDB, but I could be convinced otherwise.

Most discussions happen here and in the #couchdb-dev channel on freenode. Would be happy to see if there’s some way we can collaborate. Cheers,

Adam

Re: FoundationDB & Multi tenancy model

Posted by Alex Miller <al...@apple.com.INVALID>.

> On Mar 18, 2019, at 10:32 AM, Steven Le Roux <le...@gmail.com> wrote:
> 
> Also, has anyone considered reaching
> out to the FDB guys to try to align the CouchDB document representation with
> the Document Layer ( https://foundationdb.github.io/fdb-document-layer/data-modeling.html )?
> This would also make CouchDB MongoDB API compatible.

In reading through the proposals, I’ve been left with the impression that although the FDB document layer and proposed CouchDB layer would potentially overlap in how to persist a JSON object to FDB, the higher level goals are sufficiently different to make this level of sharing seem difficult to achieve.  The document layer would know nothing of revisions or change feeds, and the CouchDB layer would know nothing of indexes or extensions like GridFS.

Overall, I’ve seen a lot of excitement in being able to have data that’s useable via multiple different APIs, and I agree it would be cool, but from the implementation side I haven’t yet managed to sketch out how to make it work in a modular, extendable way.

> I'm curious how you plan to manage multi-tenancy while ensuring good
> scalability and avoiding hotspotting.

I’ll issue a similar note of caution that FoundationDB isn’t a natively multi-tenant database either.  The transaction rate limiting done in FoundationDB applies globally, so a small set of clients focusing a read hotspot on a single key, or trying to bulk load data in faster than FoundationDB can keep up will impact the latencies seen by other clients connected to the same database.

Re: FoundationDB & Multi tenancy model

Posted by Robert Newson <rn...@apache.org>.

Hi,

Firstly, CouchDB today does not have multi-tenancy as a feature. Cloudant does, and achieves this by inserting the tenant's name as a prefix on the database name (so "rnewson/db1" is a different database to "sleroux/db1"), with appropriate stripping of the prefix in various responses. I would like to see multi-tenancy carried into CouchDB as a first-class feature, though.
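
A toy illustration of that prefixing idea follows. This is not Cloudant's
code; the function names and the choice of exception are hypothetical.

```python
def to_internal(tenant: str, db_name: str) -> str:
    """Name used internally: the tenant's name prefixed to the database name."""
    return f"{tenant}/{db_name}"


def to_external(tenant: str, internal_name: str) -> str:
    """Strip the tenant prefix before the name appears in a response."""
    prefix = tenant + "/"
    if not internal_name.startswith(prefix):
        raise PermissionError(f"{internal_name!r} does not belong to {tenant!r}")
    return internal_name[len(prefix):]


print(to_internal("rnewson", "db1"))
print(to_internal("sleroux", "db1"))
print(to_external("rnewson", "rnewson/db1"))
```

Both tenants see a database called "db1", but the internal names never clash.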

With that preamble done, each tenant will have a unique label pretty much by definition, and this would be included in all the keys. Running that, or other properties, through a cryptographically secure message digest algorithm achieves nothing but obfuscation and, as you note, the possibility (however remote) of a collision. Crypto isn't magic, even if it looks like magic.

FDB provides the notion of a "Directory", which is a mechanism to help with very long keys, given the key length constraint of 10,000 bytes.

So, instead of representing a doc of {"foo":12} in "db1" of my "rnewson" account simply as:

/couchdb/rnewson/db1/doc1/foo => 12

we could create a Directory for the prefix "/couchdb/rnewson/db1" instead:

dirspace/couchdb/rnewson/db1 => 0x01
0x01/doc1/foo => 12
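
The indirection can be mimicked with a toy in-memory model. To be clear, this
is not the FDB directory layer or its API, just a sketch of the idea; the
ToyDirectory class and the plain dict standing in for the store are made up.

```python
class ToyDirectory:
    """In-memory stand-in: maps a long path to a short allocated prefix."""

    def __init__(self) -> None:
        self._prefixes: dict = {}
        self._next_id = 1

    def create_or_open(self, path: tuple) -> bytes:
        if path not in self._prefixes:
            # One byte mirrors the 0x01 in the example above;
            # to_bytes raises OverflowError after 255 directories.
            self._prefixes[path] = self._next_id.to_bytes(1, "big")
            self._next_id += 1
        return self._prefixes[path]


directory = ToyDirectory()
store = {}  # stands in for the key/value store

prefix = directory.create_or_open(("couchdb", "rnewson", "db1"))
store[prefix + b"/doc1/foo"] = 12

long_key = b"/couchdb/rnewson/db1/doc1/foo"
short_key = prefix + b"/doc1/foo"
print(len(long_key), len(short_key))
```

The long path is stored once, and every document key under it only pays for
the short prefix.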

We're overdue for the Document Model RFC that would make this explicit.

Finally, I think we're past the "proposition" stage as there is broad agreement (and no disagreement) from the conversations already had. We are a little behind on writing and publishing the RFCs that will describe the full work, though.

B.

-- 
  Robert Samuel Newson
  rnewson@apache.org
