Posted to user@couchdb.apache.org by Suraj Kumar <su...@inmobi.com> on 2014/03/25 08:41:22 UTC

A million databases

I constantly see "one DB per user" being proposed as a solution. But I'm
not entirely convinced that this will truly work for a large-scale
setup.

The reason people choose CouchDB is for high-scale use where one could
potentially end up with a million users. Then, what good is a database that
only relies on the underlying filesystem to do the job of index keeping? If
there are a million "*.couch" files under var/lib/couchdb/, I'd expect the
performance to be very poor / unpredictable since it now depends on the
underlying file system's logic. How can this be partitioned?

What is the "right" way to handle million users with need for isolated
documents within each DB? How will replication solutions cope in
distributing these million databases? 2 million replicating connections
between two servers doesn't sound right.

Regards,

  -Suraj

-- 
An Onion is the Onion skin and the Onion under the skin until the Onion
Skin without any Onion underneath.


Re: A million databases

Posted by "itlists@schrievkrom.de" <it...@schrievkrom.de>.
On 25.03.2014 19:27, Jens Alfke wrote:

> 
> Do modern filesystems still have performance problems with large directories? 
> 

Yes, Windows definitely has problems with a large number of entries
within a directory.


-- 
Marten Feldtmann

Re: A million databases

Posted by Alexander Shorin <kx...@gmail.com>.
On Tue, Mar 25, 2014 at 10:27 PM, Jens Alfke <je...@couchbase.com> wrote:
> On Mar 25, 2014, at 12:41 AM, Suraj Kumar <su...@inmobi.com> wrote:
>
>> If there are a million "*.couch" files under var/lib/couchdb/, I'd expect the
>> performance to be very poor / unpredictable since it now depends on the
>> underlying file system's logic.
>
> Do modern filesystems still have performance problems with large directories? I’m sure none of them are representing directories as linear arrays of inodes anymore. I’ve been wondering if this is just folk knowledge that’s no longer relevant.

Most of the issues come from tools that can't operate on such numbers
of files effectively. That ruins the usability of having a billion
files in a single directory.

Also: http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler.pdf

As for Windows... never try to open a directory with thousands of files
in it with the default file manager, Explorer.


--
,,,^..^,,,

Re: A million databases

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 25, 2014, at 11:36 AM, Stanley Iriele <si...@gmail.com> wrote:

> What is the difference between a filtered changes feed and the Sync
> Gateway? Are they that comparable?

The Sync Gateway has a notion called “channels”, which are sort of like tags that get applied to a document revision by an app-supplied JS “sync function” when it’s added to the database.

Clients can then filter their pull replications by channel, by using the magic filter name “sync_gateway/bychannel” and a query parameter “channels” whose value is a comma-separated list of channel names. This works with CouchDB, Cloudant, PouchDB, etc. as well as Couchbase Lite.
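
For example, kicking off such a channel-filtered pull replication through CouchDB's _replicate endpoint could look like this minimal sketch in Python (hostnames, ports, and database names are placeholders):

import requests

# Pull only the revisions tagged with the "news" or "sports" channels from
# a Sync Gateway database into a local CouchDB database. All hosts and
# names below are placeholders.
requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "http://sync-gateway.example:4984/mydb",
        "target": "localdb",
        "filter": "sync_gateway/bychannel",
        "query_params": {"channels": "news,sports"},
    },
).raise_for_status()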

The big difference from traditional filtered replication is that the _changes feed can filter by channels very efficiently. It doesn’t have to load (or call a JS function on) every revision; it’s got per-channel change indexes it can consult.

Channels are also used for access control. Every user account has a set of channels that it’s allowed to access, and it can’t read documents that aren’t tagged with one or more of its channels. This is the thing that lets you avoid having to create a database per user. Also, an unfiltered pull replication will automatically be limited to the set of the user’s channels, so if you ignore filters you get all the docs you’re allowed to see and nothing more.

(Hopefully this isn’t off-topic. I believe the Sync Gateway is a valid member of the CouchDB family of databases, albeit perhaps that red-headed cousin off in the corner…)

—Jens

Re: A million databases

Posted by Stanley Iriele <si...@gmail.com>.
What is the difference between a filtered changes feed and the Sync
Gateway? Are they that comparable?
On Mar 25, 2014 11:27 AM, "Jens Alfke" <je...@couchbase.com> wrote:

>
> On Mar 25, 2014, at 12:41 AM, Suraj Kumar <su...@inmobi.com> wrote:
>
> > If there are a million "*.couch" files under var/lib/couchdb/, I'd
> expect the
> > performance to be very poor / unpredictable since it now depends on the
> > underlying file system's logic.
>
> Do modern filesystems still have performance problems with large
> directories? I'm sure none of them are representing directories as linear
> arrays of inodes anymore. I've been wondering if this is just folk
> knowledge that's no longer relevant.
>
> > What is the "right" way to handle million users with need for isolated
> > documents within each DB? How will replication solutions cope in
> > distributing these million databases? 2 million replicating connections
> > between two servers doesn't sound right.
>
> The exploding number of replications is the main scalability problem,
> IMHO. Especially since filtered replications don't scale well.
>
> One solution to this is the Couchbase Sync Gateway, which adds support for
> isolated and semi-isolated documents within a single database without
> straying too far from the basic CouchDB model. (Disclaimer: I'm the lead
> implementor of this, so I may be biased in its favor :)
>
> --Jens

Re: A million databases

Posted by Jens Alfke <je...@couchbase.com>.
On Mar 25, 2014, at 12:41 AM, Suraj Kumar <su...@inmobi.com> wrote:

> If there are a million "*.couch" files under var/lib/couchdb/, I'd expect the
> performance to be very poor / unpredictable since it now depends on the
> underlying file system's logic.

Do modern filesystems still have performance problems with large directories? I’m sure none of them are representing directories as linear arrays of inodes anymore. I’ve been wondering if this is just folk knowledge that’s no longer relevant.

> What is the "right" way to handle million users with need for isolated
> documents within each DB? How will replication solutions cope in
> distributing these million databases? 2 million replicating connections
> between two servers doesn't sound right.

The exploding number of replications is the main scalability problem, IMHO. Especially since filtered replications don’t scale well.
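
For contrast, a traditional filtered replication needs a JS filter function that CouchDB runs against every changed revision, once per replication. A minimal sketch (all names, hosts, and the "owner" field are placeholders):

import requests

# A per-user filter stored in a design document; "owner" stands in for
# whatever field the application uses to mark document ownership.
requests.put(
    "http://localhost:5984/shareddb/_design/app",
    json={
        "filters": {
            "by_user": "function(doc, req) { return doc.owner === req.query.user; }"
        }
    },
)

# One replication like this per user -- and CouchDB has to invoke the JS
# filter on every changed revision for each of them.
requests.post(
    "http://localhost:5984/_replicate",
    json={
        "source": "shareddb",
        "target": "http://client.example:5984/alice_db",
        "filter": "app/by_user",
        "query_params": {"user": "alice"},
    },
)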

One solution to this is the Couchbase Sync Gateway, which adds support for isolated and semi-isolated documents within a single database without straying too far from the basic CouchDB model. (Disclaimer: I’m the lead implementor of this, so I may be biased in its favor :)

—Jens

Re: A million databases

Posted by Alexander Shorin <kx...@gmail.com>.
On Tue, Mar 25, 2014 at 11:41 AM, Suraj Kumar <su...@inmobi.com> wrote:
> I constantly see "one DB per user" being proposed as a solution. But I'm
> not entirely convinced that this will truly work for a large-scale
> setup.
>
> The reason people choose CouchDB is for high-scale use where one could
> potentially end up with a million users. Then, what good is a database that
> only relies on the underlying filesystem to do the job of index keeping? If
> there are a million "*.couch" files under var/lib/couchdb/, I'd expect the
> performance to be very poor / unpredictable since it now depends on the
> underlying file system's logic. How can this be partitioned?
>
> What is the "right" way to handle million users with need for isolated
> documents within each DB? How will replication solutions cope in
> distributing these million databases? 2 million replicating connections
> between two servers doesn't sound right.

This would be fixed by the BigCouch merge, where information about
databases is stored inside a special database.

As for plain CouchDB, you may use the / (slash) character in a database
name to group databases at the filesystem layer. Say you name user
databases with a hash-like name such as 00007fda. Then you can insert a
slash in the middle to produce the following structure:

0000/
---- 4ea7.couch
---- 7fda.couch
---- dd47.couch
0016/
---- bbc1.couch

and so on. This will reduce the filesystem's poor behavior with a
million files inside a single directory.
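
A minimal sketch of the naming scheme (the hash width, the "u" prefix,
and the localhost URL are illustrative; the prefix is there because
CouchDB database names must begin with a lowercase letter, and the
embedded slash must be sent as %2F in request URLs):

import hashlib
import urllib.parse

import requests

def db_name_for(user_id):
    # Derive a short hex digest and split it so the databases fan out
    # across subdirectories: "u0000/7fda" becomes u0000/7fda.couch on disk.
    digest = hashlib.sha1(user_id.encode("utf-8")).hexdigest()[:8]
    return "u" + digest[:4] + "/" + digest[4:]

name = db_name_for("alice")
# Encode the "/" as %2F when creating the database over HTTP.
requests.put("http://localhost:5984/" + urllib.parse.quote(name, safe=""))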

--
,,,^..^,,,