Posted to user@couchdb.apache.org by Jan Lehnardt <ja...@apache.org> on 2008/04/29 19:09:08 UTC

Fw: CouchDB, RackLabs, MailTrust, etc.

Heya list,
this is a conversation that went off-list but might
be of value here, and could be continued on the list.



Begin forwarded message:

> [...]
>> If you have the time to write, I do have a few general questions  
>> for you. I'll try to quickly describe our situation here and where  
>> I think CouchDB might fit, at least initially.
>> We have lots of fun trying to scale everything around here with our  
>> fast growth. Our local dev teams here in San Antonio work on a  
>> ticketing system with millions of tickets using Postgres and Python  
>> mostly. We're also just deploying a Xapian search index  
>> infrastructure to try to offload some of the workload now weighing  
>> heavily on Postgres.
>>
>> I'm specifically looking at CouchDB for a workflow system addition,  
>> where the workflow definitions themselves and the instances of  
>> those workflows in action will be stored.
>
> I know of a research project here near Berlin
> that deals with exactly this, a workflow system,
> possibly built on top of CouchDB (they've been
> researching for a while now and only recently
> discovered CouchDB, but they had so many
> issues getting things running on Postgres that
> they immediately considered CouchDB). Anyway,
> maybe I should put you guys in contact?

[Hagen, that is for you.]


>> The activity would somewhat mirror our ticketing system, though not
>> in data volume, certainly in data change volume. I can get
>> better stats for you later, but I would guess 2-4 thousand new  
>> workflow instances a day and probably that many older workflow  
>> instances updated per day as well. Likely, there'd be about 50  
>> thousand "recently updated (within a few weeks)" workflow instances  
>> and eventually a few million inactive instances for archival  
>> querying. I'm afraid I can't even make a good guess at the read  
>> volume right now. The scary part for us is that our volume has been  
>> doubling each year so far.
>
> Scary indeed, but good to know ;)
>
>
>> I know CouchDB has been built for scalability, but what sorts of  
>> use cases have you guys tried out, especially with regard to large  
>> data sets? What other items do you think might affect the system I  
>> generally described above?
>
> [...]
>
> The core feature for scaling CouchDB is replication.
> That is an asynchronous, N-directional, rsync-like
> operation that brings N nodes to the same level
> data-wise.
>
> This allows you to build any combination of
> master-slave (-slave-slave…), master-master
> and N-master replication topologies. The latter
> is effectively distributed peer-to-peer replication
> that even works with nodes that are only
> occasionally online.
>
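
[For the list: triggering replication is itself just an HTTP call. A
minimal sketch using Python's standard library; the host and database
names are invented, and /_replicate asks a node to sync one database
into another. Master-master is simply both directions:]

    import json
    from urllib.request import Request, urlopen

    def replicate(node, source, target):
        # Ask `node` to copy everything `target` is missing from `source`.
        body = json.dumps({"source": source, "target": target}).encode()
        req = Request(node + "/_replicate", data=body,
                      headers={"Content-Type": "application/json"})
        return json.load(urlopen(req))

    a = "http://node-a.example.com:5984"
    b = "http://node-b.example.com:5984"

    # One call per direction gives you master-master replication.
    replicate(a, b + "/workflows", "workflows")
    replicate(b, a + "/workflows", "workflows")
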
> CouchDB uses an HTTP API, which allows you to put
> all the usual suspects (proxies, load balancers etc.)
> in front of it.
>
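
[Concretely, "plain HTTP" means any stock client, proxy or cache can
sit in the middle. A document round-trip sketch, again Python stdlib;
the database and document names are invented:]

    import json
    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    base = "http://localhost:5984/workflows"

    try:
        urlopen(Request(base, method="PUT"))   # create the database
    except HTTPError:
        pass                                   # it already exists

    # PUT a document under an id of our choosing ...
    doc = json.dumps({"type": "workflow", "state": "open"}).encode()
    urlopen(Request(base + "/wf-0001", data=doc, method="PUT",
                    headers={"Content-Type": "application/json"}))

    # ... and read it back with a plain GET. Nothing in between
    # needs to know it is talking to CouchDB.
    print(json.load(urlopen(base + "/wf-0001")))
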
> Since multi-master replication only helps you with
> scaling writes, and master-slave replication with
> scaling reads, there's nothing yet to help with large
> data sets, except for large drives in each node :)
>
> What we (Damien) have in mind for post-1.0 is
> built-in database partitioning/sharding that would
> split up a DB onto multiple nodes and handle
> requests automagically for you. For now, you would
> have to build such a system with an intermediate
> HTTP proxy service that decides which documents
> go where (that is, it would have to do the hashing
> and distribution).
>
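
[The heart of such an interim proxy is just a stable hash from
document id to node. A toy sketch; the node list is invented, and a
real proxy would want consistent hashing so that adding a node does
not reshuffle every document:]

    import hashlib

    NODES = ["http://couch-1.example.com:5984",
             "http://couch-2.example.com:5984",
             "http://couch-3.example.com:5984"]

    def node_for(doc_id):
        # A stable digest (not Python's per-run hash()) decides which
        # node owns a document, so reads and writes for the same id
        # always land on the same node.
        digest = hashlib.md5(doc_id.encode()).digest()
        return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

    print(node_for("wf-0001"))
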
> That said, a few million workflows, both live and
> archived, should not be a problem for a single
> node as far as I can see. You might want to
> slave out reads if more come in than your
> disk I/O can handle. CouchDB should not
> be the bottleneck here. Again, a reverse proxy
> in front of CouchDB should help a great deal.
>
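
[The slave-out-reads idea as routing logic: writes go to the master,
reads spread across replicas the master replicates to. Host names are
invented; in production a reverse proxy would do this same routing:]

    import random

    MASTER = "http://master.example.com:5984"
    SLAVES = ["http://slave-1.example.com:5984",
              "http://slave-2.example.com:5984"]

    def base_for(method):
        # Writes must hit the master so changes originate in one
        # place; any slave can serve a (slightly lagging) read.
        if method in ("PUT", "POST", "DELETE"):
            return MASTER
        return random.choice(SLAVES)
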
> CouchDB is designed to handle a lot (as in
> A LOT) of concurrent requests rather than
> optimising for single-query speed. Single queries
> are still reasonably fast, but an RDBMS is surely
> faster. We can, however, easily handle 10x or more
> the concurrent requests of the average RDBMS
> (for reads, that is).
>
> A little number bragging: the Mac Mini I use
> as a workstation maxes out at around 100 random
> reads per second with MySQL. With CouchDB I can
> get to around 1,000.
>
> That is with an unoptimised CouchDB. CouchDB
> does not yet do any caching and we haven't
> done any profiling, so things will speed up
> significantly before 1.0.
>
>
>> How do you envision data distributed over the world (both reading  
>> and updating)?
>
> Have at least a single master server (and a backup)
> in each location. This makes each location independent
> of the others in case the Internet connection goes away.
>
> Additionally, you can have these master servers
> replicate with each other. So your Japan office
> can operate on the US office's data without round-
> tripping to the US.
>
>
>> If replication, how fast would that replication be (we have fall  
>> over support structures where people in different countries can  
>> help teams in other countries)?
>
> Replication works on the document level at
> the moment (with attribute-level planned for
> the future): only the documents that were
> added, changed or deleted get replicated.
> That is, only diffs are exchanged, and
> eventually all participating DBs end up with the
> same data. How fast that is depends on the amount
> of data, the number of changes and your connection
> speed. At the moment, operations are not batched;
> documents are fetched and applied one at a time,
> which slows us down a bit. We have plans to
> further optimise replication by fixing this
> and other things.
>
>
>> What sort of support structure do you envision for CouchDB?
>
> What do you mean here? Support for the
> 'product' CouchDB? This is open source, so
> you are on your own :) We have a lively, small,
> yet growing community whose members help
> each other out on IRC and our mailing lists.
> Apart from that, I'm a freelancer and available
> for money :) There might be others, too.
>
>
>> You mentioned hitting beta this summer, is that still on track? Do  
>> you expect to hit 1.0 by the end of the year, or something  
>> recommended for production systems? I know these timelines are
>> impossible to really know, but I have to ask. =)
>
> We are still on track with this, yeah, but I won't
> promise anything :) One thing to point out, though,
> is that we carry the alpha label because we don't
> yet have all the features we'd like to see in 1.0.
> Those that are in are relatively stable, and we
> haven't gotten many serious problem reports.
>
>
>> Will authentication and permission schemes be part of 1.0?
>
> We will have validation, that is, a mechanism that
> can deny a write on terms you define. That is not
> exactly authentication, and any permission system
> you'd have to build on top of it. Authentication
> itself will come post-1.0; until then we suggest
> using an HTTP proxy to solve that for you.
>
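
[One more for the list: the shape such validation can take is a
function in a design document that throws to refuse a write. A sketch
modeled on CouchDB's validate_doc_update design-document hook; the
database and field names are invented. The function itself is
JavaScript shipped as a string, uploaded with the same stdlib pattern
as above:]

    import json
    from urllib.request import Request, urlopen

    ddoc = {
        "_id": "_design/guard",
        # Run by CouchDB on every write attempt to this database.
        "validate_doc_update": """
            function (newDoc, oldDoc, userCtx) {
              // Deny a write on terms you define.
              if (newDoc.type === 'workflow' && !newDoc.state) {
                throw({forbidden: 'workflow documents need a state'});
              }
            }""",
    }

    req = Request("http://localhost:5984/workflows/_design/guard",
                  data=json.dumps(ddoc).encode(), method="PUT",
                  headers={"Content-Type": "application/json"})
    print(json.load(urlopen(req)))
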
> [...]

Cheers
Jan
--