Posted to dev@couchdb.apache.org by Rob Stewart <ro...@googlemail.com> on 2010/06/01 16:43:20 UTC

CouchDB Partitioning Proposal

Hi, CouchDB devs.

This email is to offer my time and services to the CouchDB project. I am
specifically interested in the proposal set out at:
http://wiki.apache.org/couchdb/Partitioning_proposal

I notice that it was last updated in September, 2009.

First of all, though, I'll share my background with you, and why I want to
do this.

Having completed a Masters in Software Engineering in the UK, I am about to
commence work towards a PhD. My Masters project was fundamentally a
performance and programming comparison between the MapReduce high level
languages: Pig, Hive and JAQL. If you so wish, the paper can be found here:
http://www.macs.hw.ac.uk/~rs46/publications.html

My PhD supervisor has previously worked on GpH (Glasgow Parallel Haskell),
and I am in the first months of my PhD. He has urged me to look into Erlang
implementations, suggesting CouchDB as an excellent example. Having looked
into distributed processing in my dissertation (namely the MapReduce
implementation found in Hadoop), the area I am most interested in within the
CouchDB software is the partitioning of a database across multiple (possibly
many) nodes in a cluster.

So, my first, naive, questions might be something like:
1. Is the problem described on the CouchDB wiki page (database partitioning)
still a valid one?
2. If so, are there plans underway to solve this problem? I notice that there
was a proposal for the Google Summer of Code in 2009 to provide a solution:
http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster .
3. If this is still an open problem for the CouchDB dev team, how would one
get involved in the design of a partitioning architecture for CouchDB?

If this is no longer a valid problem to solve, I would remain keen to use
CouchDB as a platform on which to develop a system as part of my PhD work.
Is the "CouchDB proposals" page on the wiki still valid? Any other comments,
or suggestions, would be greatly appreciated at this early stage.

Regards,

Rob Stewart
http://www.macs.hw.ac.uk/~rs46/

Re: CouchDB Partitioning Proposal

Posted by Randall Leeds <ra...@gmail.com>.
On Tue, Jun 1, 2010 at 07:43, Rob Stewart <ro...@googlemail.com> wrote:
> Hi, CouchDB devs.

Hi, Rob. Welcome!

>
> So, my first, naive, questions might be something like:
> 1. Is the proposal mentioned in the CouchDB wiki page still a valid problem
> (database partitioning).

Yes. That was an easy answer! :)

> 2. If so,  are the plans underway to solve this problem? I notice that there
> was a proposal for the Google Summer of Code in 2009 to provide a solution:
> http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster .

That's me! Since the time I wrote that proposal I've gone to work for
Meebo as one of the developers of the Lounge project, the canonical
source of which is my repository on github[1]. The Lounge deviates
from my GSoC proposal and the one outlined on the wiki, though. As in
the wiki proposal, the Lounge uses a tree-like structure of CouchDB
databases created through a proxy layer that handles the hashing and
distribution of keys. However, unlike in both proposals, the proxies work
at the HTTP layer and do not communicate via Erlang message passing.
This solution incurs the cost of extra JSON overhead in exchange for
keeping the software relatively simple and completely separate from
CouchDB itself.
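
As a rough illustration of what "a proxy layer that handles the hashing and
distribution of keys" over HTTP can look like, here is a minimal Python
sketch. It is not the Lounge's actual code or scheme: the CRC32 hash, the
shard map, the node URLs and the "database name plus shard index" naming
are all assumptions made up for this example.

```python
import zlib
from urllib.parse import quote

# Hypothetical shard map: position i holds the base URL of the CouchDB
# node that stores shard i. A fixed number of shards is assumed.
SHARD_MAP = [
    "http://node1.example:5984",
    "http://node1.example:5984",
    "http://node2.example:5984",
    "http://node2.example:5984",
]


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index with a stable hash (CRC32 here)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def shard_url(db: str, doc_id: str) -> str:
    """Build the URL a proxy would forward a document request to.

    The shard database name (db + shard index) is an illustrative
    convention, not something CouchDB or the Lounge prescribes.
    """
    idx = shard_for(doc_id, len(SHARD_MAP))
    return f"{SHARD_MAP[idx]}/{db}{idx}/{quote(doc_id, safe='')}"


if __name__ == "__main__":
    for doc_id in ("user:alice", "user:bob", "order/42"):
        print(doc_id, "->", shard_url("mydb", doc_id))
```

Because the shard count is baked into the modulo, changing it remaps most
document IDs, which is one reason a fixed number of shards has to be chosen
up front in a scheme like this.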

In addition to the Lounge, Cloudant[2] is offering clustered CouchDB
hosting using in-house modifications to the CouchDB code. I cannot
speak authoritatively on their work so I won't try to compare it to
the Lounge other than to say that I believe it is written in Erlang.
For this reason, it's possible that pieces of their system could wind up
in CouchDB some day if they decide to license it for inclusion.

I've had a few discussions with Benoît Chesneau about implementing an
Erlang solution, but as I recall it mostly revolved around what
architectural changes we'd want to see to the internal APIs to make
the addition of partitioning as clean as possible. Little to no code
has been produced to this end on our part, though Paul Davis has done
a little bit of hacking[3] toward separating the HTTP layer more
cleanly while replacing MochiWeb with Basho's webmachine[4].

Finally, I've toyed with the idea of re-implementing the Lounge using
Node.js[5], and Robert Newson has recently started to hack on it as well.
There is some (mostly useless so far) code on github[6].

> 3. If this is still an open problem for the CouchDB dev team, how would one
> get involved in the design of a partitioning architecture for CouchDB ?

Since there has been no consensus on the best way to go forward, there
is clearly room for different approaches and for several projects that
fulfill different requirements. For my part, I help maintain the
Lounge for the day-to-day operations at Meebo. However, I would like
to see a project that tackles CouchDB clustering with a peer-to-peer
structure instead of a fixed tree, eliminating the operational
headache of manually distributing a fixed number of shards and taking
some lessons from Dynamo, Cassandra and Riak. My work on Lode[6] is
mostly stalled while I hack on a structured overlay project for
Node.js, though I haven't released any source for it.
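
One of the lessons usually taken from Dynamo-style systems is consistent
hashing: nodes and keys are placed on the same hash ring, and each key
belongs to the next node position clockwise, so adding or removing a node
only moves the keys adjacent to it. The Python sketch below shows the general
technique under stated assumptions. It does not describe Lode, the Lounge
or any existing CouchDB code; the class name, the use of SHA-256 and the
number of virtual nodes are all choices made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self._vnodes = vnodes   # ring positions per physical node
        self._keys = []         # sorted ring positions
        self._owners = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        # Take the first 8 bytes of SHA-256 as a 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:
                continue  # extremely unlikely collision: keep existing owner
            self._owners[pos] = node
            bisect.insort(self._keys, pos)

    def remove_node(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                self._keys.remove(pos)

    def node_for(self, doc_id: str) -> str:
        """Return the node owning doc_id: the next position clockwise."""
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._keys, self._hash(doc_id))
        return self._owners[self._keys[idx % len(self._keys)]]


if __name__ == "__main__":
    ring = HashRing(["couch-a", "couch-b", "couch-c"])
    docs = [f"doc-{i}" for i in range(1000)]
    before = {d: ring.node_for(d) for d in docs}
    ring.add_node("couch-d")
    moved = sum(1 for d in docs if ring.node_for(d) != before[d])
    print(f"{moved} of {len(docs)} documents moved after adding a node")
```

Unlike a fixed modulo over the shard count, keys that move when a node
joins move only to the new node, which is what removes the need to
redistribute a fixed set of shards by hand.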

To get involved, keep the conversation going here or come to #couchdb
on freenode. Everyone I mentioned tends to frequent that channel. My
nick is the same as my github account: tilgovi.

I think that's a good overview of the state of CouchDB partitioning
solutions. Bring on the questions and discussion!

Kind regards,
Randall

[1] http://github.com/tilgovi/couchdb-lounge
[2] https://cloudant.com/
[3] http://github.com/davisp/couchdb/tree/webmachine
[4] http://webmachine.basho.com/
[5] http://nodejs.org
[6] http://github.com/tilgovi/lode