Posted to user@couchdb.apache.org by Rob Stewart <ro...@googlemail.com> on 2009/10/07 14:53:02 UTC

couchdb Comparison

Hi all.
Two things about me:
1. I am new to couchdb (and erlang)
2. I am doing a Masters in Software Engineering, and for my dissertation
study, I am keen to give comparisons of distributed systems.

At my university, I have a cluster that I am going to be able to use for my
research. I am intending to deploy Hadoop across the cluster.

I have some ideas about what I want to analyse in detail, although I
really need some sound advice from you guys as to what is relevant,
interesting, and feasible. I have had a look at Pig, the abstraction layer
on top of Hadoop, which is, for those of you who are less familiar with it,
ideal for data log processing (web server access log files and so forth).
Pig taps into the Hadoop cluster, and intelligently distributes the
execution and data across the DataNodes.

So... CouchDB. Over the last few days I have looked into CouchDB: what
its ideal functions are and, more importantly, how it can be distributed
across a cluster for comparative analysis. Unfortunately, what I have found
is a mixed bag. As far as I can tell, there is no de facto way to
distribute both the data *and* the execution of a CouchDB job seamlessly
across a cluster. Instead, the existing approaches rely on a proxy
server. See the Google Summer of Code project for a much better explanation
of CouchDB's lack of formal distribution features
( http://socghop.appspot.com/document/show/user/rleeds/couchdb_cluster )

So here's my question, or need for advice. What would be the best way to
do a relevant comparative study of Pig (or another Hadoop package if
necessary, e.g. HBase) against CouchDB in its current form? Ideally I
would want to compare the performance of the two when each has full use of
the university cluster, although with CouchDB's current limitations I
realise this may not be possible.

Failing that, is it unfair to compare the performance of CouchDB running
on a single server against the computational performance of Pig running
on a Hadoop cluster, or is that a reasonable test?

Finally, the dumbest question of all. I realise that CouchDB is a
document-based database system. But what tools does a user have at their
disposal to manipulate the returned data? E.g. in a typical DBMS, one could
write "select * from CarList list where list.colour='blue'", or perhaps a
SQL join statement, or a filter statement similar to the Pig filter command.
So is there functionality within CouchDB for this sort of record
manipulation, or do I need to code in some other format to create these
sorts of functions?


Many thanks. I look forward to following this mailing list over the course
of the year, and indeed to my study over the coming year, wherever that may
take me.

Thanks

Rob Stewart

Re: couchdb Comparison

Posted by Rob Stewart <ro...@googlemail.com>.

Hi Randall,

thanks for getting in touch. I hope you're still able to contribute to
CouchDB in some way or another. Maybe the community isn't ready to
commit itself to a de facto, formalised way to distribute the execution of
CouchDB just yet? (lol)

I will keep you up to date with my progress. I am certainly approaching my
project as a parallel distribution problem, as opposed to a DBMS-exclusive
project, and I have a university cluster at my disposal.

@ Jesse - You confirmed some of my suspicions about CouchDB with regard to
its mission, its scalability and its similarity to a distributed system such
as Hadoop. It is very useful to be aware of the explicit map-reduce nature
of CouchDB, and it is certainly not something that will be overlooked in my
study (map-reduce has a vital role in Hadoop; it is the very core
of the distribution of processing and data).  Perhaps, in a time not so far
away, there could be a study on the scalability and parallel performance of
CouchDB, where CouchDB offers a developer these things for free ! (?)

Rob

2009/10/7 Jesse Hallett <ha...@gmail.com>

> One issue is that Hadoop and CouchDB are very different tools.
>
> Hadoop is great at intensive, high-latency data analysis.  It doesn't matter
> how complicated the computation you want is - Hadoop will do it for you
> because it is a data processing engine.
>
> CouchDB is a database.  It is designed for low-latency, high-availability
> operations.  CouchDB is not a data processing engine, it is a data retrieval
> engine.  It should be faster than Hadoop for tasks that both systems can
> handle; and CouchDB can perform some powerful analysis via its map-reduce
> capability.  But the analysis you can perform with CouchDB will ultimately
> be limited by its low-latency design philosophy.
>
> What can be misleading is that while both Hadoop and CouchDB use map-reduce,
> they use it for very different things.  It is analogous to saying "these two
> programs both use iteration over tree structures."  One detail on choice of
> algorithm does not tell you what a program is designed for or what it is
> good at.
>
> CouchDB uses map-reduce to build pre-computed views of data.  The map-reduce
> pattern enforces data isolation which allows CouchDB to incrementally update
> views.  CouchDB does not (yet) take advantage of parallel processing when
> generating views.  Though you can get parallelism by distributing data over
> a cluster and splitting queries with a proxy.
>
> Hadoop uses map-reduce to run computation in parallel and to distribute
> computation across multiple machines.  The same data isolation that CouchDB
> relies on allows this.  But Hadoop takes advantage of that feature
> differently.
>
> On Oct 7, 2009 7:29 AM, "Göran Krampe" <go...@krampe.se> wrote:
>
> Nicholas Orr wrote:
> > On Wed, Oct 7, 2009 at 11:53 PM, Rob Stewart
> > <robstewart57@googlemail.com>...
> Using an intermediate library in your language of choice you can get queries
> etc to look rather similar, take a look at this C# example program for using
> Divan:
>
> http://github.com/gokr/Divan/blob/master/samples/Trivial/Program.cs
>
> ...funny enough it also uses "Cars" as an example :). Note the LINQ
> integration which actually makes it possible to write:
>
> var fastCars = from c in linqCars where c.HorsePowers >= 175 select c;
>
> (given a view for it)
>
> regards, Göran
>

Re: couchdb Comparison

Posted by Jesse Hallett <ha...@gmail.com>.

One issue is that Hadoop and CouchDB are very different tools.

Hadoop is great at intensive, high-latency data analysis.  It doesn't matter
how complicated the computation you want is - Hadoop will do it for you
because it is a data processing engine.

CouchDB is a database.  It is designed for low-latency, high-availability
operations.  CouchDB is not a data processing engine, it is a data retrieval
engine.  It should be faster than Hadoop for tasks that both systems can
handle; and CouchDB can perform some powerful analysis via its map-reduce
capability.  But the analysis you can perform with CouchDB will ultimately
be limited by its low-latency design philosophy.

What can be misleading is that while both Hadoop and CouchDB use map-reduce,
they use it for very different things.  It is analogous to saying "these two
programs both use iteration over tree structures."  One detail about the
choice of algorithm does not tell you what a program is designed for or what
it is good at.

CouchDB uses map-reduce to build pre-computed views of data.  The map-reduce
pattern enforces data isolation, which allows CouchDB to incrementally update
views.  CouchDB does not (yet) take advantage of parallel processing when
generating views, though you can get parallelism by distributing data over
a cluster and splitting queries with a proxy.
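
To make the point about data isolation concrete, here is a minimal in-memory sketch in Python of why an isolated map step allows incremental updates. It is only a toy model under stated assumptions: the class `IncrementalView`, the function `map_doc` and the sample documents are all made up for illustration, and this is not how CouchDB stores its view indexes.

```python
# Toy model of an incrementally maintained map view.
# All names here are hypothetical; this is not CouchDB's implementation.

def map_doc(doc):
    """Map step: sees exactly one document and yields (key, value) rows."""
    if "colour" in doc:
        yield doc["colour"], doc["model"]


class IncrementalView:
    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows_by_doc = {}  # doc id -> rows that this document produced
        self.map_calls = 0     # counts how much mapping work has been done

    def update(self, doc):
        """Re-map one changed document; rows from other documents are untouched."""
        self.map_calls += 1
        self.rows_by_doc[doc["_id"]] = list(self.map_fn(doc))

    def rows(self):
        """All rows sorted by key, the way a view index is ordered."""
        return sorted(row for rows in self.rows_by_doc.values() for row in rows)


view = IncrementalView(map_doc)
for doc in [
    {"_id": "a", "model": "Golf", "colour": "blue"},
    {"_id": "b", "model": "Mini", "colour": "red"},
    {"_id": "c", "model": "Saab", "colour": "blue"},
]:
    view.update(doc)

# One document changes: only that document is mapped again.
view.update({"_id": "b", "model": "Mini", "colour": "green"})
print(view.rows())
print(view.map_calls)
```

Because `map_doc` can only see the single document it is given, replacing the rows for document "b" cannot invalidate the rows produced by "a" or "c", so the update costs one map call rather than three.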

Hadoop uses map-reduce to run computation in parallel and to distribute
computation across multiple machines.  The same data isolation that CouchDB
relies on allows this.  But Hadoop takes advantage of that feature
differently.
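
The other use of the same property can be sketched as follows: since each document is mapped in isolation, the documents can be split into partitions, mapped independently, and merged by a reduce step. This is an assumption-laden illustration rather than Hadoop code. Threads stand in for cluster nodes, and the helper names `map_partition` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_doc(doc):
    """Map step: emit (colour, 1) for a single document."""
    yield doc["colour"], 1


def map_partition(docs):
    """Map one partition on its own and pre-aggregate the emitted rows."""
    counts = Counter()
    for doc in docs:
        for key, value in map_doc(doc):
            counts[key] += value
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


docs = [{"colour": c} for c in ["blue", "red", "blue", "green", "blue", "red"]]
partitions = [docs[i::3] for i in range(3)]  # three "nodes", two docs each

# Threads stand in for machines here; a real cluster also has to move data,
# schedule tasks and cope with failures, none of which is modelled.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce_counts(partials)
serial = reduce_counts([map_partition(docs)])
print(totals == serial)
```

The partitioned run and the serial run agree because no map call depends on any other document, which is the isolation both systems rely on.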

On Oct 7, 2009 7:29 AM, "Göran Krampe" <go...@krampe.se> wrote:

Nicholas Orr wrote:
> On Wed, Oct 7, 2009 at 11:53 PM, Rob Stewart
> <robstewart57@googlemail.com>...
Using an intermediate library in your language of choice you can get queries
etc to look rather similar, take a look at this C# example program for using
Divan:

http://github.com/gokr/Divan/blob/master/samples/Trivial/Program.cs

...funny enough it also uses "Cars" as an example :). Note the LINQ
integration which actually makes it possible to write:

var fastCars = from c in linqCars where c.HorsePowers >= 175 select c;

(given a view for it)

regards, Göran

Re: couchdb Comparison

Posted by Göran Krampe <go...@krampe.se>.

Nicholas Orr wrote:
> On Wed, Oct 7, 2009 at 11:53 PM, Rob Stewart
> <ro...@googlemail.com> wrote:
>> <snip>
>> Finally, the dumbest question of all. I realise that CouchDB is a
>> document-based database system. But what tools does a user have at their
>> disposal to manipulate the returned data? E.g. in a typical DBMS, one could
>> write "select * from CarList list where list.colour='blue'", or perhaps a
>> SQL join statement, or a filter statement similar to the Pig filter command.
>> So is there functionality within CouchDB for this sort of record
>> manipulation, or do I need to code in some other format to create these
>> sorts of functions?
> 
> You need to create "views" via a Map/Reduce function:
> http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
> Where an RDBMS like MySQL/Oracle/MSSQL has SQL, CouchDB has
> Map/Reduce to query the data.

Using an intermediate library in your language of choice, you can get
queries etc. to look rather similar. Take a look at this C# example
program for using Divan:

http://github.com/gokr/Divan/blob/master/samples/Trivial/Program.cs

...funnily enough it also uses "Cars" as an example :). Note the LINQ
integration, which actually makes it possible to write:

var fastCars = from c in linqCars where c.HorsePowers >= 175 select c;

(given a view for it)

regards, Göran
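
As a rough illustration of what "given a view for it" implies, the Python sketch below builds a sorted index keyed on horsepower and answers the `>= 175` condition as a range scan from a start key, which is roughly what a view query with a `startkey` parameter does. The documents, the `horse_powers` field name and the `query` helper are assumptions for this sketch; it does not use Divan or talk to CouchDB.

```python
import bisect

# Hypothetical sample documents; the field names are made up for this sketch.
cars = [
    {"_id": "c1", "make": "Saab", "horse_powers": 210},
    {"_id": "c2", "make": "Mini", "horse_powers": 120},
    {"_id": "c3", "make": "Golf", "horse_powers": 175},
    {"_id": "c4", "make": "Fiat", "horse_powers": 90},
]


def map_doc(doc):
    """Map step: key each car by its horsepower."""
    if "horse_powers" in doc:
        yield doc["horse_powers"], doc["_id"]


# The "view": every emitted row, kept sorted by key.
index = sorted(row for doc in cars for row in map_doc(doc))


def query(index, startkey):
    """Return all rows whose key is >= startkey using a binary search."""
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, startkey)
    return index[start:]


fast_cars = query(index, 175)
print(fast_cars)
```

The condition is answered from the pre-sorted index rather than by scanning every document, which is why the view has to exist before the query can be written.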

Re: couchdb Comparison

Posted by Nicholas Orr <ni...@zxgen.net>.

On Wed, Oct 7, 2009 at 11:53 PM, Rob Stewart
<ro...@googlemail.com> wrote:
> <snip>
> Finally, the dumbest question of all. I realise that CouchDB is a
> document-based database system. But what tools does a user have at their
> disposal to manipulate the returned data? E.g. in a typical DBMS, one could
> write "select * from CarList list where list.colour='blue'", or perhaps a
> SQL join statement, or a filter statement similar to the Pig filter command.
> So is there functionality within CouchDB for this sort of record
> manipulation, or do I need to code in some other format to create these
> sorts of functions?

You need to create "views" via a Map/Reduce function:
http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views
Where an RDBMS like MySQL/Oracle/MSSQL has SQL, CouchDB has
Map/Reduce to query the data.

Nick
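
For readers coming from SQL, the following Python sketch mimics how the "blue cars" select from the original question could be expressed as a map view. It is an in-memory illustration only: in CouchDB itself the map function is normally written in JavaScript, calls `emit(key, value)`, and lives in a design document, and the view is then queried over HTTP with a `key` parameter. The `type` field, the sample data and `build_view` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are made up for this sketch.
car_list = [
    {"_id": "1", "type": "car", "model": "Golf", "colour": "blue"},
    {"_id": "2", "type": "car", "model": "Mini", "colour": "red"},
    {"_id": "3", "type": "car", "model": "Saab", "colour": "blue"},
    {"_id": "4", "type": "owner", "name": "Rob"},
]


def map_doc(doc):
    """Map step: emit each car keyed by colour; non-car documents emit nothing."""
    if doc.get("type") == "car":
        yield doc["colour"], doc


def build_view(docs, map_fn):
    """Group every emitted value under its key, like a pre-computed view."""
    view = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            view[key].append(value)
    return view


by_colour = build_view(car_list, map_doc)

# Rough equivalent of: select * from CarList list where list.colour='blue'
blue_cars = by_colour["blue"]
print([car["model"] for car in blue_cars])
```

The WHERE clause turns into the choice of key: the view is built once keyed on colour, and each query is then a lookup by key rather than a fresh scan.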