Posted to user@couchdb.apache.org by Christopher Bare <cb...@systemsbiology.org> on 2010/09/19 20:37:09 UTC

distributed map-reduce views

Hi Couch-potatoes,

I'm investigating using CouchDB for a data mining application and
could use some advice.

What I have in mind is sharding a collection of documents between
several instances of CouchDB each running on their own nodes. Then, I
want to run distributed map-reduce queries over the whole collection
of documents. Do I understand correctly that Lounge is currently the
way to do this?

How would doing something like this with CouchDB and Lounge compare
with using Hadoop and HBase?

Thanks for any insight!

-Chris

Re: distributed map-reduce views

Posted by Randall Leeds <ra...@gmail.com>.

On Sun, Sep 19, 2010 at 20:37, Christopher Bare
<cb...@systemsbiology.org> wrote:
> Hi Couch-potatoes,
>
> I'm investigating using CouchDB for a data mining application and
> could use some advice.

Cool! Welcome to the party.

>
> What I have in mind is sharding a collection of documents between
> several instances of CouchDB each running on their own nodes. Then, I
> want to run distributed map-reduce queries over the whole collection
> of documents. Do I understand correctly that Lounge is currently the
> way to do this?

Lounge is one way. BigCouch (just released) is another.

>
> How would doing something like this with CouchDB and Lounge compare
> with using Hadoop and HBase?

I do not know that much about HBase/Hadoop. I bet someone else on the
list can add more differences, but I know at least there is a data
model difference: CouchDB uses JSON documents but HBase is column
oriented. Also, if HBase relies on HDFS then I think the HDFS name
node is a single point of failure, whereas you can configure BigCouch
and Lounge with redundancy at every level of the system.

-Randall

Re: distributed map-reduce views

Posted by Robert Newson <ro...@gmail.com>.

patent lawyers are deterministic in my experience. all things
involving computers can be patented as long as your explanation is
impenetrable (assertion void outside the US).

B.

On Mon, Sep 20, 2010 at 10:45 PM, Paul Davis
<pa...@gmail.com> wrote:
> On Mon, Sep 20, 2010 at 5:30 PM, Robert Dionne
> <di...@dionne-associates.com> wrote:
>>
>>
>>
>> On Sep 20, 2010, at 5:13 PM, Paul Davis wrote:
>>
>>> On Mon, Sep 20, 2010 at 5:01 PM, Robert Newson <ro...@gmail.com> wrote:
>>>> A nice explanation. I've never quite known how to respond to people
>>>> that, when I discuss CouchDB with them, say "why not use Hadoop?".
>>>> Admittedly it's mostly because I'm trying to hold back a biting
>>>> comment, since there's really no commonality besides the use of
>>>> (distinct variants of) the Map/Reduce (family of) algorithm(s).
>>>>
>>>> B.
>>>
>>> M/R := Map/Reduce
>>>
>>> Generally, when I hear people comparing CouchDB M/R to Google M/R, I
>>> remind them that Google M/R isn't real M/R.
>>
>> that would explain why they received a patent for it :)
>>
>
> Well, that and patents involve lawyers which are about as
> deterministic as Schrödinger's cat.
>
>>
>>
>>
>>> That generally shocks
>>> people enough that they're able to reconsider the differences and how
>>> the two things really aren't the same. Although if someone disagrees
>>> with you because they read the Google white paper on M/R you should
>>> feel free to make extended use of biting comments.
>>>
>>> HTH,
>>> Paul Davis
>>
>>
>

Re: distributed map-reduce views

Posted by Paul Davis <pa...@gmail.com>.

On Mon, Sep 20, 2010 at 5:30 PM, Robert Dionne
<di...@dionne-associates.com> wrote:
>
>
>
> On Sep 20, 2010, at 5:13 PM, Paul Davis wrote:
>
>> On Mon, Sep 20, 2010 at 5:01 PM, Robert Newson <ro...@gmail.com> wrote:
>>> A nice explanation. I've never quite known how to respond to people
>>> that, when I discuss CouchDB with them, say "why not use Hadoop?".
>>> Admittedly it's mostly because I'm trying to hold back a biting
>>> comment, since there's really no commonality besides the use of
>>> (distinct variants of) the Map/Reduce (family of) algorithm(s).
>>>
>>> B.
>>
>> M/R := Map/Reduce
>>
>> Generally, when I hear people comparing CouchDB M/R to Google M/R, I
>> remind them that Google M/R isn't real M/R.
>
> that would explain why they received a patent for it :)
>

Well, that and patents involve lawyers which are about as
deterministic as Schrödinger's cat.

>
>
>
>> That generally shocks
>> people enough that they're able to reconsider the differences and how
>> the two things really aren't the same. Although if someone disagrees
>> with you because they read the Google white paper on M/R you should
>> feel free to make extended use of biting comments.
>>
>> HTH,
>> Paul Davis
>
>

Re: distributed map-reduce views

Posted by Robert Dionne <di...@dionne-associates.com>.



On Sep 20, 2010, at 5:13 PM, Paul Davis wrote:

> On Mon, Sep 20, 2010 at 5:01 PM, Robert Newson <ro...@gmail.com> wrote:
>> A nice explanation. I've never quite known how to respond to people
>> that, when I discuss CouchDB with them, say "why not use Hadoop?".
>> Admittedly it's mostly because I'm trying to hold back a biting
>> comment, since there's really no commonality besides the use of
>> (distinct variants of) the Map/Reduce (family of) algorithm(s).
>> 
>> B.
> 
> M/R := Map/Reduce
> 
> Generally, when I hear people comparing CouchDB M/R to Google M/R, I
> remind them that Google M/R isn't real M/R.

that would explain why they received a patent for it :)




> That generally shocks
> people enough that they're able to reconsider the differences and how
> the two things really aren't the same. Although if someone disagrees
> with you because they read the Google white paper on M/R you should
> feel free to make extended use of biting comments.
> 
> HTH,
> Paul Davis

Re: distributed map-reduce views

Posted by Paul Davis <pa...@gmail.com>.

On Mon, Sep 20, 2010 at 5:01 PM, Robert Newson <ro...@gmail.com> wrote:
> A nice explanation. I've never quite known how to respond to people
> that, when I discuss CouchDB with them, say "why not use Hadoop?".
> Admittedly it's mostly because I'm trying to hold back a biting
> comment, since there's really no commonality besides the use of
> (distinct variants of) the Map/Reduce (family of) algorithm(s).
>
> B.

M/R := Map/Reduce

Generally, when I hear people comparing CouchDB M/R to Google M/R, I
remind them that Google M/R isn't real M/R. That generally shocks
people enough that they're able to reconsider the differences and how
the two things really aren't the same. Although if someone disagrees
with you because they read the Google white paper on M/R you should
feel free to make extended use of biting comments.

HTH,
Paul Davis

Re: distributed map-reduce views

Posted by Robert Newson <ro...@gmail.com>.

A nice explanation. I've never quite known how to respond to people
that, when I discuss CouchDB with them, say "why not use Hadoop?".
Admittedly it's mostly because I'm trying to hold back a biting
comment, since there's really no commonality besides the use of
(distinct variants of) the Map/Reduce (family of) algorithm(s).

B.

On Mon, Sep 20, 2010 at 9:51 PM, Paul Davis <pa...@gmail.com> wrote:
>> How would doing something like this with CouchDB and Lounge compare
>> with using Hadoop and HBase?
>
> Remember that CouchDB and Hadoop serve different purposes. CouchDB is
> a data store, whereas Hadoop is a data processing platform. While
> they both have "MapReduce" functionality they aren't quite the same
> thing.
>
> In CouchDB, when we use Map/Reduce, we create a single persistent
> index of data using map and reduce operators. These indexes can then
> be queried using single key or range lookups. Because of the
> properties of Map/Reduce we're capable of updating these indexes
> incrementally.
>
> Hadoop on the other hand is meant to handle arbitrary pipelines of
> data processing. I.e., users can configure Hadoop to run multiple stages
> of Map/Reduce in order to produce some desired output. The
> intermediate stages are not intended to be persistent and query-able.
> I'm not familiar enough to know how people use HBase in conjunction
> with Hadoop other than I believe it's generally a data source. I don't
> know if it stores intermediate results or not. As far as I know,
> Hadoop doesn't provide incremental indexing.
>
> As Randall points out, there are various differences in implementation,
> but it's also important to understand the data store vs. data
> processing differences of the two systems.
>
> HTH,
> Paul Davis
>

Re: distributed map-reduce views

Posted by Paul Davis <pa...@gmail.com>.

> How would doing something like this with CouchDB and Lounge compare
> with using Hadoop and HBase?

Remember that CouchDB and Hadoop serve different purposes. CouchDB is
a data store, whereas Hadoop is a data processing platform. While
they both have "MapReduce" functionality they aren't quite the same
thing.

In CouchDB, when we use Map/Reduce, we create a single persistent
index of data using map and reduce operators. These indexes can then
be queried using single key or range lookups. Because of the
properties of Map/Reduce we're capable of updating these indexes
incrementally.
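
As a rough illustration of the paragraph above, here is a minimal Python
sketch of a persistent map/reduce index that supports key and range
lookups and is updated one document at a time. It is a toy model only:
the ViewIndex class and its method names are hypothetical, and a sorted
in-memory list stands in for whatever on-disk structure CouchDB really
uses.

```python
from bisect import bisect_left, bisect_right


class ViewIndex:
    """Toy model of a persistent map/reduce view (hypothetical, not CouchDB code)."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn        # doc -> iterable of (key, value) pairs
        self.reduce_fn = reduce_fn  # list of values -> single value
        self.rows = []              # kept sorted: (key, doc_id, value)
        self.by_doc = {}            # doc_id -> rows that document emitted

    def update(self, doc_id, doc):
        """Re-index one changed document; rows from other documents are untouched."""
        for row in self.by_doc.pop(doc_id, []):
            self.rows.remove(row)
        new_rows = [(key, doc_id, value) for key, value in self.map_fn(doc)]
        for row in new_rows:
            self.rows.insert(bisect_left(self.rows, row), row)
        self.by_doc[doc_id] = new_rows

    def query(self, startkey, endkey):
        """Return (key, value) pairs with startkey <= key <= endkey."""
        keys = [row[0] for row in self.rows]
        lo = bisect_left(keys, startkey)
        hi = bisect_right(keys, endkey)
        return [(key, value) for key, _doc_id, value in self.rows[lo:hi]]

    def reduce(self, startkey, endkey):
        """Apply the reduce function to the values in a key range."""
        return self.reduce_fn([value for _key, value in self.query(startkey, endkey)])


# Index documents by their "type" field and sum their "n" field.
index = ViewIndex(map_fn=lambda doc: [(doc["type"], doc["n"])], reduce_fn=sum)
index.update("a", {"type": "gene", "n": 2})
index.update("b", {"type": "protein", "n": 5})
index.update("c", {"type": "gene", "n": 3})
print(index.reduce("gene", "gene"))   # 5

# Changing one document only replaces that document's rows.
index.update("a", {"type": "gene", "n": 10})
print(index.reduce("gene", "gene"))   # 13
```

The point of the sketch is that the index outlives any single query and
that an update touches only the rows emitted by the changed document.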

Hadoop on the other hand is meant to handle arbitrary pipelines of
data processing. I.e., users can configure Hadoop to run multiple stages
of Map/Reduce in order to produce some desired output. The
intermediate stages are not intended to be persistent and query-able.
I'm not familiar enough to know how people use HBase in conjunction
with Hadoop other than I believe it's generally a data source. I don't
know if it stores intermediate results or not. As far as I know,
Hadoop doesn't provide incremental indexing.
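
For contrast, a batch pipeline of chained map/reduce stages might be
sketched like this in plain Python. This is not Hadoop's API: run_stage
and run_pipeline are hypothetical helpers, everything runs in one
process, and the sample data is made up. It only illustrates that each
stage's output feeds the next and that intermediate results are thrown
away rather than kept as a queryable index.

```python
from collections import defaultdict


def run_stage(records, map_fn, reduce_fn):
    """Run one map/reduce pass: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [(key, reduce_fn(key, values)) for key, values in sorted(groups.items())]


def run_pipeline(records, stages):
    """Chain stages; each intermediate result is replaced by the next stage's output."""
    for map_fn, reduce_fn in stages:
        records = run_stage(records, map_fn, reduce_fn)
    return records


# Stage 1: count occurrences of each word.
def map_words(line):
    return [(word, 1) for word in line.split()]


def sum_values(_key, values):
    return sum(values)


# Stage 2: count how many distinct words share each occurrence count.
def map_counts(word_and_count):
    _word, count = word_and_count
    return [(count, 1)]


lines = ["a b a", "b c"]
result = run_pipeline(lines, [(map_words, sum_values), (map_counts, sum_values)])
print(result)   # [(1, 1), (2, 2)]
```

Rerunning the pipeline after the input changes recomputes every stage
from scratch, which is the opposite of the incremental index update
described for CouchDB views.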

As Randall points out, there are various differences in implementation,
but it's also important to understand the data store vs. data
processing differences of the two systems.

HTH,
Paul Davis