Posted to user@couchdb.apache.org by Christopher Bare <cb...@systemsbiology.org> on 2010/09/28 22:40:33 UTC

CouchDB for data mining

Hi,

I'm looking into CouchDB for a data mining application. I'm a noob, so
I'm just getting an appreciation for the new (and very creative)
approach taken with Couch. Please let me first verify that I have a
few things straight:

A view is a lot more like an index than a query in SQL terms. The keys
emitted from the mapper are used to construct a b-tree. Aggregate
values computed in the reducer may be hung on the higher nodes of the
tree. Constructing this tree is an expensive operation, but read
access is fast and it can be updated incrementally as the underlying
data changes. (Baron Schwartz's A Gentle Introduction to CouchDB for
Relational Practitioners explains this nicely.)
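
To make the "view as index" idea concrete, here is a minimal Python
sketch. It is not CouchDB code: a sorted list stands in for the b-tree,
and the names map_doc and SortedView are hypothetical, chosen only for
illustration. It shows that the expensive step (building the sorted
rows) happens once, after which key-range reads are binary searches.

```python
import bisect


def map_doc(doc):
    """Hypothetical map function: emit one (key, value) row per purchased item."""
    for item in doc["items"]:
        yield item, 1


class SortedView:
    """Stand-in for a view index: rows kept sorted by key (a real view uses a b-tree)."""

    def __init__(self, docs):
        # Expensive step: run the map function over every document and sort the rows.
        self.rows = sorted(
            (key, value, doc["_id"]) for doc in docs for key, value in map_doc(doc)
        )
        self.keys = [row[0] for row in self.rows]

    def query(self, startkey, endkey):
        """Cheap step: return rows with startkey <= key <= endkey via binary search."""
        lo = bisect.bisect_left(self.keys, startkey)
        hi = bisect.bisect_right(self.keys, endkey)
        return self.rows[lo:hi]


docs = [
    {"_id": "b1", "items": ["beer", "diapers"]},
    {"_id": "b2", "items": ["milk"]},
    {"_id": "b3", "items": ["beer", "chips"]},
]
view = SortedView(docs)
beer_rows = view.query("beer", "beer")
```

Here beer_rows holds the two rows emitted for "beer", found without
rescanning the documents.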

A view is formulated using the map-reduce (MR) pattern, which
essentially divides a big job into lots of small independent subtasks.
In Hadoop and Google's MR, that independence is used for parallelism
in a distributed environment. Couch's use of MR is very different. I'm
not sure how parallelism comes into play in Couch, but it seems to me
a key feature of Couch is that the independence of MR is exploited to
compute and cache partial results in the b-tree and to update them
incrementally.
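
The caching of partial results can be sketched in a few lines of
Python. This is an illustration under simplifying assumptions, not
CouchDB's implementation: fixed-size chunks stand in for b-tree leaf
nodes, there is a single level of re-reduce, and the class name
CachedReduce is hypothetical. The reduce function mirrors the
(keys, values, rereduce) shape of a CouchDB reduce function.

```python
def reduce_fn(keys, values, rereduce):
    """Sum reducer. With rereduce=True the values are earlier partial sums."""
    return sum(values)


class CachedReduce:
    """Keeps one cached partial result per chunk of values."""

    def __init__(self, values, chunk_size=2):
        self.chunk_size = chunk_size
        self.chunks = [
            list(values[i:i + chunk_size]) for i in range(0, len(values), chunk_size)
        ]
        # One cached partial per chunk, computed once up front.
        self.partials = [reduce_fn(None, chunk, False) for chunk in self.chunks]

    def update(self, index, new_value):
        """Change one value and recompute only the partial for its chunk."""
        chunk_no, offset = divmod(index, self.chunk_size)
        self.chunks[chunk_no][offset] = new_value
        self.partials[chunk_no] = reduce_fn(None, self.chunks[chunk_no], False)

    def total(self):
        """Combine the cached partials with a re-reduce step."""
        return reduce_fn(None, self.partials, True)


cache = CachedReduce([1, 2, 3, 4, 5])
before = cache.total()
cache.update(2, 10)  # only the chunk holding index 2 is re-reduced
after = cache.total()
```

Because each chunk can be reduced independently, a change to one value
touches one partial and the final combine step, not the whole data set.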

The target here is the "shit-loads of users" scenario where the cost
of building and maintaining the view can be amortized over lots of
read operations.

Now, if that's all more-or-less right, how does that apply to data mining?

In a data mining app, you typically have lots of ad-hoc queries.
You'll read that Couch doesn't do ad-hoc queries, but I have a feeling
that, if you're smart about it, you can create views that will serve
as the basis for whole classes of queries. The view will do part of
the work and your client code will have to do part as well. I haven't
quite gotten my head around how this is done, nor around how Couch's
list functions might fit into the picture.
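
One way a single view might serve a whole class of queries is a
compound key combined with grouping on a key prefix, which is what
CouchDB's group_level query parameter does for array keys. The Python
sketch below only imitates that behaviour in memory; the document
fields (region, gender, items) and the function names are assumptions
made up for the example.

```python
from collections import Counter


def map_basket(doc):
    """Hypothetical map function: emit a compound key (region, gender, item)."""
    for item in doc["items"]:
        yield (doc["region"], doc["gender"], item), 1


def grouped_counts(docs, group_level):
    """Sum emitted values per key prefix, imitating a grouped reduce."""
    counts = Counter()
    for doc in docs:
        for key, value in map_basket(doc):
            counts[key[:group_level]] += value
    return dict(counts)


baskets = [
    {"region": "PNW", "gender": "M", "items": ["beer", "diapers"]},
    {"region": "PNW", "gender": "F", "items": ["diapers"]},
    {"region": "SW", "gender": "M", "items": ["beer"]},
]
by_region = grouped_counts(baskets, 1)
by_region_gender = grouped_counts(baskets, 2)
```

The same emitted rows answer "items sold per region" and "items sold
per region and gender". Grouping only works on a prefix of the key, so
a question such as "items sold per gender across all regions" would
need either a second view with a different key order or some summing
in client code.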

It would be great to have an example data mining app for Couch. The
classic textbook example is co-occurrence of items in a large database
of grocery store shopping baskets. You ask questions like, "If a
customer buys diapers, do they also buy beer?" It will come as little
surprise to any new parents that, in fact, they do. In this case,
your documents would consist of a set of purchased items and
associated information like customer demographics, geographic
information, sales and promotions, etc. which are usually modeled in
terms of a star schema in an RDBMS. The task is then to ask the same
basic questions about what people buy sliced and diced by or
conditioned on the associated data, like, "Do males in the Pacific
Northwest buy diapers and beer when beer is on sale?"
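
The basket question can be phrased as two map-reduce style counts: one
over single items and one over item pairs. The Python sketch below
runs entirely in memory and makes no claim about how this is best done
in Couch; the document shape and the function names are hypothetical.
It computes the confidence of a rule such as "diapers implies beer" as
count(diapers and beer) / count(diapers).

```python
from collections import Counter
from itertools import combinations


def map_items(doc):
    """Emit each distinct item in a basket once."""
    for item in set(doc["items"]):
        yield item, 1


def map_pairs(doc):
    """Emit each unordered pair of distinct items, with the pair sorted."""
    for a, b in combinations(sorted(set(doc["items"])), 2):
        yield (a, b), 1


def count(docs, map_fn):
    """Sum the emitted values per key (the job of a sum reducer)."""
    totals = Counter()
    for doc in docs:
        for key, value in map_fn(doc):
            totals[key] += value
    return totals


def confidence(docs, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    item_counts = count(docs, map_items)
    pair_counts = count(docs, map_pairs)
    if item_counts[antecedent] == 0:
        return 0.0
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


baskets = [
    {"items": ["diapers", "beer"]},
    {"items": ["diapers", "beer", "milk"]},
    {"items": ["diapers", "milk"]},
    {"items": ["beer"]},
    {"items": ["beer", "chips"]},
]
diapers_to_beer = confidence(baskets, "diapers", "beer")
beer_to_diapers = confidence(baskets, "beer", "diapers")
```

Conditioning on associated data (region, gender, promotion) could then
be a matter of prefixing the emitted keys with those fields, at the
cost of a larger index. Note that the number of pairs grows
quadratically with basket size.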

Is something like that an appropriate use case for Couch? It would be
awesome to have some guidance from the gurus on applications like
this, which are very different from either transaction processing or
the highly-available eventual-consistency use-cases often associated
with NoSQL.

Thanks!

--  Chris

Re: CouchDB for data mining

Posted by Randall Leeds <ra...@gmail.com>.
Remarks inline.

On Tue, Sep 28, 2010 at 13:40, Christopher Bare
<cb...@systemsbiology.org> wrote:
> Hi,

Hey there!

>
> I'm looking into CouchDB for a data mining application. I'm a noob, so
> I'm just getting an appreciation for the new (and very creative)
> approach taken with Couch. Please let me first verify that I have a
> few things straight:
>
> A view is a lot more like an index than a query in SQL terms. The keys
> emitted from the mapper are used to construct a b-tree. Aggregate
> values computed in the reducer may be hung on the higher nodes of the
> tree. Constructing this tree is an expensive operation, but read
> access is fast and it can be updated incrementally as the underlying
> data changes. (Baron Schwartz's A Gentle Introduction to CouchDB for
> Relational Practitioners explains this nicely.)

Exactly right. It's clear you've done your reading :).

>
> A view is formulated using the map-reduce (MR) pattern, which
> essentially divides a big job into lots of small independent subtasks.
> In Hadoop and Google's MR, that independence is used for parallelism
> in a distributed environment. Couch's use of MR is very different. I'm
> not sure how parallelism comes into play in Couch, but it seems to me
> a key feature of Couch is that the independence of MR is exploited to
> compute and cache partial results in the b-tree and to update them
> incrementally.

Correct. However, there's nothing about the design of couch's MR that
prevents parallelization, though right now it's only exploited by the
3rd-party clustering solutions.

>
> The target here is the "shit-loads of users" scenario where the cost
> of building and maintaining the view can be amortized over lots of
> read operations.
>
> Now, if that's all more-or-less right, how does that apply to data mining?
>
> In a data mining app, you typically have lots of ad-hoc queries.
> You'll read that Couch doesn't do ad-hoc queries, but I have a feeling
> that, if you're smart about it, you can create views that will serve
> as the basis for whole classes of queries. The view will do part of
> the work and your client code will have to do part as well. I haven't
> quite gotten my head around how this is done, nor around how Couch's
> list functions might fit into the picture.

You're absolutely right about smart view design. Most questions get
resolved with some kind of smart view.
Anything that doesn't fit this can generally be solved with a little
more work and some add-ons. For example, full-text indexing (FTI) gets
you a bunch of ad-hoc queries you can't otherwise do, and there are
ways to add this to CouchDB today, though nothing in the official
source.

>
> It would be great to have an example data mining app for Couch. The
> classic textbook example is co-occurrence of items in a large database
> of grocery store shopping baskets. You ask questions like, "If a
> customer buys diapers, do they also buy beer?" It will come as little
> surprise to any new parents that, in fact, they do. In this case,
> your documents would consist of a set of purchased items and
> associated information like customer demographics, geographic
> information, sales and promotions, etc. which are usually modeled in
> terms of a star schema in an RDBMS. The task is then to ask the same
> basic questions about what people buy sliced and diced by or
> conditioned on the associated data, like, "Do males in the Pacific
> Northwest buy diapers and beer when beer is on sale?"

I don't have any links offhand, but I know there have been some blog
posts about some of these topics.
If you do the research and come up with a nice list please start a
wiki page to collect them. I think it'd be a great resource.

>
> Is something like that an appropriate use case for Couch? It would be
> awesome to have some guidance from the gurus on applications like
> this, which are very different from either transaction processing or
> the highly-available eventual-consistency use-cases often associated
> with NoSQL.

I don't see any reason yet why you *shouldn't* use CouchDB. However,
you're still pretty early to the big data party, so it'll probably
take some trailblazing.

>
> Thanks!
>
> --  Chris
>

You're totally welcome!
Feel free to keep the questions coming or hop on #couchdb on freenode
if you want more real-time feedback from the community and devs.

-Randall