Posted to user@couchdb.apache.org by Rob Crowell <ro...@gmail.com> on 2011/11/15 18:17:19 UTC

Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Hello everyone,

I'm writing some log-parsing code which is currently running on a
MySQL backend.  We're doing a huge amount of aggregates on this data
right now, but performance is suffering and I'm looking for
alternatives.  The idea of incremental map/reduce initially seemed
like the exact right thing, but I can't seem to express some of the
most important queries we are currently running in our production
system.

We're running a lot of queries of the SELECT COUNT(DISTINCT song_id)
WHERE user_id = "boris" AND created >= "2010-01-01" AND created <
"2010-02-01" variety.  Currently in MySQL-land we've got a cron job to
pre-compute these aggregates (it checks modified timestamps and pulls
in only new records) and write them to a summary table.  I initially
believed I could use CouchDB's incremental map/reduce to effortlessly
build and update our "summary information" as it changes, but I'm
stuck.  I'm trying to relax, but I can't figure out exactly how :)

In our example, our user "boris" listens to the same song many times
each month, and we're interested in the number of distinct songs he's
listened to during a specified time period (NOT the number of song
plays, but the number of distinct songs played).  In CouchDB it isn't
much trouble to get all of the unique songs that he's listened to
during a period.  Here's our document:

{
  song_id: "happy birthday",
  user_id: "boris",
  date_played: [2011, 11, 14, 0, 12, 55],
  _id: ...
}

To get the unique values, all we need to do is emit([doc.user_id,
doc.date_played, doc.song_id], null), reduce with _count, and query
with startkey=["boris", [2011, 1, 1]]&endkey=["boris",
[2011, 2, 1]]&group_level=1.  This query will yield results like:

["boris", "happy birthday"], 20
["boris", "yesterday"], 14
...
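As an aside, a key of [user_id, date_played, song_id] sorts by date before song, so a date-range filter and a per-song grouping pull in different directions. One internally consistent variant truncates the date to month granularity; here is a sketch of that variant's arithmetic in plain JS (document fields taken from the example above, the view machinery stubbed out):

```javascript
// Sketch: month-granularity view simulated in plain JS.
// Map: emit([user_id, year, month, song_id], null); reduce: _count.
// With group_level=4, each result row is one distinct (user, month, song),
// so the number of rows in a month range is the distinct-song count.

const docs = [
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 14, 0, 12, 55] },
  { user_id: "boris", song_id: "happy birthday", date_played: [2011, 11, 20, 9, 0, 0] },
  { user_id: "boris", song_id: "yesterday",      date_played: [2011, 11, 2, 18, 30, 1] },
];

// map + group_level=4 + _count, stubbed:
const groups = new Map();
for (const doc of docs) {
  const key = JSON.stringify([doc.user_id, doc.date_played[0], doc.date_played[1], doc.song_id]);
  groups.set(key, (groups.get(key) || 0) + 1);
}

// "Rows" for boris in 2011-11: one per distinct song.
const monthRows = [...groups.keys()].filter(k => k.startsWith('["boris",2011,11,'));
const distinctSongs = monthRows.length; // 2
```

Note this still returns one row per distinct song; the rows must be fetched (or counted server-side) to get the scalar.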

However, if our user has listened to 50,000 distinct songs during the
date range, we'll get back 50,000 rows, which seems expensive.  What I
want is just the scalar 50,000.  I've tried writing a reduce that
returns the set of distinct song_ids for each user (turning the values
list into a dictionary and back again), but CouchDB complains that I
am not reducing my values fast enough :-/  I'm also not sure how to
reduce this list to a scalar at the end without returning the whole
thing to my client (which defeats the purpose of all this anyway).
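For concreteness, the kind of reduce described here (returning the set of distinct song_ids) looks roughly like the sketch below. Its output grows with the number of distinct songs instead of shrinking, which is exactly what CouchDB's reduce-overflow check rejects:

```javascript
// Sketch of the problematic reduce: it returns a set of song_ids, so
// its output grows with the number of distinct songs instead of shrinking.
function reduceDistinct(keys, values, rereduce) {
  const seen = {};
  if (rereduce) {
    // Merge the partial sets produced by earlier reduce calls.
    for (const partial of values)
      for (const song of Object.keys(partial)) seen[song] = true;
  } else {
    // keys is a list of [key, docid] pairs; the song is the last key element.
    for (const [key] of keys) seen[key[key.length - 1]] = true;
  }
  return seen;
}

// Three plays, two distinct songs:
const out = reduceDistinct(
  [
    [["boris", [2011, 11, 14], "happy birthday"], "doc1"],
    [["boris", [2011, 11, 20], "happy birthday"], "doc2"],
    [["boris", [2011, 11, 2], "yesterday"], "doc3"],
  ],
  [null, null, null],
  false
);
const distinct = Object.keys(out).length; // 2
```

The size of `out` is proportional to the number of distinct songs, which is why CouchDB complains once the sets get large.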

Is this possible to do in CouchDB today?  If not, is it something that
is on the roadmap, or does the internal structure of CouchDB's b-tree
make this really hard to do?  It would of course be possible for me to
implement this myself (subscribe to the update notifications and
update my counts as appropriate in a custom script), but I wanted to
move to CouchDB so that I wouldn't have to do all this myself.

Thanks for any advice!

--Rob

Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Alon Keren <al...@gmail.com>.
Hi,

Here's an incremental alternative, using Couch Incarnate (
https://github.com/afters/Couch-Incarnate). Keep in mind that the project
is still experimental.

I'd first use Couch Incarnate to create a DB holding only the unique names:
{
  "source": YOUR_ORIGINAL_DB_URL,
  "map": "function (doc) { emit([doc.user_id, doc.date_played[0],
doc.date_played[1], doc.date_played[2], doc.song_id], null); }",
  "reduces": {
    "count": {
      "function": "function (key, values, rereduce) { return null; }",
      "group_levels": ["exact"]
    }
  }
}

Documents in this DB would look like this:
{
  key: ["boris", 2011, 11, 14, "happy birthday"],
  value: null
}

Then I'd simply count the docs for a specific date-range with this view:

  map: function (doc) {
    emit(doc.key, null);
  }

  reduce: _count

querying with:

  /?start_key='["boris", 2011, 11, 14]'&end_key='["boris", 2011, 11, 14,
{}]'&group_level=4


The above would work when the time frame is days. If you want to do the
same with months (or years), you'd have to create another DB using
Couch-Incarnate.
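Assuming the generated DB really does hold one document per unique [user, year, month, day, song] key, the second view's arithmetic can be checked in plain JS (the data below is hypothetical):

```javascript
// Sketch: docs in the Incarnate-generated DB, one per unique key.
const uniqueDocs = [
  { key: ["boris", 2011, 11, 14, "happy birthday"], value: null },
  { key: ["boris", 2011, 11, 14, "yesterday"],      value: null },
  { key: ["boris", 2011, 11, 15, "yesterday"],      value: null },
];

// map: emit(doc.key, null); reduce: _count; group_level=4 -> [user, y, m, d]
const perDay = new Map();
for (const doc of uniqueDocs) {
  const day = JSON.stringify(doc.key.slice(0, 4));
  perDay.set(day, (perDay.get(day) || 0) + 1);
}

const borisNov14 = perDay.get(JSON.stringify(["boris", 2011, 11, 14])); // 2
```

Because each input doc already represents one distinct (user, day, song), _count at group_level=4 is directly a distinct-song count per day.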

  Alon


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Rob Crowell <ro...@gmail.com>.
Hey Zach,

That looks really good!  Thanks for the pointer, I'd read about list
functions but never thought about using them in this way.

From what I read, it seems there is no way to build this list function
incrementally?

--Rob



Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Zachary Zolton <za...@gmail.com>.
Rob

Since you've already got a view that'll give you Boris's 50k unique
songs, you could use a _list function to return the number of rows.

Something like this should do the trick:

function(head, req) {
  var count = 0;
  while(getRow()) count++;
  return JSON.stringify({count: count});
}

If you query this list function with the same view, key range, and
group level, it'll just respond with a bit of JSON, such as:
{"count":50612}
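The list function can be exercised outside CouchDB by stubbing getRow() with a fixed row source (the rows below are hypothetical):

```javascript
// Stub CouchDB's getRow() with a fixed set of grouped view rows.
const rows = [
  { key: ["boris", "happy birthday"], value: 20 },
  { key: ["boris", "yesterday"], value: 14 },
];
let i = 0;
function getRow() { return i < rows.length ? rows[i++] : null; }

// The list function body, unchanged:
function countRows() {
  var count = 0;
  while (getRow()) count++;
  return JSON.stringify({ count: count });
}

const body = countRows(); // '{"count":2}'
```

Inside CouchDB the function receives rows from the underlying view query, so the same key range and group_level parameters apply.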

Is that more like what you're looking for?

You can read up more here:
http://guide.couchdb.org/draft/transforming.html
http://wiki.apache.org/couchdb/Formatting_with_Show_and_List

--Zach


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Rob Crowell <ro...@gmail.com>.
I don't think this works, unless I am misunderstanding.

If our user "boris" listened to the same song 20 times, and only
listened to that one song, the _count reduce would return 20, would it
not?  I would like the value 1 instead (only 1 distinct song listened
to).

--Rob
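The objection is easy to check: the built-in _count tallies emitted rows, i.e. one per play, not one per distinct song. A quick sketch:

```javascript
// 20 plays of the same song by "boris": emit(doc.user_id, null) once per play.
const emitted = [];
for (let i = 0; i < 20; i++) emitted.push({ key: "boris", value: null });

// The built-in _count reduce is just the number of rows in the group:
const count = emitted.filter(r => r.key === "boris").length; // 20 plays, 1 distinct song
```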


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Robert Newson <rn...@apache.org>.
then just emit(doc.user_id, null) and use _count?

B.


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Marcello Nuccio <ma...@gmail.com>.
In this case, I would ask if the date ranges are predetermined. For
example, if you want the number of distinct songs for a user in a
given month, then you can:

    emit([month, user, song], 1);

and then count each time the song changes.
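Since view rows come back sorted by key, the distinct count is just the number of times the song component changes from one row to the next. A sketch of that pass (row data hypothetical):

```javascript
// Rows sorted by [month, user, song], as a CouchDB view would return them.
const rows = [
  { key: ["2011-11", "boris", "happy birthday"], value: 1 },
  { key: ["2011-11", "boris", "happy birthday"], value: 1 },
  { key: ["2011-11", "boris", "yesterday"], value: 1 },
  { key: ["2011-11", "boris", "yesterday"], value: 1 },
  { key: ["2011-11", "boris", "yesterday"], value: 1 },
];

// Count a song only when it differs from the previous row's song.
let distinct = 0, prevSong = null;
for (const row of rows) {
  const song = row.key[2];
  if (song !== prevSong) { distinct++; prevSong = song; }
}
// distinct === 2
```

In practice this pass would live in a _list function (or in the client), walking the sorted rows once.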

Marcello


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Zachary Zolton <za...@gmail.com>.
Except that the OP specified the need to query for these counts within
a date range. So, you have to collate by listening time, not the song.


Re: Writing Advanced Aggregate Queries -- Equivalent of SELECT COUNT(DISTINCT field) in CouchDB?

Posted by Marcello Nuccio <ma...@gmail.com>.
Hi Rob,
I remember doing something similar a while ago, but I can't find the
code and don't have time to rewrite it right now... however, the trick
is to count only when the song name changes. This works because view
rows are sorted.

HTH,
  Marcello
