Posted to user@couchdb.apache.org by Borja Martín <bo...@dagi3d.net> on 2010/03/01 14:33:46 UTC

counting tags within a date range

Hi,
I have these documents :

{ "created_at": "20100301", "tag": "foo" },
{ "created_at": "20100301", "tag": "bar" },
{ "created_at": "20100301", "tag": "foo-bar" },
{ "created_at": "20100302", "tag": "foo" }

and what I want is to retrieve the documents within a date range and count
how many times each tag appears globally, not just per date. I should
get something like this:
{ "foo" : 2, "bar" : 1, "foo-bar" : 1}

So, in the first attempt, I wrote the following map/reduce functions:
// map
function(doc) {
  emit([doc.created_at,doc.tag],1);
}
// reduce
function (key, values, rereduce) {
  return sum(values);
}

Obviously this didn't work as the documents are grouped by the whole key and
if I set the group_level to 1, the documents are grouped only by the date:

/_design/tags/_view/popular?startkey=["20100301",null]&endkey=["20100302",{}]&group=true&group_level=1

{"rows":[
  {"key":["20100301"],"value":3},
  {"key":["20100302"],"value":1}
]}
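
To make the grouping behaviour concrete, here is a small stand-alone simulation in plain JavaScript (not CouchDB itself) of what group_level does with the rows from the first map function. The groupRows helper is made up for illustration; it only assumes that group_level=N groups rows by the first N elements of an array key and sums each group.

```javascript
// Rows as the first map function would emit them: key = [created_at, tag], value = 1.
var rows = [
  { key: ["20100301", "bar"], value: 1 },
  { key: ["20100301", "foo"], value: 1 },
  { key: ["20100301", "foo-bar"], value: 1 },
  { key: ["20100302", "foo"], value: 1 }
];

// Hypothetical helper: group rows by the first `level` key elements and sum the values.
function groupRows(rows, level) {
  var groups = {};
  var order = [];
  rows.forEach(function (row) {
    var prefix = row.key.slice(0, level);
    var id = JSON.stringify(prefix);
    if (!(id in groups)) {
      groups[id] = { key: prefix, value: 0 };
      order.push(id);
    }
    groups[id].value += row.value;
  });
  return order.map(function (id) { return groups[id]; });
}

console.log(JSON.stringify(groupRows(rows, 1)));
// [{"key":["20100301"],"value":3},{"key":["20100302"],"value":1}]
```

With level 1 the tag is cut off the key before grouping, which is why only the per-date totals survive.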

Then I changed the emit function by setting the tag as the first position
for the key:
...
emit([doc.tag,doc.created_at],1);

Now I can group the results and get how many times each tag appears:

/_design/tags/_view/popular?group=true&group_level=1

{"rows":[
{"key":["bar"],"value":1},
{"key":["foo"],"value":2},
{"key":["foo-bar"],"value":1}
]}


The problem with this is that I can't restrict the query to a date range:
I always get all the documents, regardless of the second part of the
key:

/_design/tags/_view/popular?group=true&group_level=1&startkey=[null,"20100302"]&endkey=[{},"20100302"]

{"rows":[
{"key":["bar"],"value":1},
{"key":["foo"],"value":2},
{"key":["foo-bar"],"value":1}
]}

when I should get something like
{"rows":[
{"key":["foo"],"value":1}
]}

I was wondering if this is the normal collation behaviour or whether I'm
missing something, and if there is any way to achieve this. I also tried
setting the collation option to 'raw' in the view definition just in case,
but I got the same results.

Thanks in advance.

Regards



-- 
def dagi3d(me)
 case me
   when :web then  "http://dagi3d.net"
   when :twitter then "http://twitter.com/dagi3d"
 end
end

Re: Relational-esque behavior in CouchDB

Posted by Paul Davis <pa...@gmail.com>.
Mike,

It would be quite difficult, if not impossible, to build the
derivative view incrementally. Allowing map functions to pull data
from a random location would break the referential transparency
requirement.
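
In practice that leaves the client-side join described below. A minimal sketch of merging the two view results, assuming (hypothetically) that the aggregate view returns rows keyed by keyword_id and that the lookup view returns rows with key = keyword_id and value = keyword name. The joinKeywordNames helper and the sample data are invented for illustration and are not a CouchDB API. The names for just the ids you need can be fetched in a single request by POSTing {"keys": [...]} to the lookup view.

```javascript
// Hypothetical result rows from the two views (the shapes are assumptions).
var aggregateRows = [
  { key: 101, value: { clicks: 40 } },
  { key: 102, value: { clicks: 7 } }
];
var nameRows = [
  { key: 101, value: "red shoes" },
  { key: 102, value: "blue shoes" }
];

// Attach keyword_name to each aggregate row using an id -> name lookup table.
function joinKeywordNames(aggregateRows, nameRows) {
  var names = {};
  nameRows.forEach(function (row) {
    names[row.key] = row.value;
  });
  return aggregateRows.map(function (row) {
    var known = Object.prototype.hasOwnProperty.call(names, row.key);
    return {
      keyword_id: row.key,
      keyword_name: known ? names[row.key] : null, // null when the id has no name entry
      data: row.value
    };
  });
}

console.log(JSON.stringify(joinKeywordNames(aggregateRows, nameRows)));
```

Because the name is looked up at read time, renaming a keyword only touches the lookup data and never the aggregated documents.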

HTH,
Paul Davis

On Mon, Mar 1, 2010 at 10:41 AM, Mike Keen <mk...@visiture.com> wrote:
> Hey,
>
> I am managing a database that has hundreds of thousands of documents (so far), all containing a field called "keyword_id". I am not storing the "keyword_name" inside of each individual document for several reasons. The biggest reason being that the name could change at any time, but I don't want old documents to become outdated. I still want them to be associated with one another.
>
> Anyway, I am wondering if there is currently a special function within the CouchDB JavaScript view server to pull data across views. I have a view that serves as an index of keyword_ids/names. I also have another view that aggregates keyword data, but can't return the keyword names. Currently, I just run two requests to CouchDB. One to retrieve all aggregated keyword data, and another to retrieve the relevant keyword names as they relate to the ids.
>
> I am assuming that the way I am doing it is currently the best way, but wanted to see if it was possible to run something like:
>
> function(doc) {
>  var keyword_name = view("/lu_persistence/_design/keywords/_view/keyword_id_name_map?key=" + doc.keyword_id);
>  doc.keyword_name = keyword_name;
>  emit(null, doc);
> }
>
> Maybe I'm just nuts, but I think this would be ridiculously useful. And if it's already in there, and someone could point me in the right direction, I'd greatly appreciate it.
>
> Thanks a million,
> Mike Keen
>

Re: Relational-esque behavior in CouchDB

Posted by J Chris Anderson <jc...@gmail.com>.
On Mar 1, 2010, at 9:44 AM, Mike Keen wrote:

> Chris,
> 
> It's not a feature of our application. It's a consequence of dealing with the Google AdWords API.
> 

The relational model strikes back! ;)

> Thanks anyway,
> Mike
> 
> On Mar 1, 2010, at 11:55 AM, J Chris Anderson wrote:
> 
>> 
>> On Mar 1, 2010, at 7:41 AM, Mike Keen wrote:
>> 
>>> Hey,
>>> 
>>> I am managing a database that has hundreds of thousands of documents (so far), all containing a field called "keyword_id". I am not storing the "keyword_name" inside of each individual document for several reasons. The biggest reason being that the name could change at any time, but I don't want old documents to become outdated. I still want them to be associated with one another.
>>> 
>> 
>> I'd suggest, if possible (and I know you already have a bunch of data, so maybe not for you, but for people coming along on new projects) to store your keywords on your docs and drop the "feature" that keywords can be renamed at any time.
>> 
>> Chris
>> 
>>> Thanks a million,
>>> Mike Keen
>> 
> 


Re: Relational-esque behavior in CouchDB

Posted by Mike Keen <mk...@visiture.com>.
Chris,

It's not a feature of our application. It's a consequence of dealing with the Google AdWords API.

Thanks anyway,
Mike

On Mar 1, 2010, at 11:55 AM, J Chris Anderson wrote:

> 
> On Mar 1, 2010, at 7:41 AM, Mike Keen wrote:
> 
>> Hey,
>> 
>> I am managing a database that has hundreds of thousands of documents (so far), all containing a field called "keyword_id". I am not storing the "keyword_name" inside of each individual document for several reasons. The biggest reason being that the name could change at any time, but I don't want old documents to become outdated. I still want them to be associated with one another.
>> 
> 
> I'd suggest, if possible (and I know you already have a bunch of data, so maybe not for you, but for people coming along on new projects) to store your keywords on your docs and drop the "feature" that keywords can be renamed at any time.
> 
> Chris
> 
>> Thanks a million,
>> Mike Keen
> 


Re: Relational-esque behavior in CouchDB

Posted by J Chris Anderson <jc...@gmail.com>.
On Mar 1, 2010, at 7:41 AM, Mike Keen wrote:

> Hey,
> 
> I am managing a database that has hundreds of thousands of documents (so far), all containing a field called "keyword_id". I am not storing the "keyword_name" inside of each individual document for several reasons. The biggest reason being that the name could change at any time, but I don't want old documents to become outdated. I still want them to be associated with one another.
> 

I'd suggest, if possible (and I know you already have a bunch of data, so maybe not for you, but for people coming along on new projects) to store your keywords on your docs and drop the "feature" that keywords can be renamed at any time.

Chris

> Thanks a million,
> Mike Keen


Relational-esque behavior in CouchDB

Posted by Mike Keen <mk...@visiture.com>.
Hey,

I am managing a database that has hundreds of thousands of documents (so far), all containing a field called "keyword_id". I am not storing the "keyword_name" inside of each individual document for several reasons. The biggest reason being that the name could change at any time, but I don't want old documents to become outdated. I still want them to be associated with one another.

Anyway, I am wondering if there is currently a special function within the CouchDB JavaScript view server to pull data across views. I have a view that serves as an index of keyword_ids/names. I also have another view that aggregates keyword data, but can't return the keyword names. Currently, I just run two requests to CouchDB. One to retrieve all aggregated keyword data, and another to retrieve the relevant keyword names as they relate to the ids.

I am assuming that the way I am doing it is currently the best way, but wanted to see if it was possible to run something like:

function(doc) {
  var keyword_name = view("/lu_persistence/_design/keywords/_view/keyword_id_name_map?key=" + doc.keyword_id);
  doc.keyword_name = keyword_name;
  emit(null, doc);
}

Maybe I'm just nuts, but I think this would be ridiculously useful. And if it's already in there, and someone could point me in the right direction, I'd greatly appreciate it.

Thanks a million,
Mike Keen

Re: counting tags within a date range

Posted by Borja Martín <bo...@dagi3d.net>.
Hi,
thanks for your answers. I finally used a _list function, and I even used it
to normalize the tag counts, so it worked fine.

Mario,
if I set the group_level option to 2, I still get undesired results, as the
second part of the key seems to be ignored due to the view collation behaviour.
And it shouldn't matter whether I use a string or an integer type for the date,
because the order should be the same in this case (maybe an integer is more
efficient in terms of disk usage, but it shouldn't affect the ordering of the
results).
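
For anyone reading along, here is a rough model of why the second key element appears to be ignored. It is a plain JavaScript sketch, not CouchDB's own code: array keys are compared element by element, so the first element decides the order unless it ties. Every tag sorts after null and before {}, so with startkey=[null,...] and endkey=[{},...] every row is inside the range whatever its date. The typeRank, collate and inRange helpers are made up, and string comparison is simplified (CouchDB uses ICU collation for strings).

```javascript
// Simplified collation order: null < booleans < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (typeof v === "boolean") return 1;
  if (typeof v === "number") return 2;
  if (typeof v === "string") return 3;
  if (Array.isArray(v)) return 4;
  return 5; // objects
}

// Compare two keys; arrays are compared element by element, left to right.
function collate(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 4) {
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  if (ra === 0 || ra === 5) return 0; // nulls are equal; object contents are not modelled here
  return a < b ? -1 : a > b ? 1 : 0;  // plain comparison, an assumption for simplicity
}

function inRange(key, startkey, endkey) {
  return collate(startkey, key) <= 0 && collate(key, endkey) <= 0;
}

var keys = [
  ["bar", "20100301"],
  ["foo", "20100301"],
  ["foo", "20100302"],
  ["foo-bar", "20100301"]
];

// The date in the second position never gets a chance to exclude anything.
keys.forEach(function (key) {
  console.log(JSON.stringify(key), inRange(key, [null, "20100302"], [{}, "20100302"]));
});
```

The date only starts to matter when the first element is pinned, for example startkey=["foo","20100302"] and endkey=["foo","20100302"], which is one range per tag.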

Regards

> did you try group_level=2 ? so you get it for every null+created
> combination?

> The other thing i am looking at, is that you are using numbers as strings i
> guess.
>
>
>
>> {"rows":[
>> {"key":["bar"],"value":1},
>> {"key":["foo"],"value":2},
>> {"key":["foo-bar"],"value":1}
>> ]}
>>
>> when I should get something like
>> {"rows":[
>> {"key":["foo"],"value":1}
>> ]}
>>
>> I was wondering if this is the normal collation behaviour or I'm missing
>> anything and if there is any way to achieve this. I also tried to add the
>> collation option to 'raw' in the view definition just in case, but I got
>> the
>> same results.
>>
>> Thanks in advance.
>>
>> Regards
>>
>>
>>


-- 
def dagi3d(me)
 case me
   when :web then  "http://dagi3d.net"
   when :twitter then "http://twitter.com/dagi3d"
 end
end

Re: counting tags within a date range

Posted by Mario Scheliga <ma...@sourcegarden.de>.
On 01.03.2010 at 14:33, Borja Martín wrote:

> Hi,
> I have these documents :
>
> { "created_at": "20100301", "tag": "foo" },
> { "created_at": "20100301", "tag": "bar" },
> { "created_at": "20100301", "tag": "foo-bar" },
> { "created_at": "20100302", "tag": "foo" }
>
> and what I want is to retrieve the documents within a date range and count
> how many times does each tag appear globally, not just by its date. I should
> get something like this:
> { "foo" : 2, "bar" : 1, "foo-bar" : 1}
>
> So, in the first attempt, I wrote the following map/reduce functions:
> // map
> function(doc) {
>  emit([doc.created_at,doc.tag],1);
> }
> // reduce
> function (key, values, rereduce) {
>  return sum(values);
> }
>
> Obviously this didn't work as the documents are grouped by the whole key and
> if I set the group_level to 1, the documents are grouped only by the date:
>
> /_design/tags/_view/popular?startkey=["20100301",null]&endkey=["20100302",{}]&group=true&group_level=1
>
> {"rows":[
>  {"key":["20100301"],"value":3},
>  {"key":["20100302"],"value":1}
> ]}
>
> Then I changed the emit function by setting the tag as the first position
> for the key:
> ...
> emit([doc.tag,doc.created_at],1);
>
> Now I can group the results and get how many times each tag appears:
>
> /_design/tags/_view/popular?group=true&group_level=1
>
> {"rows":[
> {"key":["bar"],"value":1},
> {"key":["foo"],"value":2},
> {"key":["foo-bar"],"value":1}
> ]}
>
>
> The problem with this, is that I can't restrict the query to certain date
> ranges as I get always all the documents in spite of the second part of the
> key:
>
> /_design/tags/_view/popular?group=true&group_level=1&startkey=[null,"20100302"]&endkey=[{},"20100302"]

Did you try group_level=2, so that you get a row for every null+created
combination?
The other thing I am looking at is that you are using numbers as
strings, I guess.

>
> {"rows":[
> {"key":["bar"],"value":1},
> {"key":["foo"],"value":2},
> {"key":["foo-bar"],"value":1}
> ]}
>
> when I should get something like
> {"rows":[
> {"key":["foo"],"value":1}
> ]}
>
> I was wondering if this is the normal collation behaviour or I'm missing
> anything and if there is any way to achieve this. I also tried to add the
> collation option to 'raw' in the view definition just in case, but I got the
> same results.
>
> Thanks in advance.
>
> Regards
>
>
>


--
Sourcegarden GmbH HR: B-104357
Steuernummer: 37/167/21214 USt-ID: DE814784953
Geschaeftsfuehrer: Mario Scheliga, Rene Otto
Bank: Deutsche Bank, BLZ: 10070024, KTO: 0810929
Schoenhauser Allee 51, 10437 Berlin


Re: counting tags within a date range

Posted by Brian Candler <B....@pobox.com>.
On Mon, Mar 01, 2010 at 02:33:46PM +0100, Borja Martín wrote:
> I have these documents :
> 
> { "created_at": "20100301", "tag": "foo" },
> { "created_at": "20100301", "tag": "bar" },
> { "created_at": "20100301", "tag": "foo-bar" },
> { "created_at": "20100302", "tag": "foo" }
> 
> and what I want is to retrieve the documents within a date range and count
> how many times does each tag appear globally, not just by its date. I should
> get something like this:
> { "foo" : 2, "bar" : 1, "foo-bar" : 1}

If the number of distinct tags in your database is small (say < 100), then
you can use a reduce function to build a map of {tag:count} explicitly. Then
a grouped query across any range of dates will give you the map you are
looking for.
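
A rough sketch of that first option, under a few assumptions of mine: the map function emits the date alone as the key and the tag as the value, and the view is queried with startkey/endkey and no grouping so that the whole range reduces to one row. The harness at the bottom (the emit stand-in and the manual reduce/rereduce calls) only exists so the sketch runs outside CouchDB.

```javascript
// Map: key is the date only, value is the tag.
var map = function (doc) {
  emit(doc.created_at, doc.tag);
};

// Reduce: build {tag: count}; on rereduce, merge the partial maps.
var reduce = function (keys, values, rereduce) {
  var counts = {};
  function add(tag, n) {
    var seen = Object.prototype.hasOwnProperty.call(counts, tag);
    counts[tag] = (seen ? counts[tag] : 0) + n;
  }
  values.forEach(function (v) {
    if (rereduce) {
      for (var tag in v) { add(tag, v[tag]); } // v is a partial {tag: count} map
    } else {
      add(v, 1);                               // v is a single tag
    }
  });
  return counts;
};

// --- Stand-in harness, not part of the view ---
var emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

[
  { created_at: "20100301", tag: "foo" },
  { created_at: "20100301", tag: "bar" },
  { created_at: "20100301", tag: "foo-bar" },
  { created_at: "20100302", tag: "foo" }
].forEach(function (doc) { map(doc); });

// Select the date range, reduce two chunks separately, then rereduce them.
var values = emitted
  .filter(function (r) { return r.key >= "20100301" && r.key <= "20100302"; })
  .map(function (r) { return r.value; });
var partA = reduce(null, values.slice(0, 2), false);
var partB = reduce(null, values.slice(2), false);
var result = reduce(null, [partA, partB], true);

console.log(JSON.stringify(result)); // {"foo":2,"bar":1,"foo-bar":1}
```

The reduce value grows with the number of distinct tags, which is why this only suits a small tag set.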

Otherwise, I suggest you group by [date,tag] as before and do the summation
on the client side. That is, with a _count reduce function and
startkey=["20100301"]&endkey=["20100302",{}]&group=true
you should get
  "key":["20100301","foo"], "value":1
  "key":["20100301","bar"], "value":1
  "key":["20100301","foo-bar"], "value":1
  "key":["20100302","foo"], "value":1
and you can add the counts from the [x,tag] rows yourself. If you want to do
this server-side you can use a _list function to do the accumulation.
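
Both variants can be sketched roughly as follows. The tallyTags helper is a made-up name for the client-side summation. The list function relies on getRow() and send(), which CouchDB provides inside _list functions; the stand-ins at the bottom exist only so the sketch runs on its own. Depending on the CouchDB version, toJSON() may be the available serializer instead of JSON.stringify().

```javascript
// Rows as returned by the grouped [date, tag] query above.
var rows = [
  { key: ["20100301", "foo"], value: 1 },
  { key: ["20100301", "bar"], value: 1 },
  { key: ["20100301", "foo-bar"], value: 1 },
  { key: ["20100302", "foo"], value: 1 }
];

var has = Object.prototype.hasOwnProperty;

// Client side: add up the counts per tag, ignoring the date part of the key.
function tallyTags(rows) {
  var totals = {};
  rows.forEach(function (row) {
    var tag = row.key[1];
    totals[tag] = (has.call(totals, tag) ? totals[tag] : 0) + row.value;
  });
  return totals;
}

// Server side: the same accumulation written as a self-contained _list function.
var listFn = function (head, req) {
  var totals = {};
  var row;
  while ((row = getRow())) {
    var tag = row.key[1];
    var seen = Object.prototype.hasOwnProperty.call(totals, tag);
    totals[tag] = (seen ? totals[tag] : 0) + row.value;
  }
  send(JSON.stringify(totals));
};

// Stand-ins for CouchDB's getRow() and send(), only for running this outside CouchDB.
var queue = rows.slice();
var output = "";
function getRow() { return queue.length ? queue.shift() : null; }
function send(chunk) { output += chunk; }

listFn({}, {});
console.log(JSON.stringify(tallyTags(rows))); // {"foo":2,"bar":1,"foo-bar":1}
console.log(output);                          // {"foo":2,"bar":1,"foo-bar":1}
```

If the list function were stored under a hypothetical name such as "totals", it would be queried as /_design/tags/_list/totals/popular with the same startkey, endkey and group parameters as the view.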

Depending on how large your date ranges are, you can make more complex
solutions using larger buckets.  For example, have another view which emits
["201003","foo"] as a key, to allow you to sum all the tags in March 2010. 
So searching from 6 April 2009 to 5 April 2010 might require three queries
(one each for the partial months at each end, and one for the whole months
in between)
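
Working out the three ranges can be sketched like this. It is only an illustration: splitRange and its helpers are invented names, dates are assumed to be "YYYYMMDD" strings with start <= end, and the result still has to be turned into startkey/endkey queries against the day view and the month view.

```javascript
function pad(n) { return (n < 10 ? "0" : "") + n; }
function ym(year, month) { return "" + year + pad(month); }

// Number of days in a month (month is 1-12): day 0 of the next month.
function daysInMonth(year, month) {
  return new Date(Date.UTC(year, month, 0)).getUTCDate();
}

// Split [start, end] into a leading partial month, whole months, and a trailing partial month.
function splitRange(start, end) {
  var sy = +start.slice(0, 4), sm = +start.slice(4, 6), sd = +start.slice(6, 8);
  var ey = +end.slice(0, 4), em = +end.slice(4, 6), ed = +end.slice(6, 8);
  var startsOnFirst = sd === 1;
  var endsOnLast = ed === daysInMonth(ey, em);
  var sameMonth = sy === ey && sm === em;

  // A partial range inside one month needs only the day-level view.
  if (sameMonth && !(startsOnFirst && endsOnLast)) {
    return { head: [start, end], months: null, tail: null };
  }

  var head = null, tail = null;
  var first = { y: sy, m: sm }, last = { y: ey, m: em };
  if (!startsOnFirst) {
    head = [start, ym(sy, sm) + pad(daysInMonth(sy, sm))];
    first = sm === 12 ? { y: sy + 1, m: 1 } : { y: sy, m: sm + 1 };
  }
  if (!endsOnLast) {
    tail = [ym(ey, em) + "01", end];
    last = em === 1 ? { y: ey - 1, m: 12 } : { y: ey, m: em - 1 };
  }
  var firstKey = ym(first.y, first.m), lastKey = ym(last.y, last.m);
  var months = firstKey <= lastKey ? [firstKey, lastKey] : null;
  return { head: head, months: months, tail: tail };
}

console.log(JSON.stringify(splitRange("20090406", "20100405")));
// {"head":["20090406","20090430"],"months":["200905","201003"],"tail":["20100401","20100405"]}
```

The head and tail ranges go to the day-level view and the months range to the month-level view; the per-tag counts from the three results are then added together.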

HTH,

Brian.