Posted to user@couchdb.apache.org by Calle Dybedahl <ca...@init.se> on 2011/09/19 13:27:48 UTC

Post-filtering reduced results?

Hello.

I have a pretty simple pair of map and reduce functions. The map function basically just emits a key and a 1, and the reduce is the built-in _sum function. This works fine, and tells me how many times each key has been seen.
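
A minimal sketch of a view of this shape, for reference. The field name `doc.tag` and the sample documents are hypothetical, and `emit` is stubbed here only so the snippet runs outside CouchDB; in a design document the reduce would simply be the string "_sum".

```javascript
// Stand-in for CouchDB's emit(), so the sketch can run standalone.
const emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// Map: one row per document, keyed on the thing being counted.
// `doc.tag` is a hypothetical field name.
function map(doc) {
  if (doc.tag) {
    emit(doc.tag, 1);
  }
}

// Plain-JavaScript equivalent of what the built-in _sum does:
// add up the values, both on reduce and on rereduce.
function sum(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Hypothetical sample documents.
[{ tag: "common" }, { tag: "common" }, { tag: "rare" }, {}].forEach(map);
```

Querying such a view with group=true returns one row per key, which is where the long tail of rows with a count of 1 comes from.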

Now, the problem is that I'm actually only interested in the handful of keys that have been seen the most often. The data fits a power-law distribution, which means that there is a long tail that I'm not at all interested in. And by "long" here I'm talking about tens of thousands of rows. At the moment, my client-side code spends more than 99.9% of its runtime receiving and parsing JSON from the CouchDB server, very nearly all of which it will promptly throw away as soon as it's been parsed. This is annoying and silly.

Is there any way at all to filter the results of a reduced query on the CouchDB end? Alternatively, is there a way for a reduce function to know that it's the final stage in the re-reduce chain (if I could drop all keys with a final value of 1, I'd save an order of magnitude of runtime)?

I can't be the first one ever to run into a problem like this, but I've failed to find any solutions on the net.
-- 
Calle Dybedahl
calle@init.se -*- +46 703 - 970 612




Re: Post-filtering reduced results?

Posted by Mehdi El Fadil <me...@mango-is.com>.
Hi Calle,

At the very least, you could move the post-processing to the server side
using a list function: http://wiki.apache.org/couchdb/Formatting_with_Show_and_List
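
A rough sketch of what such a list function might look like. The `min` query parameter and the function name are made up for illustration, and `getRow`/`send` are stubbed below only so the snippet runs outside CouchDB (inside CouchDB they are provided by the list-function environment). Note that the server still walks every reduced row; the saving is in what gets sent over the wire and parsed by the client.

```javascript
// Stand-ins for the CouchDB list environment, with hypothetical sample rows.
const rows = [
  { key: "a", value: 120 },
  { key: "b", value: 1 },
  { key: "c", value: 7 }
];
let cursor = 0;
const output = [];
function getRow() { return cursor < rows.length ? rows[cursor++] : null; }
function send(chunk) { output.push(chunk); }

// The list function itself: pass through only rows whose reduced
// value is at least `min` (a hypothetical query parameter).
function topOnly(head, req) {
  var min = parseInt(req.query.min || "2", 10);
  var first = true;
  var row;
  send('{"rows":[');
  while ((row = getRow())) {
    if (row.value >= min) {
      send((first ? "" : ",") + JSON.stringify(row));
      first = false;
    }
  }
  send("]}");
}

// Simulated request with min=5.
topOnly({}, { query: { min: "5" } });
const result = JSON.parse(output.join(""));
```

In a real design document this would be queried through the _list endpoint with group=true, so that the list function sees one row per key.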

A better option for performance is to do the filtering inside the reduce
function. Take a look at this snippet, which looks close to what you are
trying to achieve:
http://wiki.apache.org/couchdb/View_Snippets#Retrieve_the_top_N_tags
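
To give an idea of the approach, here is a sketch of a "top N" reduce. It is not the code from the wiki page, just an illustration written for this thread; `MAX` and the function name are made up, and it assumes the emitted keys are strings. It is meant to be queried without grouping, so the whole view reduces to one small object. Because every stage prunes to the top `MAX` entries, a key whose count is spread thinly over many partial reductions could be dropped early, so the result should be treated as approximate; with a strongly skewed distribution that is probably acceptable.

```javascript
var MAX = 3; // hypothetical N: how many of the most frequent keys to keep

function topN(keys, values, rereduce) {
  var counts = {};
  var i, k;
  if (rereduce) {
    // values are objects returned by earlier topN calls: merge them.
    for (i = 0; i < values.length; i++) {
      for (k in values[i]) {
        counts[k] = (counts[k] || 0) + values[i][k];
      }
    }
  } else {
    // keys are [emittedKey, docId] pairs; values are the emitted 1s.
    for (i = 0; i < keys.length; i++) {
      k = keys[i][0];
      counts[k] = (counts[k] || 0) + values[i];
    }
  }
  // Keep only the MAX keys with the highest counts.
  var top = Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a];
  }).slice(0, MAX);
  var out = {};
  top.forEach(function (name) { out[name] = counts[name]; });
  return out;
}

// Simulated reduce and rereduce calls with hypothetical data.
var partial = topN([["a", "d1"], ["a", "d2"], ["b", "d3"]], [1, 1, 1], false);
var merged = topN(null, [partial, { a: 3, c: 1, d: 5, e: 1 }], true);
```

Since the returned object never holds more than `MAX` entries, the reduce output stays small regardless of how long the tail is.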

Good luck,

Mehdi





-- 
Mehdi El Fadil
twitter: @mango_info <http://www.twitter.com/mango_info>
website: http://www.mango-is.com
linkedin: http://be.linkedin.com/in/elfadme