Posted to user@couchdb.apache.org by Aroj George <ar...@gmail.com> on 2011/06/16 22:38:35 UTC

How best to workaround time filtering + aggregation

Hi All,

We have a requirement to run aggregations over a set of time series data
after filtering it by time, in response to a time range supplied by the user.

Size: 10 million documents.
Aggregations to be applied: Sum, Count, Average, Max, Min, Latest, etc.
Grouped by: Clinic Id, Field Name
*Doc Structure:*
{ "clinic_id": 10, "data": { "beds": 10, "meds": 20, "doctors": 30 }, "timestamp": T1 }

*View Structure:*
key: [timestamp, clinic_id, field_name], e.g. [T1, 10, "beds"]
value: field value, e.g. 10

The above view lets us filter by a time range, but not group by clinic id +
field name, because the timestamp is the first part of the key.
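
(For reference, a map function along these lines would produce that view; a
sketch, assuming the doc structure above.)

function (doc) {
  // One row per data field, keyed by time first so we can range-filter on it.
  if (doc.timestamp && doc.data) {
    for (var field in doc.data) {
      emit([doc.timestamp, doc.clinic_id, field], doc.data[field]);
    }
  }
}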

So it seems the only option is to do the time filtering in Couch and the
aggregations on the filtered set outside of Couch.

*Options:*

1. *Python*
Load the filtered set into the client (Python) and use Python's built-in
reduce functions like sum, count, etc.

Concerns: Loading a huge set of data into the Python client may not be
efficient.

2. *List functions*
Use the Couch view to do the time filter + do the aggregation in a list
function.
The key advantage here could be that everything is done in Couch. This is
not what a list function is meant for, but given that aggregation via
map/reduce is not possible in this case, could this be a reasonable solution?
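
Something like this, perhaps (a sketch; it assumes the time-keyed view above,
and the output shape is made up):

function (head, req) {
  // Stream the time-filtered rows, keeping only running sums
  // keyed by "clinic id/field name".
  var totals = {};
  var row;
  while ((row = getRow())) {
    var k = row.key[1] + '/' + row.key[2]; // clinic id, field name
    totals[k] = (totals[k] || 0) + row.value;
  }
  send(JSON.stringify(totals));
}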

Concerns:
Will the JavaScript list function be slower than even doing the aggregation
in Python?
Can I switch to an Erlang list function to speed things up?

3. *SQL*
Load the data into a SQL table at regular intervals (or via the _changes
feed) with the columns below, and use SQL's GROUP BY and SUM/COUNT to return
the query results.

Entity ID, Field Name, Field Value, Timestamp

SELECT entity_id, field_name, SUM(field_value)
FROM readings  -- "readings" is an illustrative table name
WHERE timestamp > T1 AND timestamp < T2
GROUP BY entity_id, field_name;
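
Keeping that table in sync could be a small poller on the _changes feed (a
sketch, assuming Node 18+ for fetch; the database URL and the upsertRow
helper are illustrative):

// Poll CouchDB's _changes feed and mirror field values into the SQL table.
async function syncOnce(since) {
  const url = 'http://localhost:5984/clinics/_changes?include_docs=true&since=' + since;
  const res = await fetch(url);
  const { results, last_seq } = await res.json();
  for (const change of results) {
    const doc = change.doc;
    if (!doc || !doc.data) continue;
    for (const field of Object.keys(doc.data)) {
      // upsertRow stands in for an INSERT ... ON CONFLICT DO UPDATE.
      upsertRow(doc.clinic_id, field, doc.data[field], doc.timestamp);
    }
  }
  return last_seq; // persist this and pass it back as `since` next time
}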

4. *Lucene*
I am not sure, but can something like Elastic Search
(http://www.elasticsearch.org/) or Solr help here?

5. *Multiple Queries*
Another option I could think of is to change the view key to

*[clinic id, field name, timestamp]*, i.e. add the timestamp at the end.

Now do a query for each clinic and field, like startkey = [clinic id, field
name, T1] and endkey = [clinic id, field name, T2] with group_level=2.
This will give me the result, but I will have to run (number of clinics *
number of fields) queries, which again will be too inefficient because of
that many HTTP calls. Unless I can make multiple such queries in a single
HTTP GET? Can I optimise this somehow?
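
Each query would look something like this (the design doc and view names are
made up, and it assumes the view has a _sum reduce so group_level applies):

GET /db/_design/stats/_view/by_clinic_field?startkey=[10,"beds",T1]&endkey=[10,"beds",T2]&group_level=2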

What do you guys think: which of the above is the best option? How would you
best solve this use case?

The question, essentially, is: what's the most efficient and scalable way to
perform real-time aggregations on a set of data with time as the filter
criterion?
The primary impact of the time filter is that we can't really do the
aggregations using the map/reduce views.

Rgds,
Aroj

Re: How best to workaround time filtering + aggregation

Posted by Aroj George <ar...@gmail.com>.
Thanks Cory.

I think the list function has an advantage in its getRow iterator, which
only loads one row at a time, if I am correct?
And because I am doing aggregations like sum, the data buffered by the list
function as it works through the rest of the rows in the view will always be
very small. So even for a million rows, the list function will never load
all of them at once, as the Python client would have to.

Guys, is the above assumption right?

The overhead with the list function, as I have found, is that it's quite
slow compared to load + aggregate in Python. I believe this has more to do
with the fact that the list function is executed via the external JavaScript
view server.

So the question is: would it be a good idea to write the list function in
Erlang to get a perf boost?
With the added advantage that all that data is not loaded into the client,
and everything gets done on the Couch side.

Rgds,
Aroj

Re: How best to workaround time filtering + aggregation

Posted by Cory Zue <cz...@dimagi.com>.
Hi Aroj,

Welcome to my least favorite thing in couch!

If you anticipate high volumes of data I'd rule out (1). Likewise list
functions (2): I don't think they will help you, unless I misunderstand how
they work. You're still going to have to load all that data into the list
function, which I doubt saves you much overhead as an alternative to (1).

3, 4, and 5 are all solid options. The right answer for you will
likely be a tradeoff between how much you can tailor your use case to
fit in a relatively simple view (5), versus the desire to do more
complex and freeform querying in the future (3, 4). As for the choice
between SQL and Lucene, it just depends on how comfortable you are with
the two technologies. Lucene will likely be less work to set up. SQL
will likely be easier to query.

[plug] If you end up going with option 3 you may find the blog post I
just wrote about almost this exact use case useful:
http://www.dimagi.com/pulling-data-from-couchdb-to-a-relational-database-made-easy-with-_changes/

Cory

--
Cory L. Zue
Dimagi, Inc
http://www.dimagi.com/


