Posted to dev@couchdb.apache.org by Jan Lehnardt <ja...@apache.org> on 2009/01/29 23:42:28 UTC

Statistics Module

Hi list,

Alex Lang and I have been working on a statistics module this week.
We'd like to share our intermediate results and discuss further steps
and acceptance of the module into the CouchDB codebase. The code
lives on GitHub:

     http://github.com/janl/couchdb/tree/stats

A full diff against trunk from a few hours ago lives on Friendpaste:

     http://friendpaste.com/3IfIyRv5EzXEnPyk7S9xqe

Be gentle :)

The driving idea was to make it easy for 3rd party monitoring solutions
to pick up runtime statistics of a single CouchDB node. The stats module
does not collect long-term stats persistently. It does not draw pretty graphs
and it does not come with a pony.

The stats module is rather simple: it consists of two Erlang modules,
a collector and an aggregator. The collector is a `gen_server` holding
an `ets` table that maps keys to integer values. For now
these are simple counters. Any CouchDB module can send a message
to the collector with any key it wants counted. E.g.
`couch_httpd` calls:

   couch_stats_collector:increment({httpd, requests}).

to count the total number of requests made.
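For readers who don't speak Erlang, the collector boils down to a table
mapping keys to integer counts. A rough JavaScript sketch of that behaviour
(the class and method names are illustrative, not the actual module):

```javascript
// Hypothetical sketch of the collector's counter semantics: a table
// mapping stat identifiers to integer counts. The real collector is an
// Erlang gen_server holding an ets table; this only mirrors the idea.
class StatsCollector {
  constructor() {
    this.counters = new Map();
  }
  // Rough analogue of couch_stats_collector:increment({httpd, requests}).
  increment(key) {
    this.counters.set(key, (this.counters.get(key) || 0) + 1);
  }
  // Unknown keys read as zero, like a counter that was never bumped.
  get(key) {
    return this.counters.get(key) || 0;
  }
}
```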

We plan to add support for minimum and maximum values of
different metrics soon.

The aggregator has two jobs: it exposes the raw counter values from
the collector to the HTTP layer (`couch_httpd_stats_handlers.erl`),
and it aggregates counter values into more meaningful numbers like
the average requests per second over the last 60, 300, and 900
seconds (cron-style).
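The windowed averages could work roughly like this: sample each counter
once a second, keep a bounded history, and derive the per-second rate over
a trailing window from the first and last sample in it. A minimal sketch
(illustrative names and structure, not the actual aggregator code):

```javascript
// Sketch of windowed rate aggregation: record one counter sample per
// second, keep at most `capacity` samples, and report the average
// increase per second over a trailing window.
class WindowedRate {
  constructor(capacity = 900) {
    this.capacity = capacity;
    this.samples = []; // oldest first, one counter value per second
  }
  record(counterValue) {
    this.samples.push(counterValue);
    if (this.samples.length > this.capacity) this.samples.shift();
  }
  // Average counter increase per second over the last `windowSeconds`.
  ratePerSecond(windowSeconds) {
    const n = Math.min(windowSeconds + 1, this.samples.length);
    if (n < 2) return 0;
    const recent = this.samples.slice(-n);
    return (recent[n - 1] - recent[0]) / (n - 1);
  }
}
```

Because only the boundary samples matter for the rate, the 60/300/900-second
windows all fall out of the same sample ring.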

Reading values is done through a single API endpoint, `/_stats`. The
canonical format is `/_stats/modulename/statkey`, e.g.:

   /_stats/httpd/requests

will return a JSON string:

   {"httpd": {"requests":3241}}

We'll have a combined view for all stats under `GET /_stats`.
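A monitoring client would poll this endpoint and pick values out of the
returned JSON. A tiny sketch of the client side, using the response shape
shown above (the helper name is hypothetical):

```javascript
// Parse a /_stats/<module>/<key> response body of the shape
// {"httpd": {"requests": 3241}} and pull out the counter value.
function readStat(body, module, key) {
  const stats = JSON.parse(body);
  return stats[module][key];
}
```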


A few notes:

  - Stats URLs are read-only.

  - This is a first iteration; we can add as many fancy stats and metrics
    as we like. We went with the ones that felt most useful to us.

  - The `{module, key}` identifiers for stats are not yet finalized. We
    realized early on that we'd like to have a fair number of stats before
    looking into a good naming scheme. We don't have one yet;
    recommendations are very welcome.

  - We'll add more inline-comments to key parts of the code.

  - We're not yet fully integrated into the build system (see below). eunit
    offers a variety of options to conditionally compile code for testing
    and to move test code to separate files, if that is preferred.

  - The way we calculate averages works, but is a bit clunky and can be
    improved. Also, different kinds of averages can be added.

  - Stats are lost on server shutdown or when restartServer(); is called,
    which the JavaScript test suite does.

  - Some lines exceed the soft limit of 79 chars. We'll address that in
    the future, if required.

  - Basic testing shows no significant slowdowns so far. Proper load
    testing has not been done yet.

Test Driven

  - We took a test-driven approach developing this. Alex is an expert on
    TDD (with high success rates, but that's another story). We added
    an extensive number of tests to `couch_tests.js` that differ in style
    from the existing tests and are more oriented towards proven practices.
    They are also more verbose and try to test only one thing at a time.
    We hope you like the new style and we'd like to nudge the devs into
    adopting it. The infrastructure for the new-style tests is not
    final and we can still change things around, if needed.

  - For things that were hard or impossible to test through the REST
    API we added eunit-style tests to our core modules that ensure their
    functionality. We added a temporary make target `make t` to launch
    our tests. You need Erlang/OTP R12B-5 for that or an existing eunit
    installation. We'd like to work with Noah to integrate that into the
    existing `make test` target (and integrate the existing bits).


Review

We're looking for validation of our architectural approach, as well as
help integrating our `increment` calls into the existing CouchDB codebase.


Legal

We're releasing the source under the Apache 2.0 license. I (Jan) have
a CLA on file, Alex does not. If I remember correctly, my CLA is enough
to get the code into the hands of the ASF. Alex will receive honourable
mentions in our THANKS file. :)

The stats module has been developed under a contract with the BBC.
They need to be able to integrate CouchDB with their existing monitoring
solution and we hope this will get them there. The BBC has no interest
in maintaining the code as a patch and has subsequently allowed us
to release the code as open source under the Apache 2.0 license and
asked if we could offer it to the CouchDB community for potential inclusion.
By ASF rules a proper software grant from the BBC might still be required,
but take that as a given.


Next Steps

- We'd like to hammer out any architectural issues you might find.

- We'll work on a decent naming scheme for stats.

- We'll clean up code, add tests, and comments.

- We'll add more stats and the missing bits we mentioned earlier.

- We might end up adding SNMP support.

- Once in some shape (and with all legal issues cleared), we'd like
   to move the code to an ASF subversion branch and finish integration
   into trunk from there. Hopefully before 0.9.


Thanks for your time,

Cheers
Alex & Jan
--
PS from my PMC-hat: I think the stats module is a valuable addition
to CouchDB and I welcome any effort to integrate it.


Re: Statistics Module

Posted by Ulises <ul...@gmail.com>.
> Code looks awesome. For reference, they're storing the last N samples
> with each time point over a given time period. Ie, they store the
> current counter values once a second.

I haven't looked at the code and might be speaking out of my arse here,
but have you looked at how rrdtool stores samples? It's quite
interesting: it uses constant space and doesn't throw away anything.
To achieve constant space it sacrifices granularity, as in: once the
time window has passed, instead of dropping old samples it averages them
and stores a single data point for the average (or something along
those lines). FWIW it might be worth looking at how rrdtool does it :)

http://oss.oetiker.ch/rrdtool/doc/rrdtool.en.html <-- look at the "How
does rrdtool work?" section.
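The consolidation scheme Ulises describes can be sketched in a few lines:
a fine-grained ring holds recent samples, and each time a batch of them
accumulates, their average is pushed into a coarser ring, so total storage
stays constant. This is illustrative code in the spirit of rrdtool, not
rrdtool itself:

```javascript
// Constant-space sample archive: `fine` keeps the newest samples at
// full resolution; every `batch` samples are averaged into one point
// in `coarse`. Both rings are bounded, so space never grows.
function makeArchive(fineCapacity, batch, coarseCapacity) {
  const fine = [], coarse = [], pending = [];
  return {
    add(sample) {
      fine.push(sample);
      if (fine.length > fineCapacity) fine.shift();
      pending.push(sample);
      if (pending.length === batch) {
        const avg = pending.reduce((a, b) => a + b, 0) / batch;
        coarse.push(avg);
        if (coarse.length > coarseCapacity) coarse.shift();
        pending.length = 0;
      }
    },
    fine: () => fine.slice(),
    coarse: () => coarse.slice(),
  };
}
```

Old data is never fully thrown away; it just survives at lower resolution.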

U

Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
On 31 Jan 2009, at 02:12, Paul Davis wrote:
> I find fault with the logic that long names in any way lead to
> success. The two things that this style does offer are the focus on
> small independent tests and a hierarchical categorization of tests.
>
> Just pulling out the first section:
>
>  stats: function(debug) {
>    var open_databases_tests = {
>      'should increment the number of open databases when creating a
> db': function...
>      'should increment the number of open databases when opening a
> db': function...
>      'should decrement the number of open databases when deleting':  
> function...
>      'should keep the same number of open databases when reaching the
> max_dbs_open limit': function...
>      'should return 0 for number of open databases after call to
> restartServer()': function....
>
> Contrast that with:
>
> stats: {
>    database: {
>        create_inc: function() .....
>        open_inc: function() ...
>        delete_dec: function() ...
>        max_open: function() ...
>        reset: function() ....
>    }
> }

Sorry, but if you read "test 'stats -> database -> create_inc' failed",
you don't know what is going on and have to dig into the test code. If
you read "test 'should increment open_databases counter' failed", you do.
The idea is to have natural language identifiers here.


> Everything in your source code should be trying to convey meaning as
> quickly and efficiently as possible. All these do is add redundant
> information and make the tests a pain to read.

"A pain" as in "for a human"; your proposal is surely easy to parse for a
computer (just teasing here :-).

I think we kicked off at the extreme end of a spectrum and you're
advocating the other one. I'm sure we'll meet on middle ground :)


Cheers
Jan
--




Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Dirk,

thanks for your comments.

On 31 Jan 2009, at 12:40, Dirk-Willem van Gulik wrote:

> Paul Davis wrote:
>
>> >> The way that stats are calculated currently with the dependent
>> >> variable being time could cause some issues in implementing more
>> >> statistics. With my extremely limited knowledge of stats I think
>> >> moving that to be dependent on the number of requests might be  
>> better.
>
>> Code looks awesome. For reference, they're storing the last N samples
>> with each time point over a given time period. Ie, they store the
>> current counter values once a second.
>
> Eh - do you really want to build RRDtool/cacti/nagios/zenoss/mrtg  
> like capabilities in your database ?

No.


> I'd say - focus on getting the data out (ideally as _counters_) and  
> let the monitoring tools figure out the rest. They know their  
> polling intervals and what not (some dynamically adjust) and can  
> figure out how to sample, calculate rates, do stats and what not.  
> And can have fairly refined windowing techniques to aggregate (e.g.  
> rrdtool).
>
> So do not try to make a stats module - try to make a module that  
> outputs the data on which you can base your stats :).
>
> I.e. when you can have:
>
> 	database_opened		counter
> 	database_closed		counter
>
> you can work out 1) rate of open/close, 2) the number currently open  
> and do all sorts of post processing. Whereas a float giving you some  
> rate over some unknown window is not nearly as useful.

We have pure counters and raw data for all the tools. After all,
this was built to your spec, which asked for some aggregate
values, so we put them in. The aggregates are well defined: 900
second window, 1 second resolution. We don't plan to make
this any fancier, just add more counters.

Cheers
Jan
--


Re: Statistics Module

Posted by Dirk-Willem van Gulik <Di...@bbc.co.uk>.
Paul Davis wrote:

>  >> The way that stats are calculated currently with the dependent
>  >> variable being time could cause some issues in implementing more
>  >> statistics. With my extremely limited knowledge of stats I think
>  >> moving that to be dependent on the number of requests might be better.

> Code looks awesome. For reference, they're storing the last N samples
> with each time point over a given time period. Ie, they store the
> current counter values once a second.

Eh - do you really want to build RRDtool/cacti/nagios/zenoss/mrtg like 
capabilities in your database ?

I'd say - focus on getting the data out (ideally as _counters_) and let 
the monitoring tools figure out the rest. They know their polling 
intervals and what not (some dynamically adjust) and can figure out how 
to sample, calculate rates, do stats and what not. And can have fairly 
refined windowing techniques to aggregate (e.g. rrdtool).

So do not try to make a stats module - try to make a module that outputs 
the data on which you can base your stats :).

I.e. when you can have:

	database_opened		counter
	database_closed		counter

you can work out 1) rate of open/close, 2) the number currently open and 
do all sorts of post processing. Whereas a float giving you some rate 
over some unknown window is not nearly as useful.
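Dirk's point in code form: from two monotonic counters sampled by the
monitoring tool at a known interval, both the rates and the current open
count fall out. A sketch using his counter names (the helper itself is
hypothetical):

```javascript
// Given two samples of the monotonic counters Dirk describes, derive
// 1) the open/close rates over the sampling interval and
// 2) the number of databases currently open.
function derive(prev, curr, intervalSeconds) {
  return {
    openRate: (curr.database_opened - prev.database_opened) / intervalSeconds,
    closeRate: (curr.database_closed - prev.database_closed) / intervalSeconds,
    currentlyOpen: curr.database_opened - curr.database_closed,
  };
}
```

The window is whatever the poller chooses, which is exactly why raw
counters compose better than a pre-baked rate over an unknown window.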

Dw


Re: Statistics Module

Posted by Chris Anderson <jc...@apache.org>.
On Fri, Jan 30, 2009 at 5:12 PM, Paul Davis <pa...@gmail.com> wrote:
>
> Everything in your source code should be trying to convey meaning as
> quickly and efficiently as possible. All these do is add redundant
> information and make the tests a pain to read.
>
> I definitely agree that we could spend some time organizing our test
> suite into smaller files. Perhaps one file per current test, and then
> split up some of the bigger tests into smaller bits. eunit tests for
> the internals is also made of win. I am in no way against stealing
> good ideas, but long names just isn't one of them.

Heh. I kinda have to agree with you here, although I think a little
balance can be found:

stats: {
   database: {
       "increment create": function() .....
       "increment open": function() ...
       "decrement closed": function() ...
       "this seems to be a test of the max-open functionality that
uses stats as a test tool. ftw, but let's move it to another section":
function() ...
       reset: function() ....
   }
}

-- 
Chris Anderson
http://jchris.mfdz.com

Re: Statistics Module

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Jan 30, 2009 at 2:42 PM, Jan Lehnardt <ja...@apache.org> wrote:
> Hi Paul,
>
> thanks for your feedback.
>
> On 30 Jan 2009, at 00:26, Paul Davis wrote:
>>
>> Two concerns:
>>
>> The way that stats are calculated currently with the dependent
>> variable being time could cause some issues in implementing more
>> statistics. With my extremely limited knowledge of stats I think
>> moving that to be dependent on the number of requests might be better.
>> This is something that hopefully someone out there knows more about.
> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
> requests", the latter of the two making stddev-type stats
> calculable on the fly in constant memory.)
>
> I think you are going to like the new aggregator :)
>

Code looks awesome. For reference, they're storing the last N samples
with each time point over a given time period. Ie, they store the
current counter values once a second.
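The constant-memory stddev Paul alludes to is possible with online update
formulas such as Welford's: mean and variance are maintained one
observation at a time, without storing the samples. A generic sketch, not
CouchDB code:

```javascript
// Welford's online algorithm: running mean and (sample) variance in
// O(1) memory, updated per observation.
function makeRunningStats() {
  let n = 0, mean = 0, m2 = 0; // m2 = sum of squared deviations
  return {
    push(x) {
      n += 1;
      const delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean);
    },
    mean: () => mean,
    variance: () => (n > 1 ? m2 / (n - 1) : 0),
  };
}
```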

>
>> I'm sorry, but i gotta say: 'should keep the same number of open
>> databases when reaching the max_dbs_open limit' is a really fucking
>> verbose test name. I'm all for making our tests more granular, etc.
>> but those names are just incomprehensible. Perhaps we can abandon the
>> novellas with some better categorization semantics?
>
> The idea here is natural language product description that makes clear
> what is happening without you having to look at the test code which
> can be non-obvious. If a test fails, you get a neat error message and
> you are ready to go.
>
> Granted, the two non-native speakers at work here might not have
> picked the most concise wording, but that can be improved without
> changing the general style. http://rspec.info/ has more information
> about this style. I've seen Alex having great successes with this style
> of BDD/TDD driven development and I am happy he pushed me into
> adapting it for this.
>
> One thing to think about is that behaviour specification is not unit
> testing, but that's not what the JS suite has been doing anyway. Maybe
> we can split things up.
>

I find fault with the logic that long names in any way lead to
success. The two things that this style does offer are the focus on
small independent tests and a hierarchical categorization of tests.

Just pulling out the first section:

  stats: function(debug) {
    var open_databases_tests = {
      'should increment the number of open databases when creating a
db': function...
      'should increment the number of open databases when opening a
db': function...
      'should decrement the number of open databases when deleting': function...
      'should keep the same number of open databases when reaching the
max_dbs_open limit': function...
      'should return 0 for number of open databases after call to
restartServer()': function....

Contrast that with:

stats: {
    database: {
        create_inc: function() .....
        open_inc: function() ...
        delete_dec: function() ...
        max_open: function() ...
        reset: function() ....
    }
}

Everything in your source code should be trying to convey meaning as
quickly and efficiently as possible. All these do is add redundant
information and make the tests a pain to read.

I definitely agree that we could spend some time organizing our test
suite into smaller files. Perhaps one file per current test, and then
split up some of the bigger tests into smaller bits. eunit tests for
the internals is also made of win. I am in no way against stealing
good ideas, but long names just isn't one of them.

HTH,
Paul Davis


Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Paul,

thanks for your feedback.

On 30 Jan 2009, at 00:26, Paul Davis wrote:
> Two concerns:
>
> The way that stats are calculated currently with the dependent
> variable being time could cause some issues in implementing more
> statistics. With my extremely limited knowledge of stats I think
> moving that to be dependent on the number of requests might be better.
> This is something that hopefully someone out there knows more about.
> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
> requests", (the later of the two making stddev type stats
> calculateable on the fly in constant memory.)

I think you are going to like the new aggregator :)


> I'm sorry, but i gotta say: 'should keep the same number of open
> databases when reaching the max_dbs_open limit' is a really fucking
> verbose test name. I'm all for making our tests more granular, etc.
> but those names are just incomprehensible. Perhaps we can abandon the
> novellas with some better categorization semantics?

The idea here is natural language product description that makes clear
what is happening without you having to look at the test code which
can be non-obvious. If a test fails, you get a neat error message and
you are ready to go.

Granted, the two non-native speakers at work here might not have
picked the most concise wording, but that can be improved without
changing the general style. http://rspec.info/ has more information
about this style. I've seen Alex having great successes with this style
of BDD/TDD driven development and I am happy he pushed me into
adapting it for this.

One thing to think about is that behaviour specification is not unit
testing, but that's not what the JS suite has been doing anyway. Maybe
we can split things up.

Cheers
Jan
--
Also: Fucking.

> On Thu, Jan 29, 2009 at 5:42 PM, Jan Lehnardt <ja...@apache.org> wrote:
>> Hi list,
>>
>> Alex Lang and I have been working on a statistics module this week.
>> We'd like to share our intermediate results and discuss further steps
>> and acceptance of the module into the CouchDB codebase. The code
>> lives on GitHub:
>>
>>   http://github.com/janl/couchdb/tree/stats
>>
>> A full diff against trunk from a few hours ago lives on Friendpaste:
>>
>>   http://friendpaste.com/3IfIyRv5EzXEnPyk7S9xqe
>>
>> Be gentle :)
>>
>> The driving idea was to make it easy for 3rd party monitoring  
>> solutions
>> to pick up runtime statistics of a single CouchDB node. The stats  
>> module
>> does not collect long-term stats persistently. It does not draw  
>> pretty
>> graphs
>> and it does not come with a pony.
>>
>> The stats module is rather simple, it consists of two Erlang modules,
>> a collector and an aggregator. The collector is a `gen_server`  
>> holding
>> an `ets` table that holds integer values associated with keys. For
>> now
>> these are simple counters. All CouchDB modules can send a message
>> to the collector with any key they want to get a metric counted. E.g.
>> `couch_httpd`:
>>
>> couch_stats_collector:increment({httpd, requests}).
>>
>> to count the total number of requests made.
>>
>> We plan to add support for minimum and maximum values of
>> different metrics soon.
>>
>> The aggregator has two jobs: It exposes the raw counter values from
>> the collector to the HTTP plugin (`couch_httpd_stats_handlers.erl`)
>> and to aggregate counter values to more meaningful numbers like
>> average requests per second over 60 seconds, 300 seconds, 900
>> seconds (cron-style).
>>
>> Reading values is done through a single API endpoint `/_stats`. The
>> canonical format is `/_stats/modulename/statkey` e.g.:
>>
>> /_stats/httpd/requests
>>
>> will return a JSON string:
>>
>> {"httpd": {"requests":3241}}
>>
>> We'll have a combined view for all stats under `GET /_stats`.
>>
>>
>> A few notes:
>>
>> - Stats URLs are read-only.
>>
>> - This is a first iteration, we can add as many fancy stats and  
>> metrics as
>>  we like, we went with the ones that felt most useful for us.
>>
>> - The `{module, key}` identifiers for stats are not yet finalized, we
>> realized
>>   early on that we'd like to have a fair number of stats before  
>> looking
>>   into a good naming scheme. We don't have a good naming scheme
>>   yet, recommendations are very welcome.
>>
>> - We'll add more inline-comments to key parts of the code.
>>
>> - We're not yet fully integrated into the build system (see below).  
>> eunit
>>  offers a variety of options to conditionally compile code for  
>> testing and
>>  moving test code to separate files if that is preferred.
>>
>> - The way we calculate averages works, but is a bit clunky and can be
>>  improved. Also, different kinds of averages can be added.
>>
>> - Stats are lost on server shutdown or when restartServer(); is  
>> called, which
>>  the JavaScript test suite does.
>>
>> - Some lines exceed the soft limit of 79 chars. We'll address that  
>> in the
>>  future, if required.
>>
>> - Basic testing shows no significant slowdowns so far. Proper load  
>> testing
>>  has not been done yet.
>>
>> Test Driven
>>
>> - We took a test driven approach developing this. Alex is an expert  
>> on
>>  TDD (with high success rates, but that's another story). We added
>>  an extensive number of tests to `couch_tests.js` that differ in  
>> style
>>  from the existing tests and are more oriented towards proven  
>> practices.
>>  They are also more verbose and try to test only one thing at a time.
>>  We hope you like the new style and we'd like to nudge the devs into
>>  taking over our style. The infrastructure for the new-style tests  
>> is not
>>  final and we can still change things around, if needed.
>>
>> - For things that were hard or impossible to test through the REST
>>  API we added eunit-style tests to our core modules that ensure their
>>  functionality. We added a temporary make target `make t` to launch
>>  our tests. You need Erlang/OTP R12B-5 for that or an existing eunit
>>  installation. We'd like to work with Noah to integrate that into the
>>  existing `make test` target (and integrate the existing bits).
>>
>>
>> Review
>>
>> We're looking for a validation of our architectural approach as  
>> well as
>> integrating our `increment` calls into the existing CouchDB codebase.
>>
>>
>> Legal
>>
>> We're releasing the source under the Apache 2.0 license. I (Jan) have
>> a CLA on file, Alex does not. If I remember correctly, my CLA is  
>> enough
>> to get the code into the hands of the ASF. Alex will receive  
>> honourable
>> mentions in our THANKS file. :)
>>
>> The stats module has been developed under a contract with the BBC.
>> They need to be able to integrate CouchDB with their existing  
>> monitoring
>> solution and we hope this will get them there. The BBC has no  
>> interest
>> in maintaining the code as a patch and has subsequently allowed us
>> to release the code as open source under the Apache 2.0 license and
>> asked if we could offer it to the CouchDB community for potential  
>> inclusion.
>> By ASF law there still might have to occur a proper software
>> grant by
>> the BBC, but take that as a given.
>>
>>
>> Next Steps
>>
>> - We'd like to hammer out any architectural issues you might find.
>>
>> - We'll work on a decent naming scheme for stats.
>>
>> - We'll clean up code, add tests, and comments.
>>
>> - We'll add more stats and the missing bits we mentioned earlier.
>>
>> - We might end up adding SNMP support.
>>
>> - Once in some shape (and with all legal issues cleared), we'd like
>> to move the code to an ASF subversion branch and finish integration
>> into trunk from there. Hopefully before 0.9.
>>
>>
>> Thanks for your time,
>>
>> Cheers
>> Alex & Jan
>> --
>> PS from my PMC-hat: I think the stats module is a valuable addition
>> to CouchDB and I welcome any effort to integrate it.
>>
>>
>


Re: Statistics Module

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Jan 30, 2009 at 2:10 AM, Antony Blakey <an...@gmail.com> wrote:
>
> On 30/01/2009, at 5:32 PM, Paul Davis wrote:
>
>> On Fri, Jan 30, 2009 at 1:58 AM, Antony Blakey <an...@gmail.com>
>> wrote:
>>>
>>> On 30/01/2009, at 4:27 PM, Paul Davis wrote:
>>>
>>>> On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey
>>>> <an...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>>>>>
>>>>>> The way that stats are calculated currently with the dependent
>>>>>> variable being time could cause some issues in implementing more
>>>>>> statistics. With my extremely limited knowledge of stats I think
>>>>>> moving that to be dependent on the number of requests might be better.
>>>>>> This is something that hopefully someone out there knows more about.
>>>>>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>>>>>> requests", the latter making stddev-type stats calculable on
>>>>>> the fly in constant memory.)
>>>>>
>>>>> The problem with using # of requests is that depending on your data,
>>>>> each
>>>>> request may take a long time. I have this problem at the moment: 1008
>>>>> documents in a 3.5G media database. During a compact, the status in
>>>>> _active_tasks updates every 1000 documents, so you can imagine how
>>>>> useful
>>>>> that is :/ I thought it had hung (and neither the beam.smp CPU time nor
>>>>> the
>>>>> IO requests were a good indicator). I spent some time chasing this down
>>>>> as a
>>>>> bug before realising the problem was in the status granularity!
>>>>>
>>>>
>>>> Actually I don't think that affects my question at all. It may change
>>>> how we report things though. As in, it may be important to be able to
>>>> report things that are not single increment/decrement conditions but
>>>> instead allow for adding arbitrary floating point numbers to the
>>>> number of recorded data points.
>>>
>>> I think I have the wrong end of the stick here - my problem was with the
>>> granularity of updates, not with the basis of calculation.
>>>
>>
>> Heh. Well, we can only measure what we know. And in the interest of
>> simplicity I think the granularity is gonna have to stick to pretty
>> much per request. Also, you're flying with 300 MiB docs? Perhaps it's
>> time to chop or store in FTP?
>
> No, lots of attachments per doc. I need them to replicate. 3.5G / 1000 docs
> = roughly 3.5 MB attachments per doc. Not unreasonable.
>

What an appropriate thread to have made a math error. Also, yes, not
at all unreasonable.

> Antony Blakey
> --------------------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> Plurality is not to be assumed without necessity
>  -- William of Ockham (ca. 1285-1349)
>
>
>

Re: Statistics Module

Posted by Antony Blakey <an...@gmail.com>.
On 30/01/2009, at 5:32 PM, Paul Davis wrote:

> On Fri, Jan 30, 2009 at 1:58 AM, Antony Blakey <antony.blakey@gmail.com 
> > wrote:
>>
>> On 30/01/2009, at 4:27 PM, Paul Davis wrote:
>>
>>> On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey <antony.blakey@gmail.com 
>>> >
>>> wrote:
>>>>
>>>> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>>>>
>>>>> The way that stats are calculated currently with the dependent
>>>>> variable being time could cause some issues in implementing more
>>>>> statistics. With my extremely limited knowledge of stats I think
>>>>> moving that to be dependent on the number of requests might be  
>>>>> better.
>>>>> This is something that hopefully someone out there knows more  
>>>>> about.
>>>>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>>>>> requests", the latter making stddev-type stats calculable on
>>>>> the fly in constant memory.)
>>>>
>>>> The problem with using # of requests is that depending on your  
>>>> data, each
>>>> request may take a long time. I have this problem at the moment:  
>>>> 1008
>>>> documents in a 3.5G media database. During a compact, the status in
>>>> _active_tasks updates every 1000 documents, so you can imagine  
>>>> how useful
>>>> that is :/ I thought it had hung (and neither the beam.smp CPU  
>>>> time nor
>>>> the
>>>> IO requests were a good indicator). I spent some time chasing  
>>>> this down
>>>> as a
>>>> bug before realising the problem was in the status granularity!
>>>>
>>>
>>> Actually I don't think that affects my question at all. It may  
>>> change
>>> how we report things though. As in, it may be important to be able  
>>> to
>>> report things that are not single increment/decrement conditions but
>>> instead allow for adding arbitrary floating point numbers to the
>>> number of recorded data points.
>>
>> I think I have the wrong end of the stick here - my problem was  
>> with the
>> granularity of updates, not with the basis of calculation.
>>
>
> Heh. Well, we can only measure what we know. And in the interest of
> simplicity I think the granularity is gonna have to stick to pretty
> much per request. Also, you're flying with 300 MiB docs? Perhaps it's
> time to chop or store in FTP?

No, lots of attachments per doc. I need them to replicate. 3.5G / 1000  
docs = roughly 3.5 MB attachments per doc. Not unreasonable.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Plurality is not to be assumed without necessity
   -- William of Ockham (ca. 1285-1349)



Re: Statistics Module

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Jan 30, 2009 at 1:58 AM, Antony Blakey <an...@gmail.com> wrote:
>
> On 30/01/2009, at 4:27 PM, Paul Davis wrote:
>
>> On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey <an...@gmail.com>
>> wrote:
>>>
>>> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>>>
>>>> The way that stats are calculated currently with the dependent
>>>> variable being time could cause some issues in implementing more
>>>> statistics. With my extremely limited knowledge of stats I think
>>>> moving that to be dependent on the number of requests might be better.
>>>> This is something that hopefully someone out there knows more about.
>>>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>>>> requests", the latter making stddev-type stats calculable on
>>>> the fly in constant memory.)
>>>
>>> The problem with using # of requests is that depending on your data, each
>>> request may take a long time. I have this problem at the moment: 1008
>>> documents in a 3.5G media database. During a compact, the status in
>>> _active_tasks updates every 1000 documents, so you can imagine how useful
>>> that is :/ I thought it had hung (and neither the beam.smp CPU time nor
>>> the
>>> IO requests were a good indicator). I spent some time chasing this down
>>> as a
>>> bug before realising the problem was in the status granularity!
>>>
>>
>> Actually I don't think that affects my question at all. It may change
>> how we report things though. As in, it may be important to be able to
>> report things that are not single increment/decrement conditions but
>> instead allow for adding arbitrary floating point numbers to the
>> number of recorded data points.
>
> I think I have the wrong end of the stick here - my problem was with the
> granularity of updates, not with the basis of calculation.
>

Heh. Well, we can only measure what we know. And in the interest of
simplicity I think the granularity is gonna have to stick to pretty
much per request. Also, you're flying with 300 MiB docs? Perhaps it's
time to chop or store in FTP?

> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> It is no measure of health to be well adjusted to a profoundly sick society.
>  -- Jiddu Krishnamurti
>
>
>

Re: Statistics Module

Posted by Antony Blakey <an...@gmail.com>.
On 30/01/2009, at 4:27 PM, Paul Davis wrote:

> On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey <antony.blakey@gmail.com 
> > wrote:
>>
>> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>>
>>> The way that stats are calculated currently with the dependent
>>> variable being time could cause some issues in implementing more
>>> statistics. With my extremely limited knowledge of stats I think
>>> moving that to be dependent on the number of requests might be  
>>> better.
>>> This is something that hopefully someone out there knows more about.
>>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>>> requests", the latter making stddev-type stats calculable on
>>> the fly in constant memory.)
>>
>> The problem with using # of requests is that depending on your  
>> data, each
>> request may take a long time. I have this problem at the moment: 1008
>> documents in a 3.5G media database. During a compact, the status in
>> _active_tasks updates every 1000 documents, so you can imagine how  
>> useful
>> that is :/ I thought it had hung (and neither the beam.smp CPU time  
>> nor the
>> IO requests were a good indicator). I spent some time chasing this  
>> down as a
>> bug before realising the problem was in the status granularity!
>>
>
> Actually I don't think that affects my question at all. It may change
> how we report things though. As in, it may be important to be able to
> report things that are not single increment/decrement conditions but
> instead allow for adding arbitrary floating point numbers to the
> number of recorded data points.

I think I have the wrong end of the stick here - my problem was with  
the granularity of updates, not with the basis of calculation.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

It is no measure of health to be well adjusted to a profoundly sick  
society.
   -- Jiddu Krishnamurti



Re: Statistics Module

Posted by Paul Davis <pa...@gmail.com>.
On Fri, Jan 30, 2009 at 12:32 AM, Antony Blakey <an...@gmail.com> wrote:
>
> On 30/01/2009, at 9:56 AM, Paul Davis wrote:
>
>> The way that stats are calculated currently with the dependent
>> variable being time could cause some issues in implementing more
>> statistics. With my extremely limited knowledge of stats I think
>> moving that to be dependent on the number of requests might be better.
>> This is something that hopefully someone out there knows more about.
>> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
>> requests", the latter making stddev-type stats calculable on
>> the fly in constant memory.)
>
> The problem with using # of requests is that depending on your data, each
> request may take a long time. I have this problem at the moment: 1008
> documents in a 3.5G media database. During a compact, the status in
> _active_tasks updates every 1000 documents, so you can imagine how useful
> that is :/ I thought it had hung (and neither the beam.smp CPU time nor the
> IO requests were a good indicator). I spent some time chasing this down as a
> bug before realising the problem was in the status granularity!
>

Actually I don't think that affects my question at all. It may change
how we report things though. As in, it may be important to be able to
report things that are not single increment/decrement conditions but
instead allow for adding arbitrary floating point numbers to the
number of recorded data points.

IMO, your specific use case only strengthens my argument for having
requests (or, more specifically, data points) be the dependent
variable.

To explain more clearly the case is this: if we don't treat the
collected data points as the dependent variable we are unable to
calculate extended statistics like variance/stddev etc. This is
because if the dependent variable is time, then the number of data
points is unbounded. If that's the case we have unbounded memory
usage. (because I know of no incremental algorithms for calculating
these statistics without knowledge of past values, I could be wrong)

In other words, if we're doing stats for N values, when we store value
number N+1, we must know value 0 so it can be removed from the
calculations. If the dependent variable is time, then N can be
arbitrarily large, thus causing memory problems.
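
As a concrete sketch of this bounded-window scheme (Python purely for
illustration; `RollingStats` and every name here are invented, not part
of the proposed stats module):

```python
import math
from collections import deque

class RollingStats:
    """Mean and stddev over the last `size` data points, O(1) per update.

    Illustrative sketch only; it keeps exactly `size` past values so that
    value 0 can be subtracted out when value N+1 arrives.
    """
    def __init__(self, size=100):
        self.size = size
        self.window = deque()
        self.total = 0.0      # running sum of values in the window
        self.total_sq = 0.0   # running sum of squared values

    def record(self, value):
        self.window.append(value)
        self.total += value
        self.total_sq += value * value
        if len(self.window) > self.size:
            old = self.window.popleft()  # the oldest value drops out
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        return self.total / len(self.window)

    def stddev(self):
        n = len(self.window)
        m = self.total / n
        return math.sqrt(max(self.total_sq / n - m * m, 0.0))

stats = RollingStats(size=100)
for ms in range(1, 201):   # pretend per-request values
    stats.record(float(ms))
print(stats.mean())        # mean of the last 100 values: 150.5
```

Memory stays bounded by the window size, which is exactly the trade-off
described above: you keep the last N values so the oldest can be removed.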

Your use case just changes each data point from being an inc/dec op to
a "store arbitrary number" op. In any case, I'm not at all comfortable
relying on solely my knowledge of calculating statistics in an
incremental fashion so hopefully there's a stats buff out there who
will feel compelled to weigh in.

HTH,
Paul Davis

P.S. For those of you wondering why standard deviation is important, I
reference the ever so eloquent Zed Shaw [1] : "Programmers Need To
Learn Statistics Or I Will Kill Them All." Also, he is right.

[1] http://www.zedshaw.com/rants/programmer_stats.html


> Antony Blakey
> -------------
> CTO, Linkuistics Pty Ltd
> Ph: 0438 840 787
>
> The ultimate measure of a man is not where he stands in moments of comfort
> and convenience, but where he stands at times of challenge and controversy.
>  -- Martin Luther King
>
>
>

Re: Statistics Module

Posted by Antony Blakey <an...@gmail.com>.
On 31/01/2009, at 6:14 AM, Jan Lehnardt wrote:

> Hi Antony,
>
> On 30 Jan 2009, at 06:32, Antony Blakey wrote:
>
>> [...] I spent some time chasing this down as a bug before realising  
>> the problem was in the status granularity!
>
> Out of context, but hey. We have one-second-granularity
> now and we can even make it finer, if needed. Or a config
> option.

After waiting for 5 minutes to see a status update, 1 second will be  
fantastic!

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

I contend that we are both atheists. I just believe in one fewer god  
than you do. When you understand why you dismiss all the other  
possible gods, you will understand why I dismiss yours.
   --Stephen F Roberts



Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
Hi Antony,

On 30 Jan 2009, at 06:32, Antony Blakey wrote:

> [...] I spent some time chasing this down as a bug before realising  
> the problem was in the status granularity!

Out of context, but hey. We have one-second-granularity
now and we can even make it finer, if needed. Or a config
option.

Cheers
Jan
--

Re: Statistics Module

Posted by Antony Blakey <an...@gmail.com>.
On 30/01/2009, at 9:56 AM, Paul Davis wrote:

> The way that stats are calculated currently with the dependent
> variable being time could cause some issues in implementing more
> statistics. With my extremely limited knowledge of stats I think
> moving that to be dependent on the number of requests might be better.
> This is something that hopefully someone out there knows more about.
> (This is in terms of "avg for last 5 minutes" vs "avg for last 100
> requests", the latter making stddev-type stats calculable on the
> fly in constant memory.)

The problem with using # of requests is that depending on your data,  
each request may take a long time. I have this problem at the moment:  
1008 documents in a 3.5G media database. During a compact, the status  
in _active_tasks updates every 1000 documents, so you can imagine how  
useful that is :/ I thought it had hung (and neither the beam.smp CPU  
time nor the IO requests were a good indicator). I spent some time  
chasing this down as a bug before realising the problem was in the  
status granularity!

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

The ultimate measure of a man is not where he stands in moments of  
comfort and convenience, but where he stands at times of challenge and  
controversy.
   -- Martin Luther King



Re: Statistics Module

Posted by Paul Davis <pa...@gmail.com>.
Jan & Alex,

Awesome work. I'm already having visions of a shiny new page in Futon.

Two concerns:

The way that stats are calculated currently with the dependent
variable being time could cause some issues in implementing more
statistics. With my extremely limited knowledge of stats I think
moving that to be dependent on the number of requests might be better.
This is something that hopefully someone out there knows more about.
(This is in terms of "avg for last 5 minutes" vs "avg for last 100
requests", the latter making stddev-type stats calculable on the fly
in constant memory.)

I'm sorry, but i gotta say: 'should keep the same number of open
databases when reaching the max_dbs_open limit' is a really fucking
verbose test name. I'm all for making our tests more granular, etc.
but those names are just incomprehensible. Perhaps we can abandon the
novellas with some better categorization semantics?

HTH,
Paul Davis

On Thu, Jan 29, 2009 at 5:42 PM, Jan Lehnardt <ja...@apache.org> wrote:
> Hi list,
>
> Alex Lang and I have been working on a statistics module this week.
> We'd like to share our intermediate results and discuss further steps
> and acceptance of the module into the CouchDB codebase. The code
> lives on GitHub:
>
>    http://github.com/janl/couchdb/tree/stats
>
> A full diff against trunk from a few hours ago lives on Friendpaste:
>
>    http://friendpaste.com/3IfIyRv5EzXEnPyk7S9xqe
>
> Be gentle :)
>
> The driving idea was to make it easy for 3rd party monitoring solutions
> to pick up runtime statistics of a single CouchDB node. The stats module
> does not collect long-term stats persistently. It does not draw pretty
> graphs
> and it does not come with a pony.
>
> The stats module is rather simple, it consists of two Erlang modules,
> a collector and an aggregator. The collector is a `gen_server` holding
> an `ets` table that holds integer values associated with keys. For now
> these are simple counters. All CouchDB modules can send a message
> to the collector with any key they want to get a metric counted. E.g.
> `couch_httpd`:
>
>  couch_stats_collector:increment({httpd, requests}).
>
> to count the total number of requests made.
>
> We plan to add support for minimum and maximum values of
> different metrics soon.
>
> The aggregator has two jobs: It exposes the raw counter values from
> the collector to the HTTP plugin (`couch_httpd_stats_handlers.erl`)
> and to aggregate counter values to more meaningful numbers like
> average requests per second over 60 seconds, 300 seconds, 900
> seconds (cron-style).
>
> Reading values is done through a single API endpoint `/_stats`. The
> canonical format is `/_stats/modulename/statkey` e.g.:
>
>  /_stats/httpd/requests
>
> will return a JSON string:
>
>  {"httpd": {"requests":3241}}
>
> We'll have a combined view for all stats under `GET /_stats`.
>
>
> A few notes:
>
>  - Stats URLs are read-only.
>
>  - This is a first iteration, we can add as many fancy stats and metrics as
>   we like, we went with the ones that felt most useful for us.
>
>  - The `{module, key}` identifiers for stats are not yet finalized, we
> realized
>    early on that we'd like to have a fair number of stats before looking
>    into a good naming scheme. We don't have a good naming scheme
>    yet, recommendations are very welcome.
>
>  - We'll add more inline-comments to key parts of the code.
>
>  - We're not yet fully integrated into the build system (see below). eunit
>   offers a variety of options to conditionally compile code for testing and
>   moving test code to separate files if that is preferred.
>
>  - The way we calculate averages works, but is a bit clunky and can be
>   improved. Also, different kinds of averages can be added.
>
>  - Stats are lost on server shutdown or when restartServer(); is called, which
>   the JavaScript test suite does.
>
>  - Some lines exceed the soft limit of 79 chars. We'll address that in the
>   future, if required.
>
>  - Basic testing shows no significant slowdowns so far. Proper load testing
>   has not been done yet.
>
> Test Driven
>
>  - We took a test driven approach developing this. Alex is an expert on
>   TDD (with high success rates, but that's another story). We added
>   an extensive number of tests to `couch_tests.js` that differ in style
>   from the existing tests and are more oriented towards proven practices.
>   They are also more verbose and try to test only one thing at a time.
>   We hope you like the new style and we'd like to nudge the devs into
>   taking over our style. The infrastructure for the new-style tests is not
>   final and we can still change things around, if needed.
>
>  - For things that were hard or impossible to test through the REST
>   API we added eunit-style tests to our core modules that ensure their
>   functionality. We added a temporary make target `make t` to launch
>   our tests. You need Erlang/OTP R12B-5 for that or an existing eunit
>   installation. We'd like to work with Noah to integrate that into the
>   existing `make test` target (and integrate the existing bits).
>
>
> Review
>
> We're looking for a validation of our architectural approach as well as
> integrating our `increment` calls into the existing CouchDB codebase.
>
>
> Legal
>
> We're releasing the source under the Apache 2.0 license. I (Jan) have
> a CLA on file, Alex does not. If I remember correctly, my CLA is enough
> to get the code into the hands of the ASF. Alex will receive honourable
> mentions in our THANKS file. :)
>
> The stats module has been developed under a contract with the BBC.
> They need to be able to integrate CouchDB with their existing monitoring
> solution and we hope this will get them there. The BBC has no interest
> in maintaining the code as a patch and has subsequently allowed us
> to release the code as open source under the Apache 2.0 license and
> asked if we could offer it to the CouchDB community for potential inclusion.
> By ASF law there still might have to occur a proper software grant by
> the BBC, but take that as a given.
>
>
> Next Steps
>
> - We'd like to hammer out any architectural issues you might find.
>
> - We'll work on a decent naming scheme for stats.
>
> - We'll clean up code, add tests, and comments.
>
> - We'll add more stats and the missing bits we mentioned earlier.
>
> - We might end up adding SNMP support.
>
> - Once in some shape (and with all legal issues cleared), we'd like
>  to move the code to an ASF subversion branch and finish integration
>  into trunk from there. Hopefully before 0.9.
>
>
> Thanks for your time,
>
> Cheers
> Alex & Jan
> --
> PS from my PMC-hat: I think the stats module is a valuable addition
> to CouchDB and I welcome any effort to integrate it.
>
>

Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
On 29 Jan 2009, at 23:42, Jan Lehnardt wrote:

> Hi list,
>
> Alex Lang and I have been working on a statistics module this week.
> We'd like to share our intermediate results and discuss further steps
> and acceptance of the module into the CouchDB codebase. The code
> lives on GitHub:
>
>    http://github.com/janl/couchdb/tree/stats

Not sure why I'm full of fail when it comes to Git, but here you have
the latest branch with some new stuff:

	http://github.com/janl/couchdb/tree/old-stats-new

News:

  - Added GET /_stats view (thanks Paul Davis and Martin Scholl for help).
  - Added Martin Scholl's implementation of less complex average & stddev
    calculation.
  - Add more tests, heh.
  - Messed up stats-new branch.


Todo:

  - Integration with build process.
  - Conditionally compile eunit source on `make test`, not for regular
    `make`.
  - Merge with old `make test` code.
  - Add absolute value counter for things like request time.
    - In the process we'll remove the little clunky DifferencesList stuff
      and make all aggregators use absolute values.
  - Add float precision to aggregate values.
  - Move to ASF subversion branch.
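
A sketch of what the "absolute value counter" item above might look like
(Python purely for illustration; every name here is invented, not the
module's API):

```python
class AbsoluteValueCounter:
    """Record absolute samples (e.g. request time in ms) per key and
    report min/max/avg over everything seen so far."""
    def __init__(self):
        self.samples = {}  # key -> [count, total, min, max]

    def record(self, key, value):
        entry = self.samples.setdefault(key, [0, 0.0, value, value])
        entry[0] += 1
        entry[1] += value
        entry[2] = min(entry[2], value)
        entry[3] = max(entry[3], value)

    def aggregate(self, key):
        count, total, lo, hi = self.samples[key]
        return {"min": lo, "max": hi, "avg": total / count}

c = AbsoluteValueCounter()
for ms in (12, 7, 30):  # three request times
    c.record(("httpd", "request_time"), ms)
# c.aggregate(("httpd", "request_time")) -> min 7, max 30, avg ~16.33
```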

If you could help us prioritize the Todo list, we're happy to follow  
your
suggestions. Thanks!

Cheers
Jan
--


Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
On 29 Jan 2009, at 23:42, Jan Lehnardt wrote:

> A full diff against trunk from a few hours ago lives on Friendpaste:
>
>    http://friendpaste.com/3IfIyRv5EzXEnPyk7S9xqe

I messed up the diff. I updated the paste, please reload if you get a
version that starts diffing a `.gitignore` file.

Cheers
Jan
--


Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
Hi,

some more updates from work today:

we revamped the aggregator module and we now support
aggregate values over a 900-second (15-minute) window
at second granularity. This gives us max, min, avg and
stddev for any time range within the 900-second array.
Aggregates are now created automatically for every
module that uses the `increase()/decrease()` API to
add metrics.

The 900 second window is a compile-time option for now.
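
As a rough sketch of that idea (Python purely for illustration; the real
aggregator is Erlang and its internals may differ):

```python
import time

class SecondRingBuffer:
    """Per-second counters over a fixed window, stored as a ring buffer.

    aggregate(n) reports max/min/avg over the most recent n one-second
    slots; the window size corresponds to the 900-second array above.
    """
    def __init__(self, window=900, clock=time.time):
        self.slots = [0] * window
        self.clock = clock
        self.last_second = int(clock())

    def _advance(self):
        now = int(self.clock())
        # zero out any slots for seconds that passed with no updates
        start = max(self.last_second + 1, now - len(self.slots) + 1)
        for s in range(start, now + 1):
            self.slots[s % len(self.slots)] = 0
        self.last_second = now

    def increment(self, by=1):
        self._advance()
        self.slots[self.last_second % len(self.slots)] += by

    def aggregate(self, seconds):
        self._advance()
        vals = [self.slots[(self.last_second - i) % len(self.slots)]
                for i in range(seconds)]
        return {"max": max(vals), "min": min(vals),
                "avg": sum(vals) / float(seconds)}

# deterministic demo with a fake clock
t = [1000.0]
buf = SecondRingBuffer(window=10, clock=lambda: t[0])
buf.increment(); buf.increment()  # two hits in second 1000
t[0] += 1
buf.increment()                   # one hit in second 1001
print(buf.aggregate(2))           # {'max': 2, 'min': 1, 'avg': 1.5}
```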

The GitHub branch:

    http://github.com/janl/couchdb/tree/stats-new

is up to date.

Cheers
Jan
--

Work in progress includes a list of all
On 29 Jan 2009, at 23:42, Jan Lehnardt wrote:

> Hi list,
>
> Alex Lang and I have been working on a statistics module this week.
> We'd like to share our intermediate results and discuss further steps
> and acceptance of the module into the CouchDB codebase. The code
> lives on GitHub:
>
>    http://github.com/janl/couchdb/tree/stats
>
> A full diff against trunk from a few hours ago lives on Friendpaste:
>
>    http://friendpaste.com/3IfIyRv5EzXEnPyk7S9xqe
>
> Be gentle :)
>
> The driving idea was to make it easy for 3rd party monitoring  
> solutions
> to pick up runtime statistics of a single CouchDB node. The stats  
> module
> does not collect long-term stats persistently. It does not draw  
> pretty graphs
> and it does not come with a pony.
>
> The stats module is rather simple, it consists of two Erlang modules,
> a collector and an aggregator. The collector is a `gen_server` holding
> an `ets` table that holds integer values associated with keys. For
> now
> these are simple counters. All CouchDB modules can send a message
> to the collector with any key they want to get a metric counted. E.g.
> `couch_httpd`:
>
>  couch_stats_collector:increment({httpd, requests}).
>
> to count the total number of requests made.
>
> We plan to add support for minimum and maximum values of
> different metrics soon.
>
> The aggregator has two jobs: It exposes the raw counter values from
> the collector to the HTTP plugin (`couch_httpd_stats_handlers.erl`)
> and to aggregate counter values to more meaningful numbers like
> average requests per second over 60 seconds, 300 seconds, 900
> seconds (cron-style).
>
> Reading values is done through a single API endpoint `/_stats`. The
> canonical format is `/_stats/modulename/statkey` e.g.:
>
>  /_stats/httpd/requests
>
> will return a JSON string:
>
>  {"httpd": {"requests":3241}}
>
> We'll have a combined view for all stats under `GET /_stats`.
>
>
> A few notes:
>
> - Stats URLs are read-only.
>
> - This is a first iteration, we can add as many fancy stats and  
> metrics as
>   we like, we went with the ones that felt most useful for us.
>
> - The `{module, key}` identifiers for stats are not yet finalized. We
>   realized early on that we'd like to have a fair number of stats before
>   looking into a good naming scheme. We don't have one yet;
>   recommendations are very welcome.
>
> - We'll add more inline-comments to key parts of the code.
>
> - We're not yet fully integrated into the build system (see below).  
> eunit
>   offers a variety of options to conditionally compile code for  
> testing and
>   moving test code to separate files if that is preferred.
>
> - The way we calculate averages works, but is a bit clunky and can be
>   improved. Also, different kinds of averages can be added.
>
> - Stats are lost on server shutdown or when restartServer() is called,
>   which the JavaScript test suite does.
>
> - Some lines exceed the soft limit of 79 chars. We'll address that  
> in the
>   future, if required.
>
> - Basic testing shows no significant slowdowns so far. Proper load  
> testing
>   has not been done yet.
>
> Test Driven
>
> - We took a test driven approach developing this. Alex is an expert on
>   TDD (with high success rates, but that's another story). We added
>   an extensive number of tests to `couch_tests.js` that differ in  
> style
>   from the existing tests and are more oriented towards proven  
> practices.
>   They are also more verbose and try to test only one thing at a time.
>   We hope you like the new style and we'd like to nudge the devs into
>   taking over our style. The infrastructure for the new-style tests  
> is not
>   final and we can still change things around, if needed.
>
> - For things that were hard or impossible to test through the REST
>   API we added eunit-style tests to our core modules that ensure their
>   functionality. We added a temporary make target `make t` to launch
>   our tests. You need Erlang/OTP R12B-5 for that or an existing eunit
>   installation. We'd like to work with Noah to integrate that into the
>   existing `make test` target (and integrate the existing bits).
>
>
> Review
>
> We're looking for validation of our architectural approach as well as
> help integrating our `increment` calls into the existing CouchDB
> codebase.
>
>
> Legal
>
> We're releasing the source under the Apache 2.0 license. I (Jan) have
> a CLA on file, Alex does not. If I remember correctly, my CLA is  
> enough
> to get the code into the hands of the ASF. Alex will receive  
> honourable
> mentions in our THANKS file. :)
>
> The stats module has been developed under a contract with the BBC.
> They need to be able to integrate CouchDB with their existing  
> monitoring
> solution and we hope this will get them there. The BBC has no interest
> in maintaining the code as a patch and has subsequently allowed us
> to release the code as open source under the Apache 2.0 license and
> asked if we could offer it to the CouchDB community for potential  
> inclusion.
> By ASF rules a proper software grant by the BBC might still have to
> occur, but take that as a given.
>
>
> Next Steps
>
> - We'd like to hammer out any architectural issues you might find.
>
> - We'll work on a decent naming scheme for stats.
>
> - We'll clean up code, add tests, and comments.
>
> - We'll add more stats and the missing bits we mentioned earlier.
>
> - We might end up adding SNMP support.
>
> - Once in some shape (and with all legal issues cleared), we'd like
>  to move the code to an ASF subversion branch and finish integration
>  into trunk from there. Hopefully before 0.9.
>
>
> Thanks for your time,
>
> Cheers
> Alex & Jan
> --
> PS from my PMC-hat: I think the stats module is a valuable addition
> to CouchDB and I welcome any effort to integrate it.
>
>


Re: Statistics Module

Posted by Jan Lehnardt <ja...@apache.org>.
On 29 Jan 2009, at 23:42, Jan Lehnardt wrote:

> Hi list,
>
> Alex Lang and I have been working on a statistics module this week.
> We'd like to share our intermediate results and discuss further steps
> and acceptance of the module into the CouchDB codebase. The code
> lives on GitHub:
>
>    http://github.com/janl/couchdb/tree/stats
>

And thanks to the user-friendliness of git, please now look at

    http://github.com/janl/couchdb/commits/stats-new

Thanks to Chris for getting me sorted out.

Cheers
Jan
--