You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Phil Steitz <st...@yahoo.com> on 2003/05/14 23:27:57 UTC
RE: [math][patch] Lets try this again, here's a new patch for Rolling
Mark,
This is better than storing *all* of the values
(including out of window), but notice that the only
value that you actually use is
values[index]. That's all you really need to keep.
This can be done with just one lagged value.
Also, remember that you need to worry about
maintaining integrity of min and max -- special care
needs to be taken for the cases where the "rolling
off" value is the (single instance of) min or max.
Phil
__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by Phil Steitz <ph...@steitz.com>.
Mohan Kishore wrote:
> Not that I know sqat about this level of mathematics, but I would strongly
> recommend detailed JavaDoc of the algorithm, OR, a link to whitepaper/article
> online which details the same. Failing which, I fear, the use/future
> development/debugging will surely suffer...
Yes. And of course test cases. Step 0 is to complete documentation of
what is already implemented. There is nothing deep here from a
statistical standpoint, just implementation and interface contract,
which need to be carefully documented. Adding to TODO list.
>
> --- Phil Steitz <ph...@steitz.com> wrote:
>
>>Mark R. Diggory wrote:
>>
>>>Phil Steitz wrote:
>>>
>>>
>>>>Mark,
>>>>
>>>>This is better than storing *all* of the values
>>>>(including out of window), but notice that the only
>>>>value that you actually use is
>>>>values[index]. That's all you really need to keep. This can be done
>>>>with just one lagged value.
>>>>
>>>>
>>>
>>>Again, if you think you can get it to work that way, great. You have to
>>>understand that in the next step its going to be index + 1, then index +
>>>2, ... From my perspective, you need all the values between the current
>>>and the end of the window, otherwise, you don't have the values for the
>>>next interation's end of window and so on. Again, maybe i've missed
>>>something, your solution is not obvious to me.
>>>
>>
>>Well, now that I have sat down to think it through carefully, it's not
>>obvious to me either. In fact, its obvious that no such solution can
>>exist! Sorry for the stupidity. That's what I get for thinking about
>>algorithms in 20-second time slices. Need to fix the scheduler...
>>
>>I think an implementation like what you presented would be OK in the
>>base class, working more or less as follows:
>>
>>* The windowSize defaults to "INFINITY" (represented in the standard way
>> as -1 :-).
>>* Setting the windowSize to a positive number allocates storage for a
>> windowSize-length array, resets stats and starts storing values
>> into the array. If a positive length window currently exists, it is
>> destroyed and replaced by the new one.
>>* clear() also clears the window if there is one
>>* Setting windowSize back to "infinity" clears stats and destroys window
>>
>>There are more details to consider, but the basic idea is to allow the
>>user to determine when, and how much data will be persisted.
>>
>>I see this an an intermediate solution between the "stateless" current
>>implementation and an implementation that persists *all* of the values.
>>It can happily coexist with the "stateless" version IMHO.
>>
>>As Tim pointed out above, there are some things that we simply cannot
>>compute without access to the full set of values. Therefore, I would
>>see a separate implementation with a much richer set of stats based on a
>>solution that maintains the full value array in memory (or in some kind
>>of external cache).
>>
>>Once again, sorry for the obstinance.
>>
>>Phil
>>
>>
>>>>Also, remember that you need to worry about
>>>>maintaining integrity of min and max -- special care
>>>>needs to be taken for the cases where the "rolling
>>>>off" value is the (single instance of) min or max.
>>>>
>>>>
>>>
>>>Good points, and nice catch. The limitation seems to me to be that the
>>>array will need to get searched in such a case for the "next" min or max
>>>value.
>>>
>>>-Mark
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>>
>>
>>
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>
>
>
>
> __________________________________
> Do you Yahoo!?
> The New Yahoo! Search - Faster. Easier. Bingo.
> http://search.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
+1
The largest downfall I encounter when trying to use Java packages that
have complex statistical algorithms within them is a lack of
documentation about the algorithm itself.
-Mark
robert burrell donkin wrote:
> On Thursday, May 15, 2003, at 05:34 AM, Mohan Kishore wrote:
>
>> Not that I know sqat about this level of mathematics, but I would
>> strongly
>> recommend detailed JavaDoc of the algorithm, OR, a link to
>> whitepaper/article
>> online which details the same. Failing which, I fear, the use/future
>> development/debugging will surely suffer...
>
>
> +1
>
> i think that this will need doing before math could be released and it's
> probably easier to start as we mean to go on. it'll be easier to insist
> that future developers document their code properly (which is important
> since it'll be very difficult for committers to verify or debug
> algorithms without a precise definition of what they are supposed to do)
> if we all do this from the start. it'll also help avoid any possible
> problems with international differences in technical terminology.
>
> i'd say that the interfaces are probably the right place to document the
> mathematical operation being offered whereas the implementation should
> detail the aims of the implementation (eg memory footprint verses speed,
> small data sets verses large data sets, simple verses complex).
>
> maybe we could come up with a standard format for the class comments.
>
> - robert
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by robert burrell donkin <ro...@blueyonder.co.uk>.
On Thursday, May 15, 2003, at 10:34 PM, Phil Steitz wrote:
> Phil Steitz wrote:
<snip>
>>> maybe we could come up with a standard format for the class comments.
>> I am struggling a bit on this. Here are some questions:
>> 1. I think it would be best to reference external documentation for all
>> definitions and standard computational formulas. This obviously
>> creates dependencies and the need to maintain links. Is this
>> OK? Here is a good reference for basic stat definitions:
>> http://rd11.web.cern.ch/RD11/rkb/titleA.html. If this is acceptable,
>> I will include references to the definitions there throughout. We
>> will need to find similar sources for documentation of numerical
>> algorithms.
that sounds pretty good.
in the long term, i worry a little about broken links. maybe (in the
medium term) we could make the javadocs link to the component
documentation hosted at jakarta and this contain link or links to these
source but this sounds like a good start.
>> 2. The math/stat packages that I like the most (S-Plus, SAS) maintain
>> Users' and Programmers' Guides with detailed information on
>> algorithms and implementation details. I am not sure that putting
>> all of the implementation documentation in the javadoc will be
>> convenient either for the developers or the users. I would suggest
>> the following principles:
>> * favor standard algorithms with external documentation and
>> document these using external references wherever possible
>> * document simple extensions/implementation notes in javadoc
+1
>> * maintain a separate user's guide including more extensive
>> implementation notes, performance analysis, etc.
the javadocs are are easy to maintain and version, and are distributed
with each binary release. unfortunately (at least in the open source world)
user guides tend to be written late and developers have a tendency not to
read user guides. many commons components has user guides that are
integrated as part of the javadocs (usually as package documentation).
maybe we could maintain a systematic versioned user guide which is linked
from the javadocs. this is probably something that should be raise again
once the website is up.
>> 3. In many (most?) cases it will be most convenient to define the
>> interface contracts and whatever implementation notes go in the
>> javadocs in the method comments rather than the class
>> comments
>> I will submit a simple example defining a chi-square test statistic
>> using these ideas in a message to follow.
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=19971
looks good.
but maybe we could add an additional 'usage' section for implementation
classes. for most classes, this would probably say something like 'general'
but might contain stuff like 'not recommended for memory-limited
applications'.
- robert
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by Phil Steitz <ph...@steitz.com>.
Phil Steitz wrote:
> robert burrell donkin wrote:
>
>> On Thursday, May 15, 2003, at 05:34 AM, Mohan Kishore wrote:
>>
>>> Not that I know sqat about this level of mathematics, but I would
>>> strongly
>>> recommend detailed JavaDoc of the algorithm, OR, a link to
>>> whitepaper/article
>>> online which details the same. Failing which, I fear, the use/future
>>> development/debugging will surely suffer...
>>
>>
>>
>> +1
>>
>> i think that this will need doing before math could be released and
>> it's probably easier to start as we mean to go on.
>
>
> Agreed.
>
>> it'll be easier to insist that future developers document their code
>> properly (which is important since it'll be very difficult for
>> committers to verify or debug algorithms without a precise definition
>> of what they are supposed to do) if we all do this from the start.
>> it'll also help avoid any possible problems with international
>> differences in technical terminology.
>>
>> i'd say that the interfaces are probably the right place to document
>> the mathematical operation being offered whereas the implementation
>> should detail the aims of the implementation (eg memory footprint
>> verses speed, small data sets verses large data sets, simple verses
>> complex).
>>
> I agree here too. This is another reason that we might want to just
> enforce a rule that all interfaces will be abstracted. Refactoring as
> we speak...
>
>> maybe we could come up with a standard format for the class comments.
>
>
> I am struggling a bit on this. Here are some questions:
>
> 1. I think it would be best to reference external documentation for all
> definitions and standard computational formulas. This obviously
> creates dependencies and the need to maintain links. Is this
> OK? Here is a good reference for basic stat definitions:
> http://rd11.web.cern.ch/RD11/rkb/titleA.html. If this is acceptable,
> I will include references to the definitions there throughout. We
> will need to find similar sources for documentation of numerical
> algorithms.
>
> 2. The math/stat packages that I like the most (S-Plus, SAS) maintain
> Users' and Programmers' Guides with detailed information on
> algorithms and implementation details. I am not sure that putting
> all of the implementation documentation in the javadoc will be
> convenient either for the developers or the users. I would suggest
> the following principles:
>
> * favor standard algorithms with external documentation and
> document these using external references wherever possible
> * document simple extensions/implementation notes in javadoc
> * maintain a separate user's guide including more extensive
> implementation notes, performance analysis, etc.
>
> 3. In many (most?) cases it will be most convenient to define the
> interface contracts and whatever implementation notes go in the
> javadocs in the method comments rather than the class
> comments
>
> I will submit a simple example defining a chi-square test statistic
> using these ideas in a message to follow.
>
http://issues.apache.org/bugzilla/show_bug.cgi?id=19971
> Phil
>
>>
>> - robert
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by Phil Steitz <ph...@steitz.com>.
robert burrell donkin wrote:
> On Thursday, May 15, 2003, at 05:34 AM, Mohan Kishore wrote:
>
>> Not that I know sqat about this level of mathematics, but I would
>> strongly
>> recommend detailed JavaDoc of the algorithm, OR, a link to
>> whitepaper/article
>> online which details the same. Failing which, I fear, the use/future
>> development/debugging will surely suffer...
>
>
> +1
>
> i think that this will need doing before math could be released and it's
> probably easier to start as we mean to go on.
Agreed.
> it'll be easier to insist
> that future developers document their code properly (which is important
> since it'll be very difficult for committers to verify or debug
> algorithms without a precise definition of what they are supposed to do)
> if we all do this from the start. it'll also help avoid any possible
> problems with international differences in technical terminology.
>
> i'd say that the interfaces are probably the right place to document the
> mathematical operation being offered whereas the implementation should
> detail the aims of the implementation (eg memory footprint verses speed,
> small data sets verses large data sets, simple verses complex).
>
I agree here too. This is another reason that we might want to just
enforce a rule that all interfaces will be abstracted. Refactoring as
we speak...
> maybe we could come up with a standard format for the class comments.
I am struggling a bit on this. Here are some questions:
1. I think it would be best to reference external documentation for all
definitions and standard computational formulas. This obviously
creates dependencies and the need to maintain links. Is this
OK? Here is a good reference for basic stat definitions:
http://rd11.web.cern.ch/RD11/rkb/titleA.html. If this is acceptable,
I will include references to the definitions there throughout. We
will need to find similar sources for documentation of numerical
algorithms.
2. The math/stat packages that I like the most (S-Plus, SAS) maintain
Users' and Programmers' Guides with detailed information on
algorithms and implementation details. I am not sure that putting
all of the implementation documentation in the javadoc will be
convenient either for the developers or the users. I would suggest
the following principles:
* favor standard algorithms with external documentation and
document these using external references wherever possible
* document simple extensions/implementation notes in javadoc
* maintain a separate user's guide including more extensive
implementation notes, performance analysis, etc.
3. In many (most?) cases it will be most convenient to define the
interface contracts and whatever implementation notes go in the
javadocs in the method comments rather than the class
comments
I will submit a simple example defining a chi-square test statistic
using these ideas in a message to follow.
Phil
>
> - robert
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by robert burrell donkin <ro...@blueyonder.co.uk>.
On Thursday, May 15, 2003, at 05:34 AM, Mohan Kishore wrote:
> Not that I know sqat about this level of mathematics, but I would strongly
> recommend detailed JavaDoc of the algorithm, OR, a link to
> whitepaper/article
> online which details the same. Failing which, I fear, the use/future
> development/debugging will surely suffer...
+1
i think that this will need doing before math could be released and it's
probably easier to start as we mean to go on. it'll be easier to insist
that future developers document their code properly (which is important
since it'll be very difficult for committers to verify or debug algorithms
without a precise definition of what they are supposed to do) if we all do
this from the start. it'll also help avoid any possible problems with
international differences in technical terminology.
i'd say that the interfaces are probably the right place to document the
mathematical operation being offered whereas the implementation should
detail the aims of the implementation (eg memory footprint verses speed,
small data sets verses large data sets, simple verses complex).
maybe we could come up with a standard format for the class comments.
- robert
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by Mohan Kishore <mo...@yahoo.com>.
Not that I know sqat about this level of mathematics, but I would strongly
recommend detailed JavaDoc of the algorithm, OR, a link to whitepaper/article
online which details the same. Failing which, I fear, the use/future
development/debugging will surely suffer...
--- Phil Steitz <ph...@steitz.com> wrote:
> Mark R. Diggory wrote:
> > Phil Steitz wrote:
> >
> >> Mark,
> >>
> >> This is better than storing *all* of the values
> >> (including out of window), but notice that the only
> >> value that you actually use is
> >> values[index]. That's all you really need to keep. This can be done
> >> with just one lagged value.
> >>
> >>
> > Again, if you think you can get it to work that way, great. You have to
> > understand that in the next step its going to be index + 1, then index +
> > 2, ... From my perspective, you need all the values between the current
> > and the end of the window, otherwise, you don't have the values for the
> > next interation's end of window and so on. Again, maybe i've missed
> > something, your solution is not obvious to me.
> >
> Well, now that I have sat down to think it through carefully, it's not
> obvious to me either. In fact, its obvious that no such solution can
> exist! Sorry for the stupidity. That's what I get for thinking about
> algorithms in 20-second time slices. Need to fix the scheduler...
>
> I think an implementation like what you presented would be OK in the
> base class, working more or less as follows:
>
> * The windowSize defaults to "INFINITY" (represented in the standard way
> as -1 :-).
> * Setting the windowSize to a positive number allocates storage for a
> windowSize-length array, resets stats and starts storing values
> into the array. If a positive length window currently exists, it is
> destroyed and replaced by the new one.
> * clear() also clears the window if there is one
> * Setting windowSize back to "infinity" clears stats and destroys window
>
> There are more details to consider, but the basic idea is to allow the
> user to determine when, and how much data will be persisted.
>
> I see this an an intermediate solution between the "stateless" current
> implementation and an implementation that persists *all* of the values.
> It can happily coexist with the "stateless" version IMHO.
>
> As Tim pointed out above, there are some things that we simply cannot
> compute without access to the full set of values. Therefore, I would
> see a separate implementation with a much richer set of stats based on a
> solution that maintains the full value array in memory (or in some kind
> of external cache).
>
> Once again, sorry for the obstinance.
>
> Phil
>
> >> Also, remember that you need to worry about
> >> maintaining integrity of min and max -- special care
> >> needs to be taken for the cases where the "rolling
> >> off" value is the (single instance of) min or max.
> >>
> >>
> >
> > Good points, and nice catch. The limitation seems to me to be that the
> > array will need to get searched in such a case for the "next" min or max
> > value.
> >
> > -Mark
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: commons-dev-help@jakarta.apache.org
> >
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
__________________________________
Do you Yahoo!?
The New Yahoo! Search - Faster. Easier. Bingo.
http://search.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by Phil Steitz <ph...@steitz.com>.
Mark R. Diggory wrote:
> Phil Steitz wrote:
>
>> Mark,
>>
>> This is better than storing *all* of the values
>> (including out of window), but notice that the only
>> value that you actually use is
>> values[index]. That's all you really need to keep. This can be done
>> with just one lagged value.
>>
>>
> Again, if you think you can get it to work that way, great. You have to
> understand that in the next step its going to be index + 1, then index +
> 2, ... From my perspective, you need all the values between the current
> and the end of the window, otherwise, you don't have the values for the
> next interation's end of window and so on. Again, maybe i've missed
> something, your solution is not obvious to me.
>
Well, now that I have sat down to think it through carefully, it's not
obvious to me either. In fact, its obvious that no such solution can
exist! Sorry for the stupidity. That's what I get for thinking about
algorithms in 20-second time slices. Need to fix the scheduler...
I think an implementation like what you presented would be OK in the
base class, working more or less as follows:
* The windowSize defaults to "INFINITY" (represented in the standard way
as -1 :-).
* Setting the windowSize to a positive number allocates storage for a
windowSize-length array, resets stats and starts storing values
into the array. If a positive length window currently exists, it is
destroyed and replaced by the new one.
* clear() also clears the window if there is one
* Setting windowSize back to "infinity" clears stats and destroys window
There are more details to consider, but the basic idea is to allow the
user to determine when, and how much data will be persisted.
I see this an an intermediate solution between the "stateless" current
implementation and an implementation that persists *all* of the values.
It can happily coexist with the "stateless" version IMHO.
As Tim pointed out above, there are some things that we simply cannot
compute without access to the full set of values. Therefore, I would
see a separate implementation with a much richer set of stats based on a
solution that maintains the full value array in memory (or in some kind
of external cache).
Once again, sorry for the obstinance.
Phil
>> Also, remember that you need to worry about
>> maintaining integrity of min and max -- special care
>> needs to be taken for the cases where the "rolling
>> off" value is the (single instance of) min or max.
>>
>>
>
> Good points, and nice catch. The limitation seems to me to be that the
> array will need to get searched in such a case for the "next" min or max
> value.
>
> -Mark
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: commons-dev-help@jakarta.apache.org
>
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:
>Mark,
>
>This is better than storing *all* of the values
>(including out of window), but notice that the only
>value that you actually use is
>values[index]. That's all you really need to keep.
>This can be done with just one lagged value.
>
>
Again, if you think you can get it to work that way, great. You have to
understand that in the next step its going to be index + 1, then index +
2, ... From my perspective, you need all the values between the current
and the end of the window, otherwise, you don't have the values for the
next interation's end of window and so on. Again, maybe i've missed
something, your solution is not obvious to me.
>Also, remember that you need to worry about
>maintaining integrity of min and max -- special care
>needs to be taken for the cases where the "rolling
>off" value is the (single instance of) min or max.
>
>
Good points, and nice catch. The limitation seems to me to be that the
array will need to get searched in such a case for the "next" min or max
value.
-Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [math][patch] Lets try this again, here's a new patch for Rolling
Posted by "Mark R. Diggory" <md...@latte.harvard.edu>.
Phil Steitz wrote:
> Mark,
> Also, remember that you need to worry about
> maintaining integrity of min and max -- special care
> needs to be taken for the cases where the "rolling
> off" value is the (single instance of) min or max.
>
> Phil
>
I have a patch for this as well. I'll roll together another full patch
and submit it again to you for addition.
-Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org