You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@commons.apache.org by "Gilles (JIRA)" <ji...@apache.org> on 2018/01/23 12:32:00 UTC
[jira] [Commented] (MATH-1441)
SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on
every call
[ https://issues.apache.org/jira/browse/MATH-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335710#comment-16335710 ]
Gilles commented on MATH-1441:
------------------------------
{quote}What are your thoughts?
{quote}
If it's a common use-case, it is certainly worth looking at how caching should be implemented without jeopardizing robustness.
As it happens, we are creating new components out of "Commons Math"; work on ["Commons Statistics"|https://git1-us-west.apache.org/repos/asf?p=commons-statistics.git] has just started.
Please file another report on the associated [issue-tracking system|https://issues.apache.org/jira/projects/STATISTICS/issues] with a link to this one.
You are most welcome to join in the discussions on the "dev" ML.
The move to a new component is an excellent opportunity to try and improve the design.
> SimpleRegression#getSlopeConfidenceInterval recalculates t distribution on every call
> -------------------------------------------------------------------------------------
>
> Key: MATH-1441
> URL: https://issues.apache.org/jira/browse/MATH-1441
> Project: Commons Math
> Issue Type: Improvement
> Affects Versions: 3.6.1
> Environment: Java 8, Linux x64.
> Reporter: Max Aller
> Priority: Minor
> Labels: performance
>
> SimpleRegression#getSlopeConfidenceInterval, when called a lot (on the other of 100k or 1M times), is surprisingly slow - 3M calls, on my 3rd gen i7 machine, takes *75 seconds*. This is primarily because recalculating the inverse cumulative probability, and reconstructing the TDistribution object itself, is somewhat expensive, entailing a lot of `log` and `sqrt` and iteration and all that stuff.
> The confidence interval is based on two values - *n* and *alpha*. I'd posit that alpha will _usually_ be one of a very small set of values, and n, well, at least in my case, I'm always calling this method with the same number of data points - a moving window of time-series data. But I recognize n might be all over the place for some users.
> In any event, I strongly believe some level of caching would greatly benefit the users of Commons Math. I've forked my own version of getSlopeConfidenceInterval that uses caching. Here's a test case demonstrating that:
> {code:java}
> class SlowRegressionsTest {
> @Test
> void slow() {
> SimpleRegression simpleRegression = new SimpleRegression();
> simpleRegression.addData(0.0, 0.0);
> simpleRegression.addData(1.0, 1.0);
> simpleRegression.addData(2.0, 2.0);
> long start = System.currentTimeMillis();
> for (int i = 0; i < 3_000_000; i++) {
> simpleRegression.getSlopeConfidenceInterval();
> }
> long duration = System.currentTimeMillis() - start;
> System.out.printf("`slow` took %dms%n", duration);
> }
> @Test
> void fast() {
> SimpleRegression simpleRegression = new SimpleRegression();
> simpleRegression.addData(0.0, 0.0);
> simpleRegression.addData(1.0, 1.0);
> simpleRegression.addData(2.0, 2.0);
> long start = System.currentTimeMillis();
> for (int i = 0; i < 3_000_000; i++) {
> SimpleRegressionUtilsKt.fastGetSlopeConfidenceInterval(simpleRegression);
> }
> long duration = System.currentTimeMillis() - start;
> System.out.printf("`fast` took %dms%n", duration);
> }
> }{code}
> which prints out
> {noformat}
> `fast` took 159ms
> `slow` took 75304ms{noformat}
> Nearly 500x faster - 53ns/call. My particular solution is written in Kotlin for Java 8, so not of direct relevance to you, but here it is:
> {code:java}
> package math
> import org.apache.commons.math3.distribution.TDistribution
> import org.apache.commons.math3.exception.OutOfRangeException
> import org.apache.commons.math3.exception.util.LocalizedFormats
> import org.apache.commons.math3.stat.regression.SimpleRegression
> import java.util.concurrent.ConcurrentHashMap
> @JvmOverloads
> fun SimpleRegression.fastGetSlopeConfidenceInterval(alpha: Double = 0.05): Double {
> if (n < 3) {
> return Double.NaN
> }
> if (alpha >= 1 || alpha <= 0) {
> throw OutOfRangeException(
> LocalizedFormats.SIGNIFICANCE_LEVEL,
> alpha, 0, 1
> )
> }
> // No advertised NotStrictlyPositiveException here - will return NaN above
> // PATCH: use cached inverse cumulative probability
> return slopeStdErr * getInverseCumulativeProbability(n, alpha)
> }
> private val cache = ConcurrentHashMap<Key, Double>()
> private data class Key(val n: Long, val alpha: Double)
> private fun getInverseCumulativeProbability(n: Long, alpha: Double): Double =
> cache.getOrPut(Key(n, alpha)) {
> TDistribution((n - 2).toDouble()).inverseCumulativeProbability(1.0 - alpha / 2.0)
> }
> {code}
> Limitations: 1. Kotlin, 2. ConcurrentHashMap is unbounded here.
> I don't know how/if Commons Math does caching elsewhere, but it'd sure be handy here, I believe. What are your thoughts?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)