You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Ben Nguyen <be...@gmail.com> on 2019/07/23 18:31:22 UTC

[GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Hello,

Issue was noticed as a PMD violation during PR’s Travis CI build.
“The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, NOPA=0, NOAM=4, WMC=6)”
https://github.com/apache/commons-statistics/pull/23

UML for current design:
https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png

The current design is my (probably wrong) interpretation on a suggestion from Eric Barnhill. 
The idea is that all calculation functionality is done in the OLSRegression class, which passes itself to the OLSResults constructor via regress() method.
The OLSResults (implementing RegressionResults interface) calls on statistics calculating methods from OLSRegression and saves it as private fields with getters.
I am thinking this is good for users who want everything to be calculated only once to then be used multiple times (without the recalculation or having to save all statistics themselves).
The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?
Then, for those who only want a single calculated statistic, the methods to calculate in OLSRegression are also public.

This design does cause that PMD error, so any suggestions or alternate designs to correct my interpretation?

Thank you,
Cheers,
-Ben Nguyen

Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Posted by Gilles Sadowski <gi...@gmail.com>.

Hi.

Le mer. 24 juil. 2019 à 18:00, Ben Nguyen <be...@gmail.com> a écrit :
>
> Hello,
> Thank you for your feedback, I will answer your questions:
>
> > Why is the constructor "protected"?
>
> I had thought it should be restricted to only work with ols.regress() but then it should’ve be default.
> But public might be the way to go, that restriction is probably not necessary….?

"protected" means there is a case for defining a subclass.

If the class holds data that is only instantiated by our library,
there may be a case for preventing "external" instantiation (i.e.
make the constructor private or package-private).

>
> > Why is the computation performed in that constructor although the data is
> > stored in "OLSRegression"? [This looks like what PMD complains about.]
>
> Sorry, I don’t fully understand this question….
> From what I understand after searching about the Data Class violation:
> Quote: “[OLSResults ]interface reveals more data than behaviour. Low values (less than 30%) [OLSResults showed 20%] indicate that the class reveals much more data than behaviour, which is a sign of poor encapsulation.”
> Quote under “Weight Of Class (WOC)”
> https://pmd.github.io/latest/pmd_java_metrics_index.html

The point is not to make the report clean but to come up
with a satisfactory design.
It would be helpful to start from use-cases.

> > It's good to try and have encapsulation preserved through accessors but in
> > that case it's broken because some return a reference to an array (entries
> > are mutable).
>
> I will fix that with a deepCopy() method, though I’m not sure where to put it, in same class?
> Could also do one-liner like this for 2D array:
>     public double[][] getBetaVariance() {
>         return Arrays.stream(betaVariance).map(el -> el.clone()).toArray($ -> hatMatrix.clone());
>     }
> But I think .clone() is okay for primitive type 1D arrays? If not, System.arraycopy() should be okay?

There is also "copyOf" in JDK class "Arrays".

> > Implementing an interface "RegressionResults" suggest that application
> > should program against it, and not the specific "OLSResult"; however the
> > latter has more methods than the interface defines.
>
> Yeah, I think this is a conflict with:
>
> > RegressionResults res = reg.regress(data);
>
> Since not all regression types calculate the same statistics, ex: OLS has R-squared, TSS, RSS
> While GLS doesn’t officially have that (unless you count pseudo-solutions out there)
> But of course, all regression types have many similar statistics like beta, beta variance, residuals, etc
> So I’m not sure how to structure an interface like RegressionResults to account for this,
> I will look carefully into the factory method design pattern and understand it properly to solve this,
> I think I might’ve missed things that would address this….? I will now definitely refactor the current PR 23 into
> a proper factory method design as well with consideration for inner-interfaces you mentioned.

I'd think it could help with the PMD report, but it will not solve
the issue of the "incomplete" interface.

> > What is the most useful functionality?
> > What is the most robust functionality?
> > IOW:  Is there a case for not computing everything?
>
> My consideration (though I don’t have professional experience in Data science/engineering) is maybe someone needs
> to calculate one statistic from many new samples (constantly coming in) at once….

Streams?

Isn't the current "stored" implementation at odds with that use-case?

> This case might not exist, but it’s what I thought of,
> and considering for cases I haven’t thought of as well. The old code allowed for this so I didn’t want to take that away….
>
> > There are many redundant computations...
>
> Yes, PR 23 right now is kind of v1 to show everything works.

Kind of version 0.1-alpha, you mean? :-)

> I was focused on getting the port to work 100% with EJML first with old code base,

EJML is another issue with the current version: "StatisticsMatrix" inherits
from an EJML, so there is a direct (public API) dependence (that should
be avoided).

> I will look at optimization between methods (and factory methods) more deeply now (another optimization found with decomposing QR twice as well).

Good but a good design usually avoids redundancy.

> > > The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?
> > Why should it be done within the class if it provides accessors to all
> > its contents?
>
> This issue asked for an ANOVA table printout: https://issues.apache.org/jira/browse/MATH-1097
> So I was considering that (even though calculating F-stat and p-value would be another challenge),
> But this lead me to consider a summary statistics printout like one found in R, for user convenience…
> Are you saying this is not necessary for this application?

No, I'm saying that we should first focus on the main functionality.

Regards,
Gilles

>
> Again, thank you for your feedback,
> -Ben Nguyen
>
>
>
> From: Gilles Sadowski
> Sent: Tuesday, July 23, 2019 6:45 PM
> To: Commons Developers List
> Subject: Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design
>
> Hello.
>
> [Please note that the first tag (in square bracket) on the "Subject" line should
> be the (short) name of the component, i.e. in this case, "[Statistics]".]
>
> Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <be...@gmail.com> a écrit :
> >
> > Hello,
> >
> > Issue was noticed as a PMD violation during PR’s Travis CI build.
> > “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, NOPA=0, NOAM=4, WMC=6)”
> > https://github.com/apache/commons-statistics/pull/23
> >
> > UML for current design:
> > https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png
> >
> > The current design is my (probably wrong) interpretation on a suggestion from Eric Barnhill.
> > The idea is that all calculation functionality is done in the OLSRegression class, which passes itself to the OLSResults constructor via regress() method.
>
> Better directly show the relevant code:
>
> public class OLSRegression extends AbstractRegression {
>     // ...
>
>     public RegressionResults regress() {
>         return new OLSResults(this);
>     }
> }
>
> public class OLSResults implements RegressionResults, Serializable {
>     protected OLSResults(OLSRegression ols) {
>         this.beta = ols.estimateBeta();
>         this.betaVariance = ols.estimateBetaVariance();
>         this.betaStandardErrors = ols.estimateBetaStandardErrors();
>         this.residuals = ols.estimateResiduals();
>         this.regressionStandardErrors = ols.estimateRegressionStandardError();
>         this.regressandVariance = ols.estimateRegressandVariance();
>
>         this.hatMatrix = ols.calculateHat().toArray2D();
>         this.totalSumOfSquares = ols.calculateTotalSumOfSquares();
>         this.residualSumOfSquares = ols.calculateResidualSumOfSquares();
>         this.regRSquared = ols.calculateRSquared();
>         this.adjustedRSquared = ols.calculateAdjustedRSquared();
>     }
> }
>
> Why is the constructor "protected"?
> Why is the computation performed in that constructor although the data is
> stored in "OLSRegression"? [This looks like what PMD complains about.]
>
> > The OLSResults (implementing RegressionResults interface) calls on statistics calculating methods from OLSRegression and saves it as private fields with getters.
>
> It's good to try and have encapsulation preserved through accessors but in
> that case it's broken because some return a reference to an array (entries
> are mutable).
>
> Implementing an interface "RegressionResults" suggest that application
> should program against it, and not the specific "OLSResult"; however the
> latter has more methods than the interface defines.
>
> > I am thinking this is good for users who want everything to be calculated only once to then be used multiple times (without the recalculation or having to save all statistics themselves).
>
> What is the most useful functionality?
> What is the most robust functionality?
> IOW:  Is there a case for not computing everything?
>
> These are two of the methods called in the constructor above:
>
>     public double calculateRSquared() {
>         return 1 - calculateResidualSumOfSquares() /
> calculateTotalSumOfSquares();
>     }
>
>     public double calculateAdjustedRSquared() {
>         final double n = getX().numRows();
>         if (getHasIntercept()) {
>             return 1 - (1 - calculateRSquared()) * (n / (n - getX().numCols()));
>         } else {
>             return 1 -
>                    (calculateResidualSumOfSquares() * (n - 1)) /
>                        (calculateTotalSumOfSquares() * (n - getX().numCols()));
>         }
>     }
>
> There are many redundant computations...
>
> > The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?
>
> Why should it be done within the class if it provides accessors to all
> its contents?
>
> > Then, for those who only want a single calculated statistic, the methods to calculate in OLSRegression are also public.
>
> Then even assuming that you fix somehow "OLSResults" to compute
> each result only once, a user who directly calls the above two methods
> will get subpar performance.
>
> > This design does cause that PMD error, so any suggestions or alternate designs to correct my interpretation?
>
> There are definite problems with your implementation that need fixing;
> but it's not necessarily proof of a bad design.
> However it's clearly not the most obvious one, which IMO would go along
> the lines
>
> RegressionData data = RegressionDataLoader.from(x, y, hasIntercept);
> Regression reg = new OLSRegression();
> RegressionResults res = reg.regress(data);
>
> This is all untested of course, and could well fall short for advanced
> usage (stream?).  Your design (or yet another) could be an attempt
> to handle particular use-cases which the obvious approach does not
> cover (?) ...
>
> Regards,
> Gilles
>
> >
> > Thank you,
> > Cheers,
> > -Ben Nguyen

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

RE: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Posted by Ben Nguyen <be...@gmail.com>.

Hello,
Thank you for your feedback, I will answer your questions:

> Why is the constructor "protected"?

I had thought it should be restricted to only work with ols.regress() but then it should’ve be default.
But public might be the way to go, that restriction is probably not necessary….?

> Why is the computation performed in that constructor although the data is
> stored in "OLSRegression"? [This looks like what PMD complains about.]

Sorry, I don’t fully understand this question…. 
From what I understand after searching about the Data Class violation:
Quote: “[OLSResults ]interface reveals more data than behaviour. Low values (less than 30%) [OLSResults showed 20%] indicate that the class reveals much more data than behaviour, which is a sign of poor encapsulation.”
Quote under “Weight Of Class (WOC)”
https://pmd.github.io/latest/pmd_java_metrics_index.html


> It's good to try and have encapsulation preserved through accessors but in
> that case it's broken because some return a reference to an array (entries
> are mutable).

I will fix that with a deepCopy() method, though I’m not sure where to put it, in same class?
Could also do one-liner like this for 2D array:
    public double[][] getBetaVariance() {
        return Arrays.stream(betaVariance).map(el -> el.clone()).toArray($ -> hatMatrix.clone());
    }
But I think .clone() is okay for primitive type 1D arrays? If not, System.arraycopy() should be okay?

> Implementing an interface "RegressionResults" suggest that application
> should program against it, and not the specific "OLSResult"; however the
> latter has more methods than the interface defines.

Yeah, I think this is a conflict with:

> RegressionResults res = reg.regress(data);

Since not all regression types calculate the same statistics, ex: OLS has R-squared, TSS, RSS
While GLS doesn’t officially have that (unless you count pseudo-solutions out there)
But of course, all regression types have many similar statistics like beta, beta variance, residuals, etc
So I’m not sure how to structure an interface like RegressionResults to account for this,
I will look carefully into the factory method design pattern and understand it properly to solve this, 
I think I might’ve missed things that would address this….? I will now definitely refactor the current PR 23 into 
a proper factory method design as well with consideration for inner-interfaces you mentioned.

> What is the most useful functionality?
> What is the most robust functionality?
> IOW:  Is there a case for not computing everything?

My consideration (though I don’t have professional experience in Data science/engineering) is maybe someone needs 
to calculate one statistic from many new samples (constantly coming in) at once…. This case might not exist, but it’s what I thought of,
and considering for cases I haven’t thought of as well. The old code allowed for this so I didn’t want to take that away….

> There are many redundant computations...

Yes, PR 23 right now is kind of v1 to show everything works. I was focused on getting the port to work 100% with EJML first with old code base, 
I will look at optimization between methods (and factory methods) more deeply now (another optimization found with decomposing QR twice as well). 

> > The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?
> Why should it be done within the class if it provides accessors to all
> its contents?

This issue asked for an ANOVA table printout: https://issues.apache.org/jira/browse/MATH-1097
So I was considering that (even though calculating F-stat and p-value would be another challenge),
But this lead me to consider a summary statistics printout like one found in R, for user convenience…
Are you saying this is not necessary for this application?

Again, thank you for your feedback,
-Ben Nguyen



From: Gilles Sadowski
Sent: Tuesday, July 23, 2019 6:45 PM
To: Commons Developers List
Subject: Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Hello.

[Please note that the first tag (in square bracket) on the "Subject" line should
be the (short) name of the component, i.e. in this case, "[Statistics]".]

Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <be...@gmail.com> a écrit :
>
> Hello,
>
> Issue was noticed as a PMD violation during PR’s Travis CI build.
> “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, NOPA=0, NOAM=4, WMC=6)”
> https://github.com/apache/commons-statistics/pull/23
>
> UML for current design:
> https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png
>
> The current design is my (probably wrong) interpretation on a suggestion from Eric Barnhill.
> The idea is that all calculation functionality is done in the OLSRegression class, which passes itself to the OLSResults constructor via regress() method.

Better directly show the relevant code:

public class OLSRegression extends AbstractRegression {
    // ...

    public RegressionResults regress() {
        return new OLSResults(this);
    }
}

public class OLSResults implements RegressionResults, Serializable {
    protected OLSResults(OLSRegression ols) {
        this.beta = ols.estimateBeta();
        this.betaVariance = ols.estimateBetaVariance();
        this.betaStandardErrors = ols.estimateBetaStandardErrors();
        this.residuals = ols.estimateResiduals();
        this.regressionStandardErrors = ols.estimateRegressionStandardError();
        this.regressandVariance = ols.estimateRegressandVariance();

        this.hatMatrix = ols.calculateHat().toArray2D();
        this.totalSumOfSquares = ols.calculateTotalSumOfSquares();
        this.residualSumOfSquares = ols.calculateResidualSumOfSquares();
        this.regRSquared = ols.calculateRSquared();
        this.adjustedRSquared = ols.calculateAdjustedRSquared();
    }
}

Why is the constructor "protected"?
Why is the computation performed in that constructor although the data is
stored in "OLSRegression"? [This looks like what PMD complains about.]

> The OLSResults (implementing RegressionResults interface) calls on statistics calculating methods from OLSRegression and saves it as private fields with getters.

It's good to try and have encapsulation preserved through accessors but in
that case it's broken because some return a reference to an array (entries
are mutable).

Implementing an interface "RegressionResults" suggest that application
should program against it, and not the specific "OLSResult"; however the
latter has more methods than the interface defines.

> I am thinking this is good for users who want everything to be calculated only once to then be used multiple times (without the recalculation or having to save all statistics themselves).

What is the most useful functionality?
What is the most robust functionality?
IOW:  Is there a case for not computing everything?

These are two of the methods called in the constructor above:

    public double calculateRSquared() {
        return 1 - calculateResidualSumOfSquares() /
calculateTotalSumOfSquares();
    }

    public double calculateAdjustedRSquared() {
        final double n = getX().numRows();
        if (getHasIntercept()) {
            return 1 - (1 - calculateRSquared()) * (n / (n - getX().numCols()));
        } else {
            return 1 -
                   (calculateResidualSumOfSquares() * (n - 1)) /
                       (calculateTotalSumOfSquares() * (n - getX().numCols()));
        }
    }

There are many redundant computations...

> The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?

Why should it be done within the class if it provides accessors to all
its contents?

> Then, for those who only want a single calculated statistic, the methods to calculate in OLSRegression are also public.

Then even assuming that you fix somehow "OLSResults" to compute
each result only once, a user who directly calls the above two methods
will get subpar performance.

> This design does cause that PMD error, so any suggestions or alternate designs to correct my interpretation?

There are definite problems with your implementation that need fixing;
but it's not necessarily proof of a bad design.
However it's clearly not the most obvious one, which IMO would go along
the lines

RegressionData data = RegressionDataLoader.from(x, y, hasIntercept);
Regression reg = new OLSRegression();
RegressionResults res = reg.regress(data);

This is all untested of course, and could well fall short for advanced
usage (stream?).  Your design (or yet another) could be an attempt
to handle particular use-cases which the obvious approach does not
cover (?) ...

Regards,
Gilles

>
> Thank you,
> Cheers,
> -Ben Nguyen
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [GSoC][Regression][OLS] RegressionResults/OLSResults Re-design

Posted by Gilles Sadowski <gi...@gmail.com>.

Hello.

[Please note that the first tag (in square bracket) on the "Subject" line should
be the (short) name of the component, i.e. in this case, "[Statistics]".]

Le mar. 23 juil. 2019 à 20:31, Ben Nguyen <be...@gmail.com> a écrit :
>
> Hello,
>
> Issue was noticed as a PMD violation during PR’s Travis CI build.
> “The class 'OLSResults' is suspected to be a Data Class (WOC=20.000%, NOPA=0, NOAM=4, WMC=6)”
> https://github.com/apache/commons-statistics/pull/23
>
> UML for current design:
> https://github.com/BBenNguyenn/commons-statistics/blob/gsoc-milestone-1/commons-statistics-regression/UML_current.png
>
> The current design is my (probably wrong) interpretation on a suggestion from Eric Barnhill.
> The idea is that all calculation functionality is done in the OLSRegression class, which passes itself to the OLSResults constructor via regress() method.

Better directly show the relevant code:

public class OLSRegression extends AbstractRegression {
    // ...

    public RegressionResults regress() {
        return new OLSResults(this);
    }
}

public class OLSResults implements RegressionResults, Serializable {
    protected OLSResults(OLSRegression ols) {
        this.beta = ols.estimateBeta();
        this.betaVariance = ols.estimateBetaVariance();
        this.betaStandardErrors = ols.estimateBetaStandardErrors();
        this.residuals = ols.estimateResiduals();
        this.regressionStandardErrors = ols.estimateRegressionStandardError();
        this.regressandVariance = ols.estimateRegressandVariance();

        this.hatMatrix = ols.calculateHat().toArray2D();
        this.totalSumOfSquares = ols.calculateTotalSumOfSquares();
        this.residualSumOfSquares = ols.calculateResidualSumOfSquares();
        this.regRSquared = ols.calculateRSquared();
        this.adjustedRSquared = ols.calculateAdjustedRSquared();
    }
}

Why is the constructor "protected"?
Why is the computation performed in that constructor although the data is
stored in "OLSRegression"? [This looks like what PMD complains about.]

> The OLSResults (implementing RegressionResults interface) calls on statistics calculating methods from OLSRegression and saves it as private fields with getters.

It's good to try and have encapsulation preserved through accessors but in
that case it's broken because some return a reference to an array (entries
are mutable).

Implementing an interface "RegressionResults" suggest that application
should program against it, and not the specific "OLSResult"; however the
latter has more methods than the interface defines.

> I am thinking this is good for users who want everything to be calculated only once to then be used multiple times (without the recalculation or having to save all statistics themselves).

What is the most useful functionality?
What is the most robust functionality?
IOW:  Is there a case for not computing everything?

These are two of the methods called in the constructor above:

    public double calculateRSquared() {
        return 1 - calculateResidualSumOfSquares() /
calculateTotalSumOfSquares();
    }

    public double calculateAdjustedRSquared() {
        final double n = getX().numRows();
        if (getHasIntercept()) {
            return 1 - (1 - calculateRSquared()) * (n / (n - getX().numCols()));
        } else {
            return 1 -
                   (calculateResidualSumOfSquares() * (n - 1)) /
                       (calculateTotalSumOfSquares() * (n - getX().numCols()));
        }
    }

There are many redundant computations...

> The RegressionResults interface could also include some kind of output method to get all statistics in a file or displayed?

Why should it be done within the class if it provides accessors to all
its contents?

> Then, for those who only want a single calculated statistic, the methods to calculate in OLSRegression are also public.

Then even assuming that you fix somehow "OLSResults" to compute
each result only once, a user who directly calls the above two methods
will get subpar performance.

> This design does cause that PMD error, so any suggestions or alternate designs to correct my interpretation?

There are definite problems with your implementation that need fixing;
but it's not necessarily proof of a bad design.
However it's clearly not the most obvious one, which IMO would go along
the lines

RegressionData data = RegressionDataLoader.from(x, y, hasIntercept);
Regression reg = new OLSRegression();
RegressionResults res = reg.regress(data);

This is all untested of course, and could well fall short for advanced
usage (stream?).  Your design (or yet another) could be an attempt
to handle particular use-cases which the obvious approach does not
cover (?) ...

Regards,
Gilles

>
> Thank you,
> Cheers,
> -Ben Nguyen
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org