Posted to dev@commons.apache.org by Eric Barnhill <er...@gmail.com> on 2019/10/22 19:49:41 UTC

[statistics-regression] Proposed Regression class/method structure

I propose the following class structure for commons-statistics-regression.

The interface carried over from commons-math is more of an academic
approach to thinking about regression. For rebooting the library (and I
hinted at this when I wrote the tickets for summer of code) I was hoping to
emulate widespread tools like R and scikit-learn, and consider that
"machine learning" is an increasingly popular use of regression. This
proposed structure creates an interface that is not the same as, but will
be very friendly to, anyone coming from R or scikit-learn, or similar tools
in JavaScript.

There are of course many ways I can see to elaborate this scheme, say using
RegressionResult objects and so forth. But Matrices paired with a double[],
returning a double[] of coefficients or predictions, are likely to be the
most common use cases and should be plenty to get started.
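
To make that shape concrete, here is a minimal sketch, assuming a tiny
stand-in matrix type and ordinary least squares solved through the normal
equations. All names (DesignMatrix, RegressionSketch, OlsSketch) are
hypothetical and are not the proposed API; a real version would delegate
the linear algebra to an existing implementation.

```java
// Sketch only: every type and method name here is hypothetical, not the proposed API.

/** Stand-in for the "Matrix": one row per observation, one column per feature. */
final class DesignMatrix {
    private final double[][] data;

    DesignMatrix(double[][] data) {
        this.data = data;
    }

    int rows() { return data.length; }
    int cols() { return data[0].length; }
    double get(int r, int c) { return data[r][c]; }
}

/** The shape described above: a matrix plus a double[] in, a double[] out. */
interface RegressionSketch {
    /** Returns the fitted coefficients, intercept first. */
    double[] fit(DesignMatrix x, double[] y);

    /** Returns one prediction per row of x, using the last fit. */
    double[] predict(DesignMatrix x);
}

/** Ordinary least squares via the normal equations; a demo, not numerically robust. */
final class OlsSketch implements RegressionSketch {
    private double[] beta;

    @Override
    public double[] fit(DesignMatrix x, double[] y) {
        int p = x.cols() + 1; // +1 for the implicit intercept column
        double[][] a = new double[p][p + 1]; // augmented system [X'X | X'y]
        for (int i = 0; i < x.rows(); i++) {
            for (int j = 0; j < p; j++) {
                double xij = (j == 0) ? 1.0 : x.get(i, j - 1);
                for (int k = 0; k < p; k++) {
                    double xik = (k == 0) ? 1.0 : x.get(i, k - 1);
                    a[j][k] += xij * xik;
                }
                a[j][p] += xij * y[i];
            }
        }
        beta = solve(a);
        return beta.clone();
    }

    @Override
    public double[] predict(DesignMatrix x) {
        if (beta == null) {
            throw new IllegalStateException("fit must be called first");
        }
        double[] out = new double[x.rows()];
        for (int i = 0; i < out.length; i++) {
            out[i] = beta[0];
            for (int j = 0; j < x.cols(); j++) {
                out[i] += beta[j + 1] * x.get(i, j);
            }
        }
        return out;
    }

    /** Gauss-Jordan elimination with partial pivoting; no check for singular systems. */
    private static double[] solve(double[][] a) {
        int p = a.length;
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) {
                    continue;
                }
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] solution = new double[p];
        for (int i = 0; i < p; i++) {
            solution[i] = a[i][p] / a[i][i];
        }
        return solution;
    }
}
```

Usage would be along the lines of new OlsSketch().fit(x, y) followed by
predict(newX), which is roughly the fit/predict rhythm of scikit-learn.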

Under the hood I would use the available implementations in commons-math to
get up and running, and worry about improving them later.

Feedback appreciated,
Eric

[image: image.png]

Re: [statistics-regression] Proposed Regression class/method structure

Posted by Eric Barnhill <er...@gmail.com>.
Here is a link to the picture

https://imgur.com/a/9jjoOGB

On Tue, Oct 22, 2019 at 4:13 PM Gilles Sadowski <gi...@gmail.com>
wrote:

> Hello.
>
> On Tue, 22 Oct 2019 at 21:50, Eric Barnhill <er...@gmail.com>
> wrote:
> >
> > I propose the following class structure for commons-statistics-regression.
>
> Which?
> [Attachment was probably stripped: such should go to a JIRA report.]
>
> > The interface carried over from commons-math is more of an academic
> > approach to thinking about regression. For rebooting the library (and I
> > hinted at this when I wrote the tickets for summer of code) I was hoping to
> > emulate widespread tools like R and scikit-learn, and consider that
> > "machine learning" is an increasingly popular use of regression. This
> > proposed structure creates an interface that is not the same as, but will
> > be very friendly to, anyone coming from R or scikit-learn, or similar tools
> > in JavaScript.
> >
> > There are of course many ways I can see to elaborate this scheme, say
> > using RegressionResult objects and so forth. But Matrices paired with a
> > double[], returning a double[] of coefficients or predictions, are likely
> > to be the most common use cases and should be plenty to get started.
>
> Commenting perhaps too early (not seeing the proposed design), but we
> broadly discussed that the linear algebra API is not easy to get right,
> and once we "get started", the trend is to be stuck with it for ages
> (related issues are among the oldest unresolved ones in CM).
>
> > Under the hood I would use the available implementations in commons-math
> > to get up and running, and worry about improving them later.
>
> Do you mean port from, or depend on, CM?
>
> Regards,
> Gilles
>
> >
> > Feedback appreciated,
> > Eric
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>
>

Re: [statistics-regression] Proposed Regression class/method structure

Posted by Alex Herbert <al...@gmail.com>.
On 23/10/2019 00:13, Gilles Sadowski wrote:
> Hello.
>
> On Tue, 22 Oct 2019 at 21:50, Eric Barnhill <er...@gmail.com> wrote:
>> I propose the following class structure for commons-statistics-regression.
> Which?
> [Attachment was probably stripped: such should go to a JIRA report.]

Quick first thoughts on the method names:

LinearRegression::RSquared

LogisticRegression::predictionProbs


Are these computing methods or property getters? I assume that all the 
computation is done in the methods:

Regression::fit

Regression::predict(double[])


Thus the methods in the implementation classes access additional results
specific to the method. So they should be:

LinearRegression::getRSquared

LogisticRegression::getPredictionProbabilities(double[])
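
To illustrate the distinction, here is a rough sketch for the linear case,
assuming a single predictor. The class name and the fit(double[], double[])
signature are assumptions made for illustration; only the names fit,
predict and getRSquared come from the discussion above. fit does all the
work, and getRSquared only returns what fit cached.

```java
// Hypothetical sketch: method names follow the suggestions above, but the
// fit(double[], double[]) signature and the class name are assumptions.
final class SimpleLinearRegressionSketch {
    private double intercept;
    private double slope;
    private double rSquared;
    private boolean fitted;

    /** Computing method: every statistic exposed by a getter is calculated here. */
    void fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two paired observations");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxx = 0.0;
        double sxy = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            sxy += dx * dy;
            syy += dy * dy;
        }
        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
        // For a single predictor, R^2 equals the squared sample correlation.
        rSquared = (sxy * sxy) / (sxx * syy);
        fitted = true;
    }

    /** Computing method: applies the fitted line to new inputs. */
    double[] predict(double[] x) {
        requireFit();
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = intercept + slope * x[i];
        }
        return out;
    }

    /** Property getter: returns the value cached by fit, with no recomputation. */
    double getRSquared() {
        requireFit();
        return rSquared;
    }

    private void requireFit() {
        if (!fitted) {
            throw new IllegalStateException("fit must be called first");
        }
    }
}
```

With this split, calling getRSquared before fit is a usage error rather
than a trigger for hidden computation, which is one reason the "get" prefix
seems appropriate.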


>> The interface carried over from commons-math is more of an academic approach to thinking about regression. For rebooting the library (and I hinted at this when I wrote the tickets for summer of code) I was hoping to emulate widespread tools like R and scikit-learn, and consider that "machine learning" is an increasingly popular use of regression. This proposed structure creates an interface that is not the same as, but will be very friendly to, anyone coming from R or scikit-learn, or similar tools in JavaScript.
>>
>> There are of course many ways I can see to elaborate this scheme, say using RegressionResult objects and so forth. But Matrices paired with a double[], returning a double[] of coefficients or predictions, are likely to be the most common use cases and should be plenty to get started.
> Commenting perhaps too early (not seeing the proposed design), but we broadly
> discussed that the linear algebra API is not easy to get right, and once we "get
> started", the trend is to be stuck with it for ages (related issues are among the
> oldest unresolved ones in CM).
>
>> Under the hood I would use the available implementations in commons-math to get up and running, and worry about improving them later.
> Do you mean port from, or depend on, CM?

I assume that the Matrix object in the API is a new interface for
commons-statistics, thus allowing the underlying implementation to be
pluggable. The initial version could include a shaded library to use
whatever is appropriate.
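
As a rough illustration of what "pluggable" could mean here, the sketch
below assumes a very small read-only interface with two interchangeable
backends. The names MatrixSketch, DenseMatrixSketch, DiagonalMatrixSketch
and MatrixOps are all hypothetical; a backend wrapping a shaded third-party
library would simply be another implementation of the same interface.

```java
// Hypothetical sketch of a pluggable matrix interface; none of these names
// are part of commons-statistics.

/** The only type the regression code would need to know about. */
interface MatrixSketch {
    int rows();
    int cols();
    double get(int row, int col);
}

/** Backend 1: plain row-major array storage. */
final class DenseMatrixSketch implements MatrixSketch {
    private final double[][] data;

    DenseMatrixSketch(double[][] data) {
        this.data = data;
    }

    @Override
    public int rows() { return data.length; }

    @Override
    public int cols() { return data[0].length; }

    @Override
    public double get(int row, int col) { return data[row][col]; }
}

/** Backend 2: stores only the diagonal; off-diagonal entries are implicitly zero. */
final class DiagonalMatrixSketch implements MatrixSketch {
    private final double[] diagonal;

    DiagonalMatrixSketch(double[] diagonal) {
        this.diagonal = diagonal;
    }

    @Override
    public int rows() { return diagonal.length; }

    @Override
    public int cols() { return diagonal.length; }

    @Override
    public double get(int row, int col) {
        return row == col ? diagonal[row] : 0.0;
    }
}

/** Code written against the interface works with either backend. */
final class MatrixOps {
    private MatrixOps() {
    }

    static double[] multiply(MatrixSketch m, double[] v) {
        if (v.length != m.cols()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] out = new double[m.rows()];
        for (int i = 0; i < m.rows(); i++) {
            for (int j = 0; j < m.cols(); j++) {
                out[i] += m.get(i, j) * v[j];
            }
        }
        return out;
    }
}
```

The element-wise get keeps the sketch short; a real interface would
probably expose bulk operations so that each backend can use its own
optimized routines.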

Alex


>
> Regards,
> Gilles
>
>> Feedback appreciated,
>> Eric


Re: [statistics-regression] Proposed Regression class/method structure

Posted by Gilles Sadowski <gi...@gmail.com>.
Hi.

2019-10-28 22:12 UTC+01:00, Alex Herbert <al...@gmail.com>:
>
>
>> On 28 Oct 2019, at 17:55, Eric Barnhill <er...@gmail.com> wrote:
>>
>> Here is a schematic for how the interface might be made more abstract.
>>
>> https://imgur.com/a/izx5Xkh
>
> Regression and RegressionResults both have a predict method with the same
> signature.
>
>>
>> In this case, we may want to just implement the simplest case, using
>> Matrix
>> and double[], for now.
>>
>> In principle the RegressionMetric class could extend a Metrics class
>> later.
>>
>> Do you feel this would set up the library better for the future?
>
> I know that the use case for a diagonal matrix only was put forward
> previously. So I can see the Matrix abstraction as useful. But should this
> then be Matrix<E>?
>
> You have Vector<E> for most methods to pass a 1D set of numeric data. But
> the RegressionData.of method accepts a double[]. This should also be
> Vector<E>.
>
> I am assuming that Vector is an abstraction of a 1D data object.
>
> What are the possible values for <E>?
>
> Double
> double[]
> Possibly complex numbers.
> … ?
>
> Such that Matrix<E> and Vector<E> just denote that the analysis is done on a
> matrix and vector of the same type.
>
> This would then require abstraction of all operations required by the
> regression objects such as:
>
> Vector<E> = Matrix<E>.multiply(Vector<E>)
> Matrix<E> = Matrix<E>.transpose()

Is "Vector" necessary?

Gilles

> Etc.
>
> Then you start by using concrete classes for Matrix<double[]> and
> Vector<double[]>.
>
> I see that the nomenclature Matrix<double[]> is a bit of a misnomer as it
> may be confused for Matrix<double[][]>. So it would be documented that <E>
> is the type of the entire data for a single matrix dimension. The matrix is
> actually an E[].
>
>
>>
>> Eric
>
>



Re: [statistics-regression] Proposed Regression class/method structure

Posted by Alex Herbert <al...@gmail.com>.

> On 28 Oct 2019, at 17:55, Eric Barnhill <er...@gmail.com> wrote:
> 
> Here is a schematic for how the interface might be made more abstract.
> 
> https://imgur.com/a/izx5Xkh

Regression and RegressionResults both have a predict method with the same signature.

> 
> In this case, we may want to just implement the simplest case, using Matrix
> and double[], for now.
> 
> In principle the RegressionMetric class could extend a Metrics class later.
> 
> Do you feel this would set up the library better for the future?

I know that the use case for a diagonal matrix only was put forward previously. So I can see the Matrix abstraction as useful. But should this then be Matrix<E>?

You have Vector<E> for most methods to pass a 1D set of numeric data. But the RegressionData.of method accepts a double[]. This should also be Vector<E>.

I am assuming that Vector is an abstraction of a 1D data object.

What are the possible values for <E>? 

Double
double[]
Possibly complex numbers.
… ?

Such that Matrix<E> and Vector<E> just denote that the analysis is done on a matrix and vector of the same type. 

This would then require abstraction of all operations required by the regression objects such as:

Vector<E> = Matrix<E>.multiply(Vector<E>)
Matrix<E> = Matrix<E>.transpose()

Etc.

Then you start by using concrete classes for Matrix<double[]> and Vector<double[]>.

I see that the nomenclature Matrix<double[]> is a bit of a misnomer as it may be confused for Matrix<double[][]>. So it would be documented that <E> is the type of the entire data for a single matrix dimension. The matrix is actually an E[].
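
A minimal sketch of that idea, assuming E = double[] and only the two
operations listed above. The interface and class names are hypothetical
(chosen to avoid clashing with any real Matrix or Vector type), and the
matrix is stored as an E[], i.e. one double[] per row.

```java
// Hypothetical sketch of Matrix<E>/Vector<E>; names and method sets are assumptions.

/** Abstraction of a 1D data object whose whole payload has type E. */
interface VectorSketch<E> {
    E data();
    int size();
}

/** Matrix over the same payload type E; conceptually backed by an E[]. */
interface GenericMatrixSketch<E> {
    VectorSketch<E> multiply(VectorSketch<E> v);
    GenericMatrixSketch<E> transpose();
}

/** Concrete Vector<double[]>. */
final class DoubleVectorSketch implements VectorSketch<double[]> {
    private final double[] values;

    DoubleVectorSketch(double... values) {
        this.values = values.clone();
    }

    @Override
    public double[] data() { return values.clone(); }

    @Override
    public int size() { return values.length; }
}

/** Concrete Matrix<double[]>: the storage is a double[][], one double[] per row. */
final class DoubleMatrixSketch implements GenericMatrixSketch<double[]> {
    private final double[][] rows;

    DoubleMatrixSketch(double[][] rows) {
        this.rows = rows;
    }

    @Override
    public VectorSketch<double[]> multiply(VectorSketch<double[]> v) {
        if (v.size() != rows[0].length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double[] x = v.data();
        double[] out = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < x.length; j++) {
                out[i] += rows[i][j] * x[j];
            }
        }
        return new DoubleVectorSketch(out);
    }

    @Override
    public GenericMatrixSketch<double[]> transpose() {
        int r = rows.length;
        int c = rows[0].length;
        double[][] t = new double[c][r];
        for (int i = 0; i < r; i++) {
            for (int j = 0; j < c; j++) {
                t[j][i] = rows[i][j];
            }
        }
        return new DoubleMatrixSketch(t);
    }
}
```

Regression code written against GenericMatrixSketch<E> and VectorSketch<E>
never touches E directly, which is what would let another payload type
(complex numbers, say) be added later without changing that code.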


> 
> Eric


Re: [statistics-regression] Proposed Regression class/method structure

Posted by Eric Barnhill <er...@gmail.com>.
On Mon, Oct 28, 2019 at 3:01 PM Gilles Sadowski <gi...@gmail.com>
wrote:

> Hi Eric.
>
> 2019-10-28 18:55 UTC+01:00, Eric Barnhill <er...@gmail.com>:
> > Here is a schematic for how the interface might be made more abstract.
> >
> > https://imgur.com/a/izx5Xkh
>
> This cannot be downloaded.
> Please attach the image to a JIRA issue.
>
> Regards,
> Gilles
>
It is attached to STATISTICS-8.

As for whether Vector is necessary: the idea was to sketch out an interface
that was more abstract. Maybe Vector is a bit too abstract in a Java
context, although it's a pretty common container in many languages.

With more time to ponder, my vote is just to use EJML Matrix and double[]
as I proposed in the first scheme. Any use cases for which Matrix and
double[] will not suffice would be quite far off and I suspect this simple
approach will be sufficient for the commons mission.

Eric

Re: [statistics-regression] Proposed Regression class/method structure

Posted by Gilles Sadowski <gi...@gmail.com>.
Hi Eric.

2019-10-28 18:55 UTC+01:00, Eric Barnhill <er...@gmail.com>:
> Here is a schematic for how the interface might be made more abstract.
>
> https://imgur.com/a/izx5Xkh

This cannot be downloaded.
Please attach the image to a JIRA issue.

Regards,
Gilles

>
> In this case, we may want to just implement the simplest case, using Matrix
> and double[], for now.
>
> In principle the RegressionMetric class could extend a Metrics class later.
>
> Do you feel this would set up the library better for the future?
>
> Eric
>



Re: [statistics-regression] Proposed Regression class/method structure

Posted by Eric Barnhill <er...@gmail.com>.
Here is a schematic for how the interface might be made more abstract.

https://imgur.com/a/izx5Xkh

In this case, we may want to just implement the simplest case, using Matrix
and double[], for now.

In principle the RegressionMetric class could extend a Metrics class later.

Do you feel this would set up the library better for the future?

Eric

Re: [statistics-regression] Proposed Regression class/method structure

Posted by Gilles Sadowski <gi...@gmail.com>.
Hello.

On Tue, 22 Oct 2019 at 21:50, Eric Barnhill <er...@gmail.com> wrote:
>
> I propose the following class structure for commons-statistics-regression.

Which?
[Attachment was probably stripped: such should go to a JIRA report.]

> The interface carried over from commons-math is more of an academic approach to thinking about regression. For rebooting the library (and I hinted at this when I wrote the tickets for summer of code) I was hoping to emulate widespread tools like R and scikit-learn, and consider that "machine learning" is an increasingly popular use of regression. This proposed structure creates an interface that is not the same as, but will be very friendly to, anyone coming from R or scikit-learn, or similar tools in JavaScript.
>
> There are of course many ways I can see to elaborate this scheme, say using RegressionResult objects and so forth. But Matrices paired with a double[], returning a double[] of coefficients or predictions, are likely to be the most common use cases and should be plenty to get started.

Commenting perhaps too early (not seeing the proposed design), but we broadly
discussed that the linear algebra API is not easy to get right, and once we "get
started", the trend is to be stuck with it for ages (related issues are among the
oldest unresolved ones in CM).

> Under the hood I would use the available implementations in commons-math to get up and running, and worry about improving them later.

Do you mean port from, or depend on, CM?

Regards,
Gilles

>
> Feedback appreciated,
> Eric
