Posted to dev@commons.apache.org by Phil Steitz <ph...@gmail.com> on 2006/01/01 21:57:59 UTC

[math] JSR 247: Data Mining 2.0

There is an early draft available for download here:
http://jcp.org/aboutJava/communityprocess/edr/jsr247/index.html

I have just started reviewing this.  We may want to provide some
collective input, as there is some overlap with what we have
already implemented in .stat.  I am willing to collect and
consolidate feedback if there is interest in providing it to the
JSR expert group (EG).  If we decide to go further into data mining,
we will want to look at this very carefully.

I will also review and apply patches in /experimental if anyone wants
to start experimenting with providing a [math]-based implementation of
some part of the spec.

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

Re: [math] JSR 247: Data Mining 2.0

Posted by Mark Diggory <md...@apache.org>.

Phil Steitz wrote:

>On 1/2/06, Mark Diggory <md...@apache.org> wrote:
>
>>Phil,
>>
>>This is a great idea as a specification and standard. We currently have
>>a service in our project which does something similar, but it's mostly
>>implemented in Perl and R.
>
>What project would that be?

My primary employment at the moment, at Harvard: the Virtual Data Center
project
[http://www.thedata.org] [http://www.sourceforge.net/projects/thedata]

>>I wonder though, how much of it would be implemented at the database
>>level vs. in the application. For instance, in doing a transform that
>>returned a subset of a dataset from a db, it would be much more efficient
>>to do it at the db level (in the query) than in the application itself.
>
>The spec being developed is focussed on the analytical / statistical
>side rather than OLAP and also aims to be implementation-independent
>(i.e., what is really being standardized is the API for vendors to
>implement and client apps to use).  That said, your point is valid -
>it may be difficult to optimize implementation of some functions when
>the db engine can / should do much of the work natively.
>
>>But I like as well the idea of a standalone Java-based implementation
>>too (maybe on HSQLDB), or perhaps there's a direction that could be taken
>>with Hibernate as well.
>
>As noted above, the functional areas being considered are more
>analytical - regression, clustering, classification, feature
>extraction, etc.  The overlap with [math] is in the statistical stuff.
>
>Phil
>

Very true. We can explore implementations of the algorithms; I'm sure
they would be useful in the stat library. I point out HSQLDB because it
can call Java functions directly and use them in stored procedures, etc.
See:

http://hsqldb.org/doc/guide/ch09.html#stored-section

I could see Commons Math libraries being very effective in this
setting if done right, though I'm still learning whether the same can
be done in HSQLDB with updating aggregate functions, the way it can
with static methods.

-Mark

Re: [math] JSR 247: Data Mining 2.0

Posted by Phil Steitz <ph...@gmail.com>.

On 1/2/06, Mark Diggory <md...@apache.org> wrote:
> Phil,
>
> This is a great idea as a specification and standard. We currently have
> a service in our project which does something similar, but it's mostly
> implemented in Perl and R.

What project would that be?

> I wonder though, how much of it would be implemented at the database
> level vs. in the application. For instance, in doing a transform that
> returned a subset of a dataset from a db, it would be much more efficient
> to do it at the db level (in the query) than in the application itself.

The spec being developed is focussed on the analytical / statistical
side rather than OLAP and also aims to be implementation-independent
(i.e., what is really being standardized is the API for vendors to
implement and client apps to use).  That said, your point is valid -
it may be difficult to optimize implementation of some functions when
the db engine can / should do much of the work natively.

> But I like as well the idea of a standalone Java-based implementation
> too (maybe on HSQLDB), or perhaps there's a direction that could be taken
> with Hibernate as well.
>
As noted above, the functional areas being considered are more
analytical - regression, clustering, classification, feature
extraction, etc.  The overlap with [math] is in the statistical stuff.

Phil


Re: [math] JSR 247: Data Mining 2.0

Posted by Mark Diggory <md...@apache.org>.

Phil,

This is a great idea as a specification and standard. We currently have
a service in our project which does something similar, but it's mostly
implemented in Perl and R.

I wonder though, how much of it would be implemented at the database
level vs. in the application. For instance, in doing a transform that
returned a subset of a dataset from a db, it would be much more efficient
to do it at the db level (in the query) than in the application itself.
But I like as well the idea of a standalone Java-based implementation
too (maybe on HSQLDB), or perhaps there's a direction that could be taken
with Hibernate as well.

-Mark

Phil Steitz wrote:

>There is an early draft available for download here:
>http://jcp.org/aboutJava/communityprocess/edr/jsr247/index.html
>
>I have just started reviewing this.  We may want to provide some
>collective input to this, as there is some overlap with what we have
>already implemented in .stat.   I am willing to collect and
>consolidate feedback if there is interest in providing this to the EG.
> If we decide to go further into data mining, we will want to look at
>this very carefully.
>
>I will also review and apply patches in /experimental if anyone wants
>to start experimenting with providing a [math]-based implementation of
>some part of the spec.
>
>Phil

Re: [math] JSR 247: Data Mining 2.0

Posted by Phil Steitz <ph...@gmail.com>.

On 1/2/06, John Gant <jo...@gmail.com> wrote:
> Reviewed the specification, and can say that it seems to contain some
> nice algorithms. I think [math] could add some very important
> methodologies for time series analysis (for instance smoothing
> algorithms, AR, MA, ARMA (if desired), and other decomposition
> methodologies). Phil, how can [math] contribute to this specification?

I am still studying the spec, so can't yet comment fully, but in
general, I can see two ways for us to get involved:

1. Contribute to the spec itself - i.e., give feedback on the
structure and content of the API
2. Implement portions of the spec or provide wrappers for [math]
components that provide some of the functionality described by the
spec

The comment period for the "Early Draft Review" closes 11 Jan, so if
we want to get involved in 1., we should start ASAP.  My only general
comment so far is that the actors targeted by the spec appear to be
essentially "data mining vendors" and "API users", so there is not as
much mix-and-match pluggability in the API as we might like to see in
[math].  In other words, "vendors" like us who want to provide
pluggability at multiple levels may not have the flexibility that we
would like.  This is based on a very preliminary review, however, and
I may change my mind once I have worked more with the API and more
fully digested the spec.

> Noticed that the distance measures (within clustering algorithms) are
> pluggable but didn't see a list of distance measures in this spec.
> Should [math] create or contribute to this list?

This is a good example illustrating how we should be thinking about
the spec.  The first question to ask is whether the API is sufficient
to provide all of the implementation flexibility that the various
clustering algorithms are going to need.  We discussed this same topic
a while back.  Assuming the answer is "yes", no feedback is necessary
(for that part of the spec) and we can plow ahead creating some
distance measure implementations; these would be part of our "vendor
implementation".  The benefit of taking this approach is that our
metrics would then become (independently) useful to a broader audience
than our own clustering implementations (as would the clustering impls
themselves, if they implement the spec API).
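
To make "pluggable" concrete, here is a minimal sketch of the idea.  It is
not taken from the JSR 247 draft or from [math]: DistanceMeasure,
EuclideanDistance, ManhattanDistance and NearestCentroid are hypothetical
names, and the spec's actual interfaces may look quite different.  The
point is only that the clustering step depends on an interface, so the
metric can be swapped without touching the algorithm.

```java
/** Hypothetical pluggable metric; not the JSR 247 or [math] API. */
interface DistanceMeasure {
    double compute(double[] a, double[] b);
}

/** Straight-line (L2) distance. */
final class EuclideanDistance implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}

/** City-block (L1) distance. */
final class ManhattanDistance implements DistanceMeasure {
    @Override
    public double compute(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}

/** Assignment step of a centroid-based clusterer, with the metric injected. */
final class NearestCentroid {
    private final DistanceMeasure measure;

    NearestCentroid(DistanceMeasure measure) {
        this.measure = measure;
    }

    /** Index of the centroid closest to point under the injected metric. */
    int assign(double[] point, double[][] centroids) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("no centroids");
        }
        int best = 0;
        double bestDistance = measure.compute(point, centroids[0]);
        for (int i = 1; i < centroids.length; i++) {
            double d = measure.compute(point, centroids[i]);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

With centroids (3,3) and (5,0), the point (0,0) is assigned to the first
centroid under the Euclidean metric and to the second under the Manhattan
metric, so the choice of measure changes the clustering.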

Phil


Re: [math] JSR 247: Data Mining 2.0

Posted by John Gant <jo...@gmail.com>.

Reviewed the specification, and can say that it seems to contain some
nice algorithms. I think [math] could add some very important
methodologies for time series analysis (for instance smoothing
algorithms, AR, MA, ARMA (if desired), and other decomposition
methodologies). Phil, how can [math] contribute to this specification?
Noticed that the distance measures (within clustering algorithms) are
pluggable but didn't see a list of distance measures in this spec.
Should [math] create or contribute to this list?
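
As a rough idea of the smoothing algorithms mentioned above, here is a
minimal sketch of two of the simplest: a trailing simple moving average
and simple exponential smoothing. The Smoothing class and its method
names are hypothetical and are not part of [math] or the spec.

```java
/** Hypothetical smoothing helpers; not part of [math] or JSR 247. */
final class Smoothing {

    private Smoothing() {
        // static utility class, no instances
    }

    /**
     * Simple moving average over a trailing window.
     * The result has series.length - window + 1 entries.
     */
    static double[] movingAverage(double[] series, int window) {
        if (window < 1 || window > series.length) {
            throw new IllegalArgumentException("window must be in [1, series.length]");
        }
        double[] out = new double[series.length - window + 1];
        double sum = 0.0;
        for (int i = 0; i < series.length; i++) {
            sum += series[i];
            if (i >= window) {
                sum -= series[i - window]; // drop the value leaving the window
            }
            if (i >= window - 1) {
                out[i - window + 1] = sum / window;
            }
        }
        return out;
    }

    /**
     * Simple exponential smoothing:
     * s[0] = x[0], s[t] = alpha * x[t] + (1 - alpha) * s[t - 1].
     */
    static double[] exponential(double[] series, double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        double[] out = new double[series.length];
        if (series.length == 0) {
            return out;
        }
        out[0] = series[0];
        for (int t = 1; t < series.length; t++) {
            out[t] = alpha * series[t] + (1.0 - alpha) * out[t - 1];
        }
        return out;
    }
}
```

For example, Smoothing.movingAverage(new double[] {1, 2, 3, 4, 5}, 3)
yields {2, 3, 4}, and Smoothing.exponential(new double[] {2, 4, 6}, 0.5)
yields {2, 3, 4.5}.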

John

On 1/1/06, Phil Steitz <ph...@gmail.com> wrote:
> There is an early draft available for download here:
> http://jcp.org/aboutJava/communityprocess/edr/jsr247/index.html
>
> I have just started reviewing this.  We may want to provide some
> collective input to this, as there is some overlap with what we have
> already implemented in .stat.   I am willing to collect and
> consolidate feedback if there is interest in providing this to the EG.
>  If we decide to go further into data mining, we will want to look at
> this very carefully.
>
> I will also review and apply patches in /experimental if anyone wants
> to start experimenting with providing a [math]-based implementation of
> some part of the spec.
>
> Phil

--
John Gant
