
Posted to user@commons.apache.org by Phil Steitz <ps...@apache.org> on 2009/08/08 02:08:30 UTC

[ANNOUNCEMENT] Apache Commons Math 2.0 Released

The Apache Commons team is pleased to announce the release of version 
2.0 of Commons Math.  Commons Math is a library of lightweight, 
self-contained mathematics and statistics components addressing the most 
common problems not available in the Java programming language or 
Commons Lang.

Version 2.0 is a major release, including bug fixes, new features and 
enhancements to existing features.  Most notable among the new features 
are matrix decomposition algorithms, sparse matrices and vectors, 
genetic algorithms, new optimization algorithms, curve fitting 
algorithms, state derivatives in ODE step handlers, new multistep
integrators, multiple regression, correlation, rank transformations and
a Mersenne Twister pseudo-random number generator.

This release is NOT source and binary compatible with earlier versions 
of Commons Math.  Starting with version 2.0 of the  library, the minimal 
version of the Java platform required to compile and use commons-math is 
Java 5. 

Source and binary distributions are available for download from the 
Apache Commons Math download site:
http://commons.apache.org/downloads/download_math.cgi

Please verify signatures using the KEYS file available at the above 
location when downloading the release.

Maven users, please note that the Maven repository groupId for Commons
Math has changed in version 2.0 to "org.apache.commons".  The artifactId
remains "commons-math".

For more information on Apache Commons Math, visit the Math home page:
http://commons.apache.org/math/

Feedback, suggestions for improvement or bug reports are welcome via the
"Mailing Lists" and "Issue Tracking" links here:
http://commons.apache.org/math/project-info.html

Phil Steitz
- On behalf of the Apache Commons community



Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Oct 3, 2009 at 1:59 PM, Jake Mannix <ja...@gmail.com> wrote:

> What's more problematic is that Vector doesn't have any iterator methods,
> either dense or nonZero.  This means we have to subclass as well as
> wrap, which is a total pain.
>

I am sure that they would be open for patches on this.

>  They also don't have the equivalent of the OrderedIntDoubleVector that
> Ted has in his patch ( MAHOUT-165 ),
>

I just changed the name.  Didn't write the code.  (credit where due and all
that)

Immutable OrderedIntDoubleVectors also have the potential to be the lowest
memory overhead around.

> Our apis currently may be "ad-hoc", but as someone who's had to write
> a vector api quite a few times in my life now, I like the one you guys
> came up with here a lot better than the one in cmath (not quite
> as nice as Colt's, but with better names).
>

Aw shucks.

But again, I bet that the math guys would love the feedback and patches to
make it all better.

-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Sat, Oct 3, 2009 at 7:44 AM, Sean Owen <sr...@gmail.com> wrote:

> I looked at the APIs just now and thought they were pretty good --
> yes, RealVector seems a little overdone perhaps. It does seem more
> complete and planned than the serviceable but ad-hoc APIs developed in
> Mahout to date. But implementations are provided no? It does make the
> job of writing a Writable wrapper implementation a little harder but
> hey in my IDE it's still one click to create the skeleton.
>

Some of RealVector is bad because of the silly mapXXX/mapXXXtoSelf
which should be replaced with (like our own) map(UnaryFunction f) /
mapToSelf(UnaryFunction f), which isn't so bad, just ugly.
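
A minimal sketch of that single-method style, assuming a hypothetical
UnaryFunction interface and a toy DenseVector class (neither is taken from
the Commons Math or Mahout sources): one map/mapToSelf pair stands in for
the whole family of per-operation methods.

```java
import java.util.Arrays;

/** Hypothetical functional interface: one double in, one double out. */
interface UnaryFunction {
    double apply(double x);
}

/** Toy dense vector, for illustration only. */
final class DenseVector {
    private final double[] values;

    DenseVector(double... values) {
        this.values = values.clone();
    }

    /** Returns a new vector with f applied to every element. */
    DenseVector map(UnaryFunction f) {
        return new DenseVector(values).mapToSelf(f);
    }

    /** Applies f to every element in place and returns this vector. */
    DenseVector mapToSelf(UnaryFunction f) {
        for (int i = 0; i < values.length; i++) {
            values[i] = f.apply(values[i]);
        }
        return this;
    }

    double get(int index) {
        return values[index];
    }

    @Override
    public String toString() {
        return Arrays.toString(values);
    }
}
```

A caller passes the operation in (an anonymous UnaryFunction on Java 5, a
lambda on newer JVMs), so adding a new element-wise operation does not
widen the vector interface.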

What's more problematic is that Vector doesn't have any iterator methods,
either dense or nonZero.  This means we have to subclass as well as
wrap, which is a total pain.

They also don't have the equivalent of the OrderedIntDoubleVector that
Ted has in his patch ( MAHOUT-165 ), which is definitely by far the fastest
implementation of immutable sparse vectors which don't need random
access.  We can always add that however, not a big problem.
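
To illustrate why an ordered, immutable sparse vector needs no random
access, here is a rough sketch (the class name SortedSparseVector and its
layout are assumptions for illustration, not the code from MAHOUT-165): the
non-zero entries sit in two parallel arrays sorted by index, and the dot
product is a single merge pass over both vectors.

```java
/** Hypothetical immutable sparse vector: parallel arrays, indices strictly increasing. */
final class SortedSparseVector {
    private final int[] indices;
    private final double[] values;

    SortedSparseVector(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values differ in length");
        }
        for (int i = 1; i < indices.length; i++) {
            if (indices[i - 1] >= indices[i]) {
                throw new IllegalArgumentException("indices must be strictly increasing");
            }
        }
        // Defensive copies keep the vector immutable after construction.
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Dot product by one merge pass over both index arrays. */
    double dot(SortedSparseVector other) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] < other.indices[j]) {
                i++;
            } else if (indices[i] > other.indices[j]) {
                j++;
            } else {
                sum += values[i] * other.values[j];
                i++;
                j++;
            }
        }
        return sum;
    }

    /** Number of stored (non-zero) entries. */
    int nonZeroCount() {
        return indices.length;
    }
}
```

In this layout each stored entry costs one int and one double, with no
per-entry objects or hash buckets.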

Our apis currently may be "ad-hoc", but as someone who's had to write
a vector api quite a few times in my life now, I like the one you guys
came up with here a lot better than the one in cmath (not quite
as nice as Colt's, but with better names).

  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

Of course.  This is now pretty easy if the dictionary is separated.  The
labeled matrix only cares about object identity of dictionary, not about
what kinds of labels are there (except for get and put).

For that matter, it also doesn't care if the LabelDictionary is strictly 1:1
or probabilistically 1:~1.  As you say, allowing collisions judiciously is
very, very powerful.

On Sat, Oct 3, 2009 at 11:30 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> interface LabelDictionary<T> {
>  T getLabel(int index);
>  int getIndex(T label);
> }
>
> allows for LabelDictionary<String> as a choice, but also allows for
> flexibility (such as having the labels, if they're strings or token ngrams,
> keep track of their IDF or number of tokens, or of their underlying type
> [e.g. you're doing regression on some model with a lot of numeric
> parameters, but pre-normalization they all carried different units]).
>

-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Sat, Oct 3, 2009 at 4:08 PM, Ted Dunning <te...@gmail.com> wrote:

> > What do you mean by both "labels as an idea can be separated from
> > matrices" and "matrix operations should be by label rather than by index"?
> > These sound like contradictory statements to me - the latter means that
> > matrices are inherently tied to labels.
> >
>
> Yeah... I think I wrote that poorly.
>
> Let me try again.
>
> If I have two vectors that I think of as word counts, {"a": 100, "b": 20}
> and {"b": 2, "c": 10}, then I absolutely want to have the dot product be 40
> and not 400.  That is, I want the product of the values for "b".
>
> The simplest way to do this is to index using strings.
>

This right here presupposes that your vectors have been created in some way
which are a) sparse (as you mentioned),  b) intrinsically related to their
labels, and c) those labels are Strings.  A common case, but is everything
in the world of machine learning like that?

> Another way to do this is always build sparse vectors using coherent
> integer codes.  Thus, if the count for "b" gets put into location 23 in one
> vector, it will get put into 23 in all other vectors or else they will be
> non-conformable due to a domain exception.  Conversely, no two labels
> should be put into the same location without being identical.
>

I certainly agree on the first point (the labels -> index mapping should be
single-valued), but the latter is again, common, but not required for all
applications: in high enough dimensions, allowing some collisions with
low probability can be fine - pLSI is a good example of how this can work,
and kernelized decompositions are another, I would imagine.  But your
point is taken, typically a 1:1 map is fine, and admittedly, the case where
the map is simply "toString()" <-> "valueOf()" allows for the case where
no Stringy label is needed.

> If our actual implementation of vectors doesn't know about strings, then we
> can build a string dictionary class and a vector wrapper class with a
> reference to a string dictionary.  All operations on the vectors (except
> get/put) would be delegated to the underlying vector operations after a
> check to verify that the string dictionaries for the two wrappers are
> identical.  If the dictionaries are identical then we know that the
> encodings are the same and we don't have to worry about the internals. Get
> and put are special since we have to add string based versions that look up
> the string and use the resulting index.
>
> In this scheme, the vector implementation itself knows nothing about labels
> and yet all operations proceed as if it did.
>

Ok, this is reasonable - but going this way we shouldn't tie the labels to
being Strings - something like a generic

interface LabelDictionary<T> {
  T getLabel(int index);
  int getIndex(T label);
}

allows for LabelDictionary<String> as a choice, but also allows for
flexibility (such as having the labels, if they're strings or token ngrams,
keep track of their IDF or number of tokens, or of their underlying type
[e.g. you're doing regression on some model with a lot of numeric
parameters, but pre-normalization they all carried different units]).
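
One possible implementation of that interface, as a sketch only: the class
name GrowingLabelDictionary and the policy of handing out the next free
index to an unseen label are assumptions, since the interface itself does
not say what getIndex should do for unknown labels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface LabelDictionary<T> {
    T getLabel(int index);
    int getIndex(T label);
}

/** Hypothetical growable 1:1 dictionary: a label's index is its insertion order. */
final class GrowingLabelDictionary<T> implements LabelDictionary<T> {
    private final Map<T, Integer> indexByLabel = new HashMap<T, Integer>();
    private final List<T> labelByIndex = new ArrayList<T>();

    public synchronized T getLabel(int index) {
        return labelByIndex.get(index);
    }

    /** Returns the existing index, or assigns the next free one (assumed policy). */
    public synchronized int getIndex(T label) {
        Integer index = indexByLabel.get(label);
        if (index == null) {
            index = labelByIndex.size();
            indexByLabel.put(label, index);
            labelByIndex.add(label);
        }
        return index;
    }

    /** Number of labels seen so far. */
    public synchronized int size() {
        return labelByIndex.size();
    }
}
```

The methods are synchronized because the dictionary is meant to be shared
between vectors; a sealed, immutable variant could drop the locking.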

> Frankly, the difference will be bigger than that because the string
> dictionary needs to be shared and thus concurrency safe.  Because the
> object is shared, it wouldn't even be easy to make the dictionary immutable.
>

Well once it's immutable, it's very thread-safe, so why exactly would it be
hard to make the dictionary immutable?  You make it once, then seal it up...
I guess you're referring to the fact that in the text corpus case, you might
be building up the dictionary in process, and thus logically can't lock it
down?

> I wonder how much would break if a matrix only knew how large it was so
> far.  Or in the case of labeled vectors if it knew how many elements were
> in the shared dictionary so far.
>
> Seems to me like it might just work well.
>

How much breaks if the matrix doesn't even know anything about its
dimensionality?  I've always put checks like that in just to catch
programmer error: you have lots of vectors flying around, living in
different spaces, and accidentally doing
twentyDSparseVector.dot(tenDSparseVector) gives back normal-looking floating
point answers, which leads to hard-to-track-down bugs.  (That is why I've
sometimes coded up my vector classes as being tied to marker interfaces:
Vector<T extends VectorDomain>, so that my IDE can use type-safety to keep
me from being too stupid.  But dealing with collections of these, and
subinterfaces of them, and Matrices of them... can lead to pretty ugly Java
generics code, whose readability is poor enough to overwhelm the benefits,
IMO.)
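
A small sketch of the marker-interface idea, assuming hypothetical names
(VectorDomain is from the text above; TermSpace, DocumentSpace and
TypedVector are made up for illustration): the domain is a type parameter,
so mixing vectors from different spaces is a compile-time error rather than
a silent wrong answer.

```java
/** Marker interface naming the space a vector lives in. */
interface VectorDomain {}

/** Two hypothetical domains, used only as type tags. */
final class TermSpace implements VectorDomain {}
final class DocumentSpace implements VectorDomain {}

/** Toy vector whose type parameter records its domain. */
final class TypedVector<D extends VectorDomain> {
    private final double[] values;

    TypedVector(double... values) {
        this.values = values.clone();
    }

    /** Only accepts vectors from the same domain D; the size check stays at runtime. */
    double dot(TypedVector<D> other) {
        if (other.values.length != values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * other.values[i];
        }
        return sum;
    }
}
```

With these declarations, calling dot on a TypedVector<TermSpace> with a
TypedVector<DocumentSpace> argument is rejected by the compiler.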

> > Defining DomainException instead of CardinalityException, to be thrown
> > when the label sets are different, would be a lot better, as long as we're
> > only requiring, say, that you carry around the *name* of the label set,
> > not the full set, if you are working at the lower level "by index only"
> > apis.
> >
>
> I think a reference should be sufficient.
>

It seems like you're assuming single-JVM operations here - I mention
name (or any other unique identifier) to take care of the Hadoop case
where multiple types of vectors can live on the same machines as well
as different ones.

> Experiment 1:
>
> build a simple label wrapper for commons math or jplasma.
>
> Build a few sample apps to find where it binds (say an in-memory word
> counter, cooccurrence computer and simple SVD implementation).
>
> Success would be had if this could be done in a few hours without changing
> the underlying matrix implementation.
>
> Experiment 2:
>
> Modify the sparse matrix implementation used in (1) to be an extensible
> matrix and make the wrapper query the dictionary for size questions.
>

Ok - effectively test it all at once - first labels, then extendable
cardinalities, sounds like a reasonable experiment...

  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Oct 3, 2009 at 1:49 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Sat, Oct 3, 2009 at 12:30 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Labels are the only thing that scares me.  It may be that we really need
> > to figure out a good answer to that in any case so that labels as an idea
> > can be separated from matrices.
> >
> > The real problem is that matrix operations should be by label rather than
> > index.  If we can somehow make the indexes universal, then we should be
> > OK.  One way to do that is to broaden the idea of conformability in
> > matrices to require that they were created using a common label
> > dictionary for the conformable indexes.
> >
>
> What do you mean by both "labels as an idea can be separated from matrices"
> and "matrix operations should be by label rather than by index"?  These
> sound like contradictory statements to me - the latter means that matrices
> are inherently tied to labels.
>

Yeah... I think I wrote that poorly.

Let me try again.

If I have two vectors that I think of as word counts, {"a": 100, "b": 20}
and {"b": 2, "c": 10}, then I absolutely want to have the dot product be 40
and not 400.  That is, I want the product of the values for "b".

The simplest way to do this is to index using strings.

Another way to do this is always build sparse vectors using coherent integer
codes.  Thus, if the count for "b" gets put into location 23 in one vector,
it will get put into 23 in all other vectors or else they will be non-conformable
due to a domain exception.  Conversely, no two labels should be put into the
same location without being identical.

If our actual implementation of vectors doesn't know about strings, then we
can build a string dictionary class and a vector wrapper class with a
reference to a string dictionary.  All operations on the vectors (except
get/put) would be delegated to the underlying vector operations after a
check to verify that the string dictionaries for the two wrappers are
identical.  If the dictionaries are identical then we know that the
encodings are the same and we don't have to worry about the internals. Get
and put are special since we have to add string based versions that look up
the string and use the resulting index.

In this scheme, the vector implementation itself knows nothing about labels
and yet all operations proceed as if it did.
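
As a rough sketch of this scheme (all class names here are hypothetical and
not taken from Mahout or Commons Math; IllegalArgumentException stands in
for a domain exception): the index-only vector never sees a label, while the
wrapper translates labels for get/put and checks dictionary identity before
delegating anything else.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shared dictionary mapping each string to a stable integer code. */
final class StringDictionary {
    private final Map<String, Integer> codes = new HashMap<String, Integer>();

    synchronized int codeFor(String label) {
        Integer code = codes.get(label);
        if (code == null) {
            code = codes.size();
            codes.put(label, code);
        }
        return code;
    }
}

/** Toy index-only sparse vector; it knows nothing about labels. */
final class IndexVector {
    private final Map<Integer, Double> entries = new HashMap<Integer, Double>();

    void set(int index, double value) {
        entries.put(index, value);
    }

    double get(int index) {
        Double v = entries.get(index);
        return v == null ? 0.0 : v;
    }

    double dot(IndexVector other) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}

/** Wrapper adding label-based get/put plus a dictionary identity check. */
final class LabeledVector {
    private final StringDictionary dictionary;
    private final IndexVector delegate = new IndexVector();

    LabeledVector(StringDictionary dictionary) {
        this.dictionary = dictionary;
    }

    void put(String label, double value) {
        delegate.set(dictionary.codeFor(label), value);
    }

    /** Note: looking up an unseen label assigns it a code as a side effect. */
    double get(String label) {
        return delegate.get(dictionary.codeFor(label));
    }

    double dot(LabeledVector other) {
        // Identity, not equality: the same dictionary object guarantees the same encoding.
        if (dictionary != other.dictionary) {
            throw new IllegalArgumentException("vectors use different label dictionaries");
        }
        return delegate.dot(other.delegate);
    }
}
```

With one shared dictionary, the word-count vectors {"a": 100, "b": 20} and
{"b": 2, "c": 10} give a dot product of 40, because only "b" lands in the
same location in both.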

> I worry about the performance of the current api if we encouraged people to
> always address values in a Vector via get(String label) (which seems to be
> what you're implying if we encourage always using labels not indices).
> What could be a method call and an array access (getQuick(index) ), is
> instead a method call, a HashMap get(String), another method call, a
> bounds-check, and then an array lookup.  Maybe the JIT is smart enough to
> handle most of this, but I'd be surprised if there wasn't a difference here.
>
>

Frankly, the difference will be bigger than that because the string
dictionary needs to be shared and thus concurrency safe.  Because the object
is shared, it wouldn't even be easy to make the dictionary immutable.

I would actually still recommend using get/put based on labels rather than
integers and then recommend also that they use higher level operations for
the most part.

> > Another issue is that some matrices are essentially unbounded (or we do
> > not know the bounds). ...
>
> I'm totally down with you on this one - the current setup where Matrix and
> Vector impls are required to know their final dimensionality at
> construction I certainly find pretty constraining: it requires that I make
> one full pass through my data just to measure how big everything is.
>

I wonder how much would break if a matrix only knew how large it was so
far.  Or in the case of labeled vectors if it knew how many elements were in
the shared dictionary so far.

Seems to me like it might just work well.

> Defining DomainException instead of CardinalityException, to be thrown when
> the label sets are different, would be a lot better, as long as we're only
> requiring, say, that you carry around the *name* of the label set, not the
> full set, if you are working at the lower level "by index only" apis.
>

I think a reference should be sufficient.

> > > What are people's inclinations on this?
> > >
> >
> > Try an experiment?
> >
>
> What kind of experiment?  There are a lot of ideas thrown around -
> relationship between labels and matrices, using CommonsMath underlying apis
> and implementations, separating Writable from Vector/Matrix, unbinding
> cardinalities from instantiation...
>

Experiment 1:

Build a simple label wrapper for commons math or jplasma.

Build a few sample apps to find where it binds (say an in-memory word
counter, cooccurrence computer and simple SVD implementation).

Success would be had if this could be done in a few hours without changing
the underlying matrix implementation.

Experiment 2:

Modify the sparse matrix implementation used in (1) to be an extensible
matrix and make the wrapper query the dictionary for size questions.

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Sat, Oct 3, 2009 at 12:30 PM, Ted Dunning <te...@gmail.com> wrote:

> Labels are the only thing that scares me.  It may be that we really need to
> figure out a good answer to that in any case so that labels as an idea can
> be separated from matrices.
>
> The real problem is that matrix operations should be by label rather than
> index.  If we can somehow make the indexes universal, then we should be OK.
> One way to do that is to broaden the idea of conformability in matrices to
> require that they were created using a common label dictionary for the
> conformable indexes.
>

What do you mean by both "labels as an idea can be separated from matrices"
and "matrix operations should be by label rather than by index"?  These
sound like contradictory statements to me - the latter means that matrices
are inherently tied to labels.

I worry about the performance of the current api if we encouraged people to
always address values in a Vector via get(String label) (which seems to be
what you're implying if we encourage always using labels not indices).  What
could be a method call and an array access (getQuick(index) ), is instead a
method call, a HashMap get(String), another method call, a bounds-check, and
then an array lookup.  Maybe the JIT is smart enough to handle most of this,
but I'd be surprised if there wasn't a difference here.

> Another issue is that some matrices are essentially unbounded (or we do not
> know the bounds).  These matrices must by nature be sparse.  This comes up
> in situations such as a document x term matrix where we do not know how
> many terms there may be, nor how many documents.
>

I'm totally down with you on this one - the current setup where Matrix and
Vector impls are required to know their final dimensionality at construction
I certainly find pretty constraining: it requires that I make one full pass
through my data just to measure how big everything is.

Defining DomainException instead of CardinalityException, to be thrown when
the label sets are different, would be a lot better, as long as we're only
requiring, say, that you carry around the *name* of the label set, not the
full set, if you are working at the lower level "by index only" apis.

> > What are people's inclinations on this?
> >
>
> Try an experiment?
>

What kind of experiment?  There are a lot of ideas thrown around -
relationship between labels and matrices, using CommonsMath underlying apis
and implementations, separating Writable from Vector/Matrix, unbinding
cardinalities from instantiation...

  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

On Sat, Oct 3, 2009 at 7:44 AM, Sean Owen <sr...@gmail.com> wrote:

> Is the idea of 'labels' a key feature it would be missing? again a
> wrapper strategy can probably take care of that, not clear how hard it
> is.
>

Labels are the only thing that scares me.  It may be that we really need to
figure out a good answer to that in any case so that labels as an idea can
be separated from matrices.

The real problem is that matrix operations should be by label rather than
index.  If we can somehow make the indexes universal, then we should be OK.
One way to do that is to broaden the idea of conformability in matrices to
require that they were created using a common label dictionary for the
conformable indexes.

Another issue is that some matrices are essentially unbounded (or we do not
know the bounds).  These matrices must by nature be sparse.  This comes up
in situations such as a document x term matrix where we do not know how many
terms there may be, nor how many documents.

> What are people's inclinations on this?
>

Try an experiment?

-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Sean Owen <sr...@gmail.com>.

I looked at the APIs just now and thought they were pretty good --
yes, RealVector seems a little overdone perhaps. It does seem more
complete and planned than the serviceable but ad-hoc APIs developed in
Mahout to date. But implementations are provided, no? It does make the
job of writing a Writable wrapper implementation a little harder but
hey in my IDE it's still one click to create the skeleton.

Is the idea of 'labels' a key feature it would be missing? again a
wrapper strategy can probably take care of that, not clear how hard it
is.

My guess is it can be made to do whatever we need, since we don't need
anything terribly fancy. The performance characteristics could be an
issue. I would suspect that if there are such issues we can either
work to contribute improvements, or produce additional implementations
with desired characteristics.

It'd sure be nice to not have our own library. That said it's a fair
bit of work to port, just to get back to where we are. I wouldn't mind
looking at that for 0.3 (since I need to get my hands dirty with
matrices for parallelizable recommendations, and had been waiting a
bit to settle the question of what the implementation we use would
be.)

What are people's inclinations on this?

Sean

On Thu, Oct 1, 2009 at 2:54 AM, Jake Mannix <ja...@gmail.com> wrote:
> So what's the status on integration of commons-math-2.0 in Mahout?
>
> Do we need that stuff?  Some of their apis are pretty ugly (look at the
> number of methods you need to implement to qualify to be a "RealVector"),
> but piggybacking on some of their functionality would be pretty useful
> (especially stats/regression/distributions as well as the small matrix
> decomposition stuff).

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Fri, Oct 2, 2009 at 1:54 PM, Steve Lianoglou <
mailinglist.honeypot@gmail.com> wrote:

>
> Someone picked up colt and is making a parallelized colt .. it didn't hit
> my radar until recently, so I thought I'd pop it on the collective's radar:
>
> http://sites.google.com/site/piotrwendykier/software/parallelcolt
>
> It also looks as if the developer is interfacing w/ native atlas/lapack.
> See the "Netlib-java" section there.
>

But it also looks like they've still got the hep.aida.* packages, which
means that nobody in the Apache world can use them...

  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Steve Lianoglou <ma...@gmail.com>.

Hi,

On Oct 2, 2009, at 4:33 PM, Ted Dunning wrote:

> On Fri, Oct 2, 2009 at 12:29 PM, Jake Mannix <ja...@gmail.com> wrote:
>
>> Is the desire to just get good performant linear algebra routines (avoid
>> reinventing wheels), or to actually be able to interoperate with Commons
>> Math?
>>
>
> My desire is primarily the former, but I would like it if we could, as a
> project, contribute to the progress of commons Math.
>
> Regarding the arcane API of c-math, they have shown a very strong
> receptivity to simplifying things and improving the structure.
>
> Regarding Colt, I don't think that Colt is even all that close to the state
> of the art for linear algebra performance available in Java.  It used to be
> the pinnacle, but various other libraries have substantially eclipsed it.
> Other libraries have the added benefit of being able to make use of Atlas or
> other platform specific implementations where available.  This can give
> outrageous performance.

Someone picked up colt and is making a parallelized colt .. it didn't hit
my radar until recently, so I thought I'd pop it on the collective's radar:

http://sites.google.com/site/piotrwendykier/software/parallelcolt

It also looks as if the developer is interfacing w/ native atlas/lapack.
See the "Netlib-java" section there.

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

On Fri, Oct 2, 2009 at 12:29 PM, Jake Mannix <ja...@gmail.com> wrote:

>  Is the desire to just get good performant linear algebra routines (avoid
> reinventing wheels), or to actually be able to interoperate with Commons
> Math?
>

My desire is primarily the former, but I would like it if we could, as a
project, contribute to the progress of commons Math.

Regarding the arcane API of c-math, they have shown a very strong
receptivity to simplifying things and improving the structure.

Regarding Colt, I don't think that Colt is even all that close to the state
of the art for linear algebra performance available in Java.  It used to be
the pinnacle, but various other libraries have substantially eclipsed it.
Other libraries have the added benefit of being able to make use of Atlas or
other platform specific implementations where available.  This can give
outrageous performance.

-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

  Is the desire to just get good performant linear algebra routines (avoid
reinventing wheels), or to actually be able to interoperate with Commons Math?

  If the latter, then fine, we should start migrating to that, and implement
Writable subclasses of their classes where we need to, and implement their
interfaces (ugly as they are: no iterators [nonZero or otherwise], a gazillion
mapXXXtoSelf methods which are totally useless) when also necessary.

  If the former, then I just spent a little while digging through Colt's
source code, and it's actually really easy to rip out all of the unacceptably
licensed hep.aida.* references, and the rest is attribution licensed and we
could just take it, and they have way better interfaces for doing linear
algebra.

  If you guys want to see a patch of what Colt looks like after having ripped
out the hep.aida.* stuff (really the only parts that are used are some utility
stats classes we could easily reimplement ourselves, but as it is the code
that was removed was mostly in benchmarking code), I can post that to a JIRA.
Of course, it trades a nice api for having to go back to the lovely naming
convention of DoubleMatrix1D etc. :)

Just my $0.02, from digging into Commons Math (and Colt, and a few others of
the discussed libraries).

  -jake

On Fri, Oct 2, 2009 at 11:48 AM, Sean Owen <sr...@gmail.com> wrote:

> Agree. Does this help bridge the gap to Commons Math? I remember that HDFS
> binding was an issue, but if that is separated... suppose I should have a
> look at its APIs. Naturally I think we would strongly prefer to reuse a
> matrix library than write another unless there is a need that justifies it.
> Just want to get a fix on what the gap is, whether we or Math can bridge
> it.
> Guess I would be surprised if it is not possible.
>
> On Oct 2, 2009 7:10 PM, "Ted Dunning" <te...@gmail.com> wrote:
>
> I think separating the concerns is a good thing.  It allows us to use
> better
> implementations as well as handle cases where being a writable makes no
> sense.
>
> (just like Jake said)
>
> On Fri, Oct 2, 2009 at 10:50 AM, Jake Mannix <ja...@gmail.com>
> wrote:
> > On Thu, Oct 1, 2009 ...
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Sean Owen <sr...@gmail.com>.

Agree. Does this help bridge the gap to Commons Math? I remember that HDFS
binding was an issue, but if that is separated... suppose I should have a
look at its APIs. Naturally I think we would strongly prefer to reuse a
matrix library than write another unless there is a need that justifies it.
Just want to get a fix on what the gap is, whether we or Math can bridge it.
Guess I would be surprised if it is not possible.

On Oct 2, 2009 7:10 PM, "Ted Dunning" <te...@gmail.com> wrote:

I think separating the concerns is a good thing.  It allows us to use better
implementations as well as handle cases where being a writable makes no
sense.

(just like Jake said)

On Fri, Oct 2, 2009 at 10:50 AM, Jake Mannix <ja...@gmail.com> wrote:
> On Thu, Oct 1, 2009 ...
--
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

I think separating the concerns is a good thing.  It allows us to use better
implementations as well as handle cases where being a writable makes no
sense.

(just like Jake said)

On Fri, Oct 2, 2009 at 10:50 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Thu, Oct 1, 2009 at 3:42 AM, Grant Ingersoll <gs...@apache.org>
> wrote:
>
> >
> > On Oct 1, 2009, at 12:17 AM, Jake Mannix wrote:
> >
> >> So why do we really need vectors to be Writable?  I see the appeal, it's
> >> nice and makes the code nicely integrated, but the way I ended up going,
> >> so that you could use decomposer either with or without Hadoop was to
> >> use a decorator - just have VectorWritable be an implementation of Vector
> >> which encapsulates the Writable methods, and delegates to a
> >> Hadoop-agnostic Vector member instance.
> >>
> >> This way all the algorithms which use the Vectors don't need to care about
> >> Hadoop unless they really do.
> >>
> >
> > That sounds reasonable, just going to take a little refactoring.
> >
>
> So what do the rest of you think about doing this?  Do we want to do some
> refactoring (post 0.2, naturally) which separates the writableness from the
> Matrix/Vector-ness?
>
> Or are we fine with all of our linear algebraic classes being tied to HDFS
> at an interface level?  (Even Matrices, which will probably soon need to
> be adapted to the idea that often they won't live on any single machine,
> and thus you'll never be write()'ing them out all at once, and so won't
> always even make sense to have them be Writable).
>
>  -jake
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Grant Ingersoll <gs...@apache.org>.

+1

On Oct 2, 2009, at 1:50 PM, Jake Mannix wrote:

> On Thu, Oct 1, 2009 at 3:42 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>>
>> On Oct 1, 2009, at 12:17 AM, Jake Mannix wrote:
>>
>>> So why do we really need vectors to be Writable?  I see the appeal, it's
>>> nice and makes the code nicely integrated, but the way I ended up going,
>>> so that you could use decomposer either with or without Hadoop was to
>>> use a decorator - just have VectorWritable be an implementation of Vector
>>> which encapsulates the Writable methods, and delegates to a
>>> Hadoop-agnostic Vector member instance.
>>>
>>> This way all the algorithms which use the Vectors don't need to care about
>>> Hadoop unless they really do.
>>>
>>
>> That sounds reasonable, just going to take a little refactoring.
>>
>
> So what do the rest of you think about doing this?  Do we want to do some
> refactoring (post 0.2, naturally) which separates the writableness from the
> Matrix/Vector-ness?
>
> Or are we fine with all of our linear algebraic classes being tied to HDFS
> at an interface level?  (Even Matrices, which will probably soon need to
> be adapted to the idea that often they won't live on any single machine,
> and thus you'll never be write()'ing them out all at once, and so won't
> always even make sense to have them be Writable).
>
>  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Thu, Oct 1, 2009 at 3:42 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Oct 1, 2009, at 12:17 AM, Jake Mannix wrote:
>
>> So why do we really need vectors to be Writable?  I see the appeal, it's
>> nice and makes the code nicely integrated, but the way I ended up going,
>> so that you could use decomposer either with or without Hadoop was to
>> use a decorator - just have VectorWritable be an implementation of Vector
>> which encapsulates the Writable methods, and delegates to a
>> Hadoop-agnostic Vector member instance.
>>
>> This way all the algorithms which use the Vectors don't need to care about
>> Hadoop unless they really do.
>>
>
> That sounds reasonable, just going to take a little refactoring.
>

So what do the rest of you think about doing this?  Do we want to do some
refactoring (post 0.2, naturally) which separates the writableness from the
Matrix/Vector-ness?

Or are we fine with all of our linear algebraic classes being tied to HDFS
at an interface level?  (Even Matrices, which will probably soon need to
be adapted to the idea that often they won't live on any single machine,
so you'll never be write()'ing them out all at once, and it won't always
even make sense for them to be Writable.)

  -jake

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Grant Ingersoll <gs...@apache.org>.

On Oct 1, 2009, at 12:17 AM, Jake Mannix wrote:

> On Wed, Sep 30, 2009 at 8:26 PM, Ted Dunning <te...@gmail.com>  
> wrote:
>
> So why do we really need vectors to be Writable?  I see the appeal, it's
> nice and makes the code nicely integrated, but the way I ended up going,
> so that you could use decomposer either with or without Hadoop was to
> use a decorator - just have VectorWritable be an implementation of Vector
> which encapsulates the Writable methods, and delegates to a
> Hadoop-agnostic Vector member instance.
>
> This way all the algorithms which use the Vectors don't need to care about
> Hadoop unless they really do.
>


That sounds reasonable, just going to take a little refactoring.

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

The MTJ committers were willing to relicense under Apache and donate the
entire package.

On Wed, Sep 30, 2009 at 9:17 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Wed, Sep 30, 2009 at 8:26 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > No motion.  I was pushing that integration because it looked like MTJ was
> > integrating with them.  That would give some pretty high performance
> linear
> > algebra to commons-math.
> >
>
> MTJ is LGPL, how was that ever going anywhere?

-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

On Wed, Sep 30, 2009 at 8:26 PM, Ted Dunning <te...@gmail.com> wrote:

> No motion.  I was pushing that integration because it looked like MTJ was
> integrating with them.  That would give some pretty high performance linear
> algebra to commons-math.
>

MTJ is LGPL, how was that ever going anywhere?

> Luc has been doing some very nice work on small matrix decompositions
> lately.  As you say, however, the class structure is kinda over-done.
>

Yeah, I ended up using (in decomposer) commons-math-2.0's small matrix
eigen decomposition for the final step in Lanczos, and as a check on the
Hebbian techniques (to verify accuracy when the dimension is low enough
to do both approaches and compare).

> The other issue is that we need vectors to be Writables which is not
> something they are reasonably going to do.
>

So why do we really need vectors to be Writable?  I see the appeal: it's
nice and makes the code nicely integrated.  But the way I ended up going,
so that you could use decomposer either with or without Hadoop, was to
use a decorator: just have VectorWritable be an implementation of Vector
which encapsulates the Writable methods and delegates to a Hadoop-agnostic
Vector member instance.

This way all the algorithms which use the Vectors don't need to care about
Hadoop unless they really do.
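
A minimal sketch of the decorator idea described above, not the actual
decomposer or Mahout code.  Assumptions: the Vector interface here is a
hypothetical, cut-down API; Writable is a local stand-in declaring the same
two methods as org.apache.hadoop.io.Writable so the example compiles without
Hadoop; DenseVector and the on-wire layout (size followed by the values) are
invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in with the same two methods as org.apache.hadoop.io.Writable,
// declared here only so the sketch compiles without Hadoop on the classpath.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical, deliberately tiny Hadoop-agnostic vector API.
interface Vector {
    int size();
    double get(int index);
    double dot(Vector other);
}

// Plain in-memory implementation; it knows nothing about serialization.
class DenseVector implements Vector {
    private final double[] values;

    DenseVector(double[] values) { this.values = values.clone(); }

    public int size() { return values.length; }
    public double get(int index) { return values[index]; }

    public double dot(Vector other) {
        if (other.size() != values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * other.get(i);
        }
        return sum;
    }
}

// Decorator: still a Vector, but adds the Writable methods around a delegate.
class VectorWritable implements Vector, Writable {
    private Vector delegate;

    VectorWritable(Vector delegate) { this.delegate = delegate; }

    // Vector methods simply forward to the wrapped instance.
    public int size() { return delegate.size(); }
    public double get(int index) { return delegate.get(index); }
    public double dot(Vector other) { return delegate.dot(other); }

    // Assumed layout: element count, then each element as a double.
    public void write(DataOutput out) throws IOException {
        out.writeInt(delegate.size());
        for (int i = 0; i < delegate.size(); i++) {
            out.writeDouble(delegate.get(i));
        }
    }

    // Rebuilds the delegate as a DenseVector from the layout used by write().
    public void readFields(DataInput in) throws IOException {
        double[] values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = in.readDouble();
        }
        delegate = new DenseVector(values);
    }
}

class VectorWritableDemo {
    public static void main(String[] args) throws IOException {
        VectorWritable original = new VectorWritable(new DenseVector(new double[] {1.0, 2.0, 3.0}));
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        VectorWritable copy = new VectorWritable(new DenseVector(new double[0]));
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.dot(original));
    }
}
```

In this arrangement only VectorWritable mentions the serialization
interface; DenseVector, and any algorithm written against Vector, would
compile and run without Hadoop.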


> My question is whether we could get math's decompositions by implementing
> their RealVector interface, or by extending one of their vector implements
> as a Writable.  Only the first option seems to have a chance to be easy
> (guessing).
>

Implementing RealVector is uglyugly.   Extending to implement Writable can
be practically done by my IDE itself, it looks like.

But do we want to do either of these?  They actually don't even have any
equivalent of OrderedIntDoublePair for fast iteration and slow random access
(which is the only sparse implementation I've found I need - I rarely have
much use for random access in a sparse vector).
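
For readers unfamiliar with that trade-off, here is a rough sketch of a
sparse vector stored as index-ordered (int, double) pairs.  This is an
illustration only: the class name OrderedSparseVector and its methods are
hypothetical, and the thread does not show how OrderedIntDoublePair is
actually implemented.  Iteration and dot products walk the arrays linearly,
while random access has to search for the index.

```java
import java.util.Arrays;

// Hypothetical sparse vector: parallel arrays of indices (strictly
// increasing) and the corresponding non-zero values.
final class OrderedSparseVector {
    private final int[] indices;
    private final double[] values;

    OrderedSparseVector(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values differ in length");
        }
        for (int i = 1; i < indices.length; i++) {
            if (indices[i - 1] >= indices[i]) {
                throw new IllegalArgumentException("indices must be strictly increasing");
            }
        }
        this.indices = indices.clone();
        this.values = values.clone();
    }

    int nonZeros() { return indices.length; }

    // Random access is the slow path: a binary search over the stored indices.
    double get(int index) {
        int pos = Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // Fast path: one linear merge over both ordered index lists.
    double dot(OrderedSparseVector other) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i] * other.values[j];
                i++;
                j++;
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}

class OrderedSparseVectorDemo {
    public static void main(String[] args) {
        OrderedSparseVector a = new OrderedSparseVector(new int[] {1, 4, 7}, new double[] {2.0, 3.0, 5.0});
        OrderedSparseVector b = new OrderedSparseVector(new int[] {4, 7, 9}, new double[] {10.0, 1.0, 8.0});
        System.out.println(a.dot(b));
    }
}
```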

Is there anything else other than small-scale linear algebra that we could
use from commons-math?  If that's it, then it's probably not worth it - we
can steal an implementation of whatever we need for auxiliary work with the
Big Data matrices if we need to, right?  I hear they're Apache licensed over
there. ;p

  -jake


Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

No motion.  I was pushing that integration because it looked like MTJ was
integrating with them.  That would give some pretty high performance linear
algebra to commons-math.

That hasn't gone anywhere lately as far as I know.

The only other integration point is that every time we have needed something
like a new sampler or distribution that I implemented, I gave it to math
(all of twice, that is).

Luc has been doing some very nice work on small matrix decompositions
lately.  As you say, however, the class structure is kinda over-done.

The other issue is that we need vectors to be Writables which is not
something they are reasonably going to do.

My question is whether we could get math's decompositions by implementing
their RealVector interface, or by extending one of their vector
implementations as a Writable.  Only the first option seems to have a chance
to be easy (guessing).

On Wed, Sep 30, 2009 at 6:54 PM, Jake Mannix <ja...@gmail.com> wrote:

> So what's the status on integration of commons-math-2.0 in Mahout?
>
> Do we need that stuff?  Some of their apis are pretty ugly (look at the
> number
> of methods you need to implement to qualify to be a "RealVector"), but
> piggybacking on some of their functionality would be pretty useful
> (especially
> stats/regression/distributions as well as the small matrix decomposition
> stuff).
>
>  -jake
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Jake Mannix <ja...@gmail.com>.

So what's the status on integration of commons-math-2.0 in Mahout?

Do we need that stuff?  Some of their APIs are pretty ugly (look at the
number of methods you need to implement to qualify to be a "RealVector"),
but piggybacking on some of their functionality would be pretty useful
(especially stats/regression/distributions as well as the small matrix
decomposition stuff).

  -jake



Fwd: [ANNOUNCEMENT] Apache Commons Math 2.0 Released

Posted by Ted Dunning <te...@gmail.com>.

This is the key step that was a prerequisite to integrating MTJ into
commons math, and thereby making really good linear algebra available for us
in Mahout.

---------- Forwarded message ----------
From: Phil Steitz <ps...@apache.org>
Date: Fri, Aug 7, 2009 at 5:08 PM
Subject: [ANNOUNCEMENT] Apache Commons Math 2.0 Released
To: announcements@jakarta.apache.org, announce@apache.org, Commons
Developers List <de...@commons.apache.org>, Commons Users List <
user@commons.apache.org>
Cc: private@commons.apache.org

The Apache Commons team is pleased to announce the release of version 2.0 of
Commons Math.  Commons Math is a library of lightweight, self-contained
mathematics and statistics components addressing the most common problems
not available in the Java programming language or Commons Lang.

Version 2.0 is a major release, including bug fixes, new features and
enhancements to existing features.  Most notable among the new features are
matrix decomposition algorithms, sparse matrices and vectors, genetic
algorithms, new optimization algorithms, curve fitting algorithms,  state
derivatives in ODE step handlers,  new multistep integrators,  multiple
regression, correlation, rank transformations and a Mersenne Twister
pseudo-random number generator.

This release is NOT source and binary compatible with earlier versions of
Commons Math.  Starting with version 2.0 of the library, the minimal
version of the Java platform required to compile and use commons-math is
Java 5.

Source and binary distributions are available for download from the Apache
Commons Math download site:
http://commons.apache.org/downloads/download_math.cgi

Please verify signatures using the KEYS file available at the above location
when downloading the release.

Maven users, please note that the Maven repository groupId for Commons Math
has changed in version 2.0 to "org.apache.commons".  The artifactId remains
"commons-math".

For more information on Apache Commons Math, visit the Math home page:
http://commons.apache.org/math/

Feedback, suggestions for improvement or bug reports are welcome via the
"Mailing Lists" and "Issue Tracking" links here:
http://commons.apache.org/math/project-info.html

Phil Steitz
- On behalf of the Apache Commons community


-- 
Ted Dunning, CTO
DeepDyve