You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Jason Rennie <jr...@gmail.com> on 2008/01/30 17:43:35 UTC

sparse matrix library

My experience w/ implementing/developing ML algorithms is that a good matrix
library is essential.  In grad school, I didn't have to choose one, I just
used Matlab.  While building the recs engine for StyleFeeder, I looked at
the available options for java, including MTJ and Colt, and found nothing to
fit my needs.  Dense matrix support looked solid, but I didn't find a good
sparse matrix library.  So, I wrote my own.  I'm wondering: did I miss
something?  If not, I'm guessing it might be a useful contribution to
Mahout.  Thoughts?  It's a bit limited/specialized and would require work
before it's sufficiently general-purpose (even as a serial version), but it
might be a place to start.

Jason

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
http://people.csail.mit.edu/jrennie/

Re: sparse matrix library

Posted by gopi <go...@gmail.com>.
checkout the source code from:
svn checkout http://svn.apache.org/repos/asf/lucene/mahout/trunk
Lucene/mahout/trunk

Chaitanya

On Feb 2, 2008 10:01 PM, Jason Rennie <jr...@gmail.com> wrote:

> How do you get the code?  SVN is asking for username/password...
>
> What is the constant hit you take for using HBase vs. an in-memory sparse
> matrix library?
>
> Jason
>
> On Thu, Jan 31, 2008 at 2:39 AM, Lukas Vlcek <lu...@gmail.com>
> wrote:
>
> > BTW: Have you seen http://wiki.apache.org/hadoop/Matrix yet?
> >
> > On Jan 30, 2008 11:23 PM, Jason Rennie <jr...@gmail.com> wrote:
> >
> > > On Jan 30, 2008 2:49 PM, Ted Dunning <td...@veoh.com> wrote:
> > >
> > > > For efficient row and column access, I built a hybrid that has both
> > CSR
> > > > and
> > > > CSC represenations under the hood.  I also don't care much about
> > > mutation.
> > >
> > >
> > > Cool :)  OS?  Or, something you built for VEOH?  Probably best to move
> > > off-list, but what are you responsible for there?
> > >
> > >
> > > >   A * v
> > > >   v * A
> > > >   A * A
> > > >   A' * A
> > > >   rowSum(A)
> > > >   columnSum(A)
> > > >   sum(A)
> > >
> > >
> > > No matrix/vector norms?  'course, not necessary, but probably worth
> > > throwing
> > > in :)
> > >
> > >  forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
> > > >   reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
> > > >   many kinds of views
> > >
> > >
> > > Otherwise sounds great.
> > >
> > > The way that Colt does (most of) this is to use a higher-order API.
> >  Most
> > > > users I have talked to were completely confused by this.  I think
> the
> > > > right
> > > > answer is to require a small set of primitives for each
> implementation
> > > and
> > > > inherit nice API much like AbstractMap provides lots of sugar over a
> > > > spartan
> > > > Map implementation.
> > >
> > >
> > > Not sure what you mean by "higher-order"... but agreed that Colt leave
> > > something to be desired.  Are there any interfaces or abstract classes
> > > written up?
> > >
> > >
> > > > I also think it would help us to have a nice syntax for some
> > algorithms.
> > > >  I
> > > > have lately been working with groovy and almost have some support
> for
> > > very
> > > > simple map-reduce programming.  Since Groovy supports infix
> > overloading,
> > > > that would allow us to have a very simple language for writing
> > matrixlike
> > > > code that inter-operates very well with the Java side.  I will write
> > > more
> > > > as
> > > > that becomes available.
> > >
> > >
> > > Didn't know about groovy.  Looks interesting.  Thanks for the pointer.
> > >  Will
> > > be interested to hear where you go with it.
> > >
> > > Jason
> > >
> >
> >
> >
> > --
> > http://blog.lukas-vlcek.com/
> >
>
>
>
> --
> Jason Rennie
> Head of Machine Learning Technologies, StyleFeeder
> http://www.stylefeeder.com/
> Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
>

Re: sparse matrix library

Posted by Jason Rennie <jr...@gmail.com>.
How do you get the code?  SVN is asking for username/password...

What is the constant hit you take for using HBase vs. an in-memory sparse
matrix library?

Jason

On Thu, Jan 31, 2008 at 2:39 AM, Lukas Vlcek <lu...@gmail.com> wrote:

> BTW: Have you seen http://wiki.apache.org/hadoop/Matrix yet?
>
> On Jan 30, 2008 11:23 PM, Jason Rennie <jr...@gmail.com> wrote:
>
> > On Jan 30, 2008 2:49 PM, Ted Dunning <td...@veoh.com> wrote:
> >
> > > For efficient row and column access, I built a hybrid that has both
> CSR
> > > and
> > > CSC represenations under the hood.  I also don't care much about
> > mutation.
> >
> >
> > Cool :)  OS?  Or, something you built for VEOH?  Probably best to move
> > off-list, but what are you responsible for there?
> >
> >
> > >   A * v
> > >   v * A
> > >   A * A
> > >   A' * A
> > >   rowSum(A)
> > >   columnSum(A)
> > >   sum(A)
> >
> >
> > No matrix/vector norms?  'course, not necessary, but probably worth
> > throwing
> > in :)
> >
> >  forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
> > >   reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
> > >   many kinds of views
> >
> >
> > Otherwise sounds great.
> >
> > The way that Colt does (most of) this is to use a higher-order API.
>  Most
> > > users I have talked to were completely confused by this.  I think the
> > > right
> > > answer is to require a small set of primitives for each implementation
> > and
> > > inherit nice API much like AbstractMap provides lots of sugar over a
> > > spartan
> > > Map implementation.
> >
> >
> > Not sure what you mean by "higher-order"... but agreed that Colt leave
> > something to be desired.  Are there any interfaces or abstract classes
> > written up?
> >
> >
> > > I also think it would help us to have a nice syntax for some
> algorithms.
> > >  I
> > > have lately been working with groovy and almost have some support for
> > very
> > > simple map-reduce programming.  Since Groovy supports infix
> overloading,
> > > that would allow us to have a very simple language for writing
> matrixlike
> > > code that inter-operates very well with the Java side.  I will write
> > more
> > > as
> > > that becomes available.
> >
> >
> > Didn't know about groovy.  Looks interesting.  Thanks for the pointer.
> >  Will
> > be interested to hear where you go with it.
> >
> > Jason
> >
>
>
>
> --
> http://blog.lukas-vlcek.com/
>



-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: sparse matrix library

Posted by Grant Ingersoll <gs...@apache.org>.
This is a good idea.  Many of these things will be needed in Hadoop  
for others, so ideally, we can try to help along some of these things  
in Hadoop by testing them out, patching, etc.  This will help keep our  
deps. down and make Hadoop better, as well.

-Grant

On Jan 31, 2008, at 2:39 AM, Lukas Vlcek wrote:

> BTW: Have you seen http://wiki.apache.org/hadoop/Matrix yet?
>
> On Jan 30, 2008 11:23 PM, Jason Rennie <jr...@gmail.com> wrote:
>
>> On Jan 30, 2008 2:49 PM, Ted Dunning <td...@veoh.com> wrote:
>>
>>> For efficient row and column access, I built a hybrid that has  
>>> both CSR
>>> and
>>> CSC represenations under the hood.  I also don't care much about
>> mutation.
>>
>>
>> Cool :)  OS?  Or, something you built for VEOH?  Probably best to  
>> move
>> off-list, but what are you responsible for there?
>>
>>
>>>  A * v
>>>  v * A
>>>  A * A
>>>  A' * A
>>>  rowSum(A)
>>>  columnSum(A)
>>>  sum(A)
>>
>>
>> No matrix/vector norms?  'course, not necessary, but probably worth
>> throwing
>> in :)
>>
>> forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
>>>  reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
>>>  many kinds of views
>>
>>
>> Otherwise sounds great.
>>
>> The way that Colt does (most of) this is to use a higher-order  
>> API.  Most
>>> users I have talked to were completely confused by this.  I think  
>>> the
>>> right
>>> answer is to require a small set of primitives for each  
>>> implementation
>> and
>>> inherit nice API much like AbstractMap provides lots of sugar over a
>>> spartan
>>> Map implementation.
>>
>>
>> Not sure what you mean by "higher-order"... but agreed that Colt  
>> leave
>> something to be desired.  Are there any interfaces or abstract  
>> classes
>> written up?
>>
>>
>>> I also think it would help us to have a nice syntax for some  
>>> algorithms.
>>> I
>>> have lately been working with groovy and almost have some support  
>>> for
>> very
>>> simple map-reduce programming.  Since Groovy supports infix  
>>> overloading,
>>> that would allow us to have a very simple language for writing  
>>> matrixlike
>>> code that inter-operates very well with the Java side.  I will write
>> more
>>> as
>>> that becomes available.
>>
>>
>> Didn't know about groovy.  Looks interesting.  Thanks for the  
>> pointer.
>> Will
>> be interested to hear where you go with it.
>>
>> Jason
>>
>
>
>
> -- 
> http://blog.lukas-vlcek.com/

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: sparse matrix library

Posted by Lukas Vlcek <lu...@gmail.com>.
BTW: Have you seen http://wiki.apache.org/hadoop/Matrix yet?

On Jan 30, 2008 11:23 PM, Jason Rennie <jr...@gmail.com> wrote:

> On Jan 30, 2008 2:49 PM, Ted Dunning <td...@veoh.com> wrote:
>
> > For efficient row and column access, I built a hybrid that has both CSR
> > and
> > CSC represenations under the hood.  I also don't care much about
> mutation.
>
>
> Cool :)  OS?  Or, something you built for VEOH?  Probably best to move
> off-list, but what are you responsible for there?
>
>
> >   A * v
> >   v * A
> >   A * A
> >   A' * A
> >   rowSum(A)
> >   columnSum(A)
> >   sum(A)
>
>
> No matrix/vector norms?  'course, not necessary, but probably worth
> throwing
> in :)
>
>  forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
> >   reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
> >   many kinds of views
>
>
> Otherwise sounds great.
>
> The way that Colt does (most of) this is to use a higher-order API.  Most
> > users I have talked to were completely confused by this.  I think the
> > right
> > answer is to require a small set of primitives for each implementation
> and
> > inherit nice API much like AbstractMap provides lots of sugar over a
> > spartan
> > Map implementation.
>
>
> Not sure what you mean by "higher-order"... but agreed that Colt leave
> something to be desired.  Are there any interfaces or abstract classes
> written up?
>
>
> > I also think it would help us to have a nice syntax for some algorithms.
> >  I
> > have lately been working with groovy and almost have some support for
> very
> > simple map-reduce programming.  Since Groovy supports infix overloading,
> > that would allow us to have a very simple language for writing matrixlike
> > code that inter-operates very well with the Java side.  I will write
> more
> > as
> > that becomes available.
>
>
> Didn't know about groovy.  Looks interesting.  Thanks for the pointer.
>  Will
> be interested to hear where you go with it.
>
> Jason
>



-- 
http://blog.lukas-vlcek.com/

Re: sparse matrix library

Posted by Jason Rennie <jr...@gmail.com>.
On Jan 30, 2008 2:49 PM, Ted Dunning <td...@veoh.com> wrote:

> For efficient row and column access, I built a hybrid that has both CSR
> and
> CSC represenations under the hood.  I also don't care much about mutation.


Cool :)  OS?  Or, something you built for VEOH?  Probably best to move
off-list, but what are you responsible for there?


>   A * v
>   v * A
>   A * A
>   A' * A
>   rowSum(A)
>   columnSum(A)
>   sum(A)


No matrix/vector norms?  'course, not necessary, but probably worth throwing
in :)

  forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
>   reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
>   many kinds of views


Otherwise sounds great.

The way that Colt does (most of) this is to use a higher-order API.  Most
> users I have talked to were completely confused by this.  I think the
> right
> answer is to require a small set of primitives for each implementation and
> inherit nice API much like AbstractMap provides lots of sugar over a
> spartan
> Map implementation.


Not sure what you mean by "higher-order"... but agreed that Colt leave
something to be desired.  Are there any interfaces or abstract classes
written up?


> I also think it would help us to have a nice syntax for some algorithms.
>  I
> have lately been working with groovy and almost have some support for very
> simple map-reduce programming.  Since Groovy supports infix overloading,
> that would allow us to have a very simple language for writing matrix like
> code that inter-operates very well with the Java side.  I will write more
> as
> that becomes available.


Didn't know about groovy.  Looks interesting.  Thanks for the pointer.  Will
be interested to hear where you go with it.

Jason

Re: sparse matrix library

Posted by Ted Dunning <td...@veoh.com>.

For efficient row and column access, I built a hybrid that has both CSR and
CSC represenations under the hood.  I also don't care much about mutation.

Relative to API, I think we need to support (at least):

   A * v
   v * A
   A * A
   A' * A
   rowSum(A)
   columnSum(A)
   sum(A)
   forEachNonZero, forEachNonZeroRow, forEachNonZeroColumn
   reduceNonZero, reduceNonZeroRow, reduceNonZeroColumn
   many kinds of views
   

The way that Colt does (most of) this is to use a higher-order API.  Most
users I have talked to were completely confused by this.  I think the right
answer is to require a small set of primitives for each implementation and
inherit nice API much like AbstractMap provides lots of sugar over a spartan
Map implementation.

I also think it would help us to have a nice syntax for some algorithms.  I
have lately been working with groovy and almost have some support for very
simple map-reduce programming.  Since Groovy supports infix overloading,
that would allow us to have a very simple language for writing matrix like
code that inter-operates very well with the Java side.  I will write more as
that becomes available.
 


On 1/30/08 11:13 AM, "Jason Rennie" <jr...@gmail.com> wrote:

> On Jan 30, 2008 1:58 PM, Ted Dunning <td...@veoh.com> wrote:
> 
>> Many of my matrices need both fast column AND fast row access.  They are
>> also binary.  Yet another implementation.
>> 
> 
> Do you know a good storage format where both column and row access is
> efficient?  Hadn't found a good one yet, but neither have I searched very
> hard.  Also, do you have thoughts on sparse matrix construction?  I'm
> actually using a separate "builder" class which gathers RCV triples, then
> generates a matrix upon call to a build() function.  I don't modify a matrix
> once built.  Certainly allows for efficient construction.  Though, I'm sure
> we'd want sparse matrices to be mutable...
> 
> Personally, I'd prefer many optimized implementations to few inefficient
> ones. :)  Do you have thoughts on an interface/abstract class?  Seems to me
> we'd want a type parameter (generics) for function argument and return
> types, even though everything would be handled with primitives
> under-the-hood.
> 
> I imagine it won't be difficult to distribute basic matrix ops like
> vector*matrix and matrix*matrix.  Are there specs/interfaces for the
> map/reduce calls?
> 
> Jason


Re: sparse matrix library

Posted by Jason Rennie <jr...@gmail.com>.
On Jan 30, 2008 1:58 PM, Ted Dunning <td...@veoh.com> wrote:

> Many of my matrices need both fast column AND fast row access.  They are
> also binary.  Yet another implementation.
>

Do you know a good storage format where both column and row access is
efficient?  Hadn't found a good one yet, but neither have I searched very
hard.  Also, do you have thoughts on sparse matrix construction?  I'm
actually using a separate "builder" class which gathers RCV triples, then
generates a matrix upon call to a build() function.  I don't modify a matrix
once built.  Certainly allows for efficient construction.  Though, I'm sure
we'd want sparse matrices to be mutable...

Personally, I'd prefer many optimized implementations to few inefficient
ones. :)  Do you have thoughts on an interface/abstract class?  Seems to me
we'd want a type parameter (generics) for function argument and return
types, even though everything would be handled with primitives
under-the-hood.

I imagine it won't be difficult to distribute basic matrix ops like
vector*matrix and matrix*matrix.  Are there specs/interfaces for the
map/reduce calls?

Jason

Re: sparse matrix library

Posted by Ted Dunning <td...@veoh.com>.



On 1/30/08 10:37 AM, "Jason Rennie" <jr...@gmail.com> wrote:

> Btw, not familiar w/ RSR.  Didn't mean CSC, did you?

I did mean CSC.  Forgot that the last letter was row/column choice.

Many of my matrices need both fast column AND fast row access.  They are
also binary.  Yet another implementation.



Re: sparse matrix library

Posted by Jason Rennie <jr...@gmail.com>.
On Jan 30, 2008 12:40 PM, Ted Dunning <td...@veoh.com> wrote:

> I wrote my own package as well based roughly on Colt's high level
> structure.


Has a sparse matrix code-based been chosen for Mahout?  If not, we might as
well start the discussion :)

It turns out that for many data-mining projects you need several flavors of
> sparse matrices ([CSR, RSR] x [binary values, real values, complex values]
> x
> ...).


Yup.  I used CSR/real for my lib.  We'd certainly want one relatively clean
interface/abstract class and many implementations thereof.  Is there an
open, canonical sparse matrix java lib?

Btw, not familiar w/ RSR.  Didn't mean CSC, did you?


> It also helps a lot of have row and column labels a la R.  Many algorithms
> tie back to something other than row and column numbers.


Double yup :)  Certainly has been the case in most of my work (text
classification, collaborative filtering, NLP).

Cheers,

Jason

Re: sparse matrix library

Posted by Ted Dunning <td...@veoh.com>.

I wrote my own package as well based roughly on Colt's high level structure.

It turns out that for many data-mining projects you need several flavors of
sparse matrices ([CSR, RSR] x [binary values, real values, complex values] x
...).

It also helps a lot of have row and column labels a la R.  Many algorithms
tie back to something other than row and column numbers.


On 1/30/08 8:43 AM, "Jason Rennie" <jr...@gmail.com> wrote:

> My experience w/ implementing/developing ML algorithms is that a good matrix
> library is essential.  In grad school, I didn't have to choose one, I just
> used Matlab.  While building the recs engine for StyleFeeder, I looked at
> the available options for java, including MTJ and Colt, and found nothing to
> fit my needs.  Dense matrix support looked solid, but I didn't find a good
> sparse matrix library.  So, I wrote my own.  I'm wondering: did I miss
> something?  If not, I'm guessing it might be a useful contribution to
> Mahout.  Thoughts?  It's a bit limited/specialized and would require work
> before it's sufficiently general-purpose (even as a serial version), but it
> might be a place to start.
> 
> Jason