Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2013/06/19 03:14:24 UTC

Mahout vectors/matrices/solvers on spark

Hello,

So I finally got around to actually doing it.

I want to take Mahout sparse vectors and matrices (DRMs) and rebuild some
solvers using Spark and Bagel/Scala.

I also want to use in-core solvers that run directly on Mahout matrices.

Question #1: which Mahout artifacts are best to import if I don't want to
pull in the Hadoop dependencies? Is there even such a separation in the
code? I know mahout-math seems to try to avoid being Hadoop-specific, but I
am not sure whether that is followed strictly.
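
For illustration, the kind of minimal build I am hoping for, assuming
mahout-math alone carries the in-core vectors/matrices/solvers (sbt
coordinates; the version is whatever is current):

// hoped-for minimal dependency: in-core math only, no Hadoop-facing modules
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.7"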

Question #2: which in-core solvers are available for Mahout matrices? I
know there's SSVD, probably Cholesky; is there something else? In
particular, I need to solve linear systems; I guess Cholesky should be
equipped to do just that?
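
To make that concrete, the kind of call I am after, sketched against the
JAMA-style QRDecomposition in mahout-math (Cholesky should presumably serve
the same purpose for symmetric positive definite systems; treat the exact
method names as something to double-check):

import org.apache.mahout.math.{DenseMatrix, Matrix, QRDecomposition}

// solve A x = b for a small dense system
val a: Matrix = new DenseMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))
val b: Matrix = new DenseMatrix(Array(Array(1.0), Array(2.0)))
val x: Matrix = new QRDecomposition(a).solve(b) // least-squares if overdetermined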

Question #3: why did we try to import the Colt solvers rather than actually
depend on Colt in the first place? Why did we not accept Colt's sparse
matrices and create native ones instead?

Colt seems to have a notion of sparse in-core matrices too and seems like a
well-rounded solution. However, it does not seem to be actively supported,
whereas I know Mahout's in-core matrix support has seen continued
enhancements.

Thanks in advance
-Dmitriy

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
That has occurred to me too. We are not really inferring any aggregations
here. It may turn out that its use is beneficial with bigger volumes and
real I/O, though. Hard to tell. Anyway, I will probably keep both as
options.


On Tue, Jul 9, 2013 at 7:51 AM, Ted Dunning <te...@gmail.com> wrote:

> Also, it is likely that the combiner has little effect.  This means that
> you are essentially using a vector to serialize single elements.
>
> Sent from my iPhone
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
Also, it is likely that the combiner has little effect.  This means that you are essentially using a vector to serialize single elements.

Sent from my iPhone

On Jul 8, 2013, at 23:13, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yes, that's my working hypothesis: serializing and combining
> RandomAccessSparseVectors is slower than elementwise messages.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes, that's my working hypothesis: serializing and combining
RandomAccessSparseVectors is slower than elementwise messages.
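
Roughly, the comparison behind that hypothesis, as a standalone sketch
(illustrative names and sizes, not the actual Bagel code):

import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.JavaConverters._
import org.apache.mahout.math.{RandomAccessSparseVector, VectorWritable}

val v = new RandomAccessSparseVector(1 << 20)
(0 until 10000).foreach(i => v.setQuick(i * 100, i.toDouble))

// combiner path: one VectorWritable per combined partial vector
val vecBytes = new ByteArrayOutputStream()
new VectorWritable(v).write(new DataOutputStream(vecBytes))

// elementwise path: three primitives per non-zero "message"
val elemBytes = new ByteArrayOutputStream()
val out = new DataOutputStream(elemBytes)
v.iterateNonZero().asScala.foreach { e =>
  out.writeInt(0)        // target row
  out.writeInt(e.index)  // column index
  out.writeDouble(e.get) // value
}
// timing round trips of the two paths is the actual test of the hypothesis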


On Mon, Jul 8, 2013 at 11:00 PM, Ted Dunning <te...@gmail.com> wrote:

> It is common for double serialization to creep into the systems as well.
>  My guess however is that the primitive serialization is just much faster
> than the vector serialization.
>
> Sent from my iPhone

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
It is common for double serialization to creep into the systems as well.  My guess however is that the primitive serialization is just much faster than the vector serialization.  

Sent from my iPhone

On Jul 8, 2013, at 22:55, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yes, but it is just a test, and I am trying to interpolate the results I
> see to bigger volumes, sort of, to get some taste of the programming
> model's performance.
>
> I do get CPU-bound behavior and I hit the Spark cache 100% of the time, so
> in theory, since I have no spills and I am not doing sorts, it should be
> fairly fast.
>
> I have two algorithms. One just sends elementwise messages to the vertex
> representing the row they should be in. The other one uses the same set of
> initial messages but also uses Bagel combiners which, the way I understand
> it, combine elements into partial vectors before shipping them off to the
> remote vertex partition. The reasoning apparently is that since elements
> are combined, there is less I/O. Well, perhaps not so much in this case,
> since we are not really doing any sort of information aggregation. On a
> single-node Spark setup I of course don't have actual I/O, so it should
> approach the speed of in-core copy-by-serialization.
>
> What I am seeing is that elementwise messages work almost two times faster
> in CPU-bound behavior than the version with combiners. It would seem the
> culprit is that VectorWritable serialization and then deserialization of
> the vectorized fragments is considerably slower than serialization of the
> elementwise messages containing only primitive types (target row, index,
> value), even though the latter amount to a significantly larger number of
> objects as well as data.
>
> Still, though, I am trying to convince myself that even using combiners
> should be OK compared to the shuffle-and-sort overhead. But I think in
> reality it still looks a bit slower than I expected. Well, I guess I
> should not be lazy and should benchmark it against the Mahout MR-based
> transpose as well as Spark's version of RDD shuffle-and-sort.
>
> Anyway, map-only tasks on Spark distributed matrices are lightning fast,
> but Bagel serialize/deserialize scatter/gather seems to be much slower
> than just map-only processing. Perhaps I am doing it wrong somehow.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes, but it is just a test, and I am trying to interpolate the results I
see to bigger volumes, sort of, to get some taste of the programming
model's performance.

I do get CPU-bound behavior and I hit the Spark cache 100% of the time, so
in theory, since I have no spills and I am not doing sorts, it should be
fairly fast.

I have two algorithms. One just sends elementwise messages to the vertex
representing the row they should be in. The other one uses the same set of
initial messages but also uses Bagel combiners which, the way I understand
it, combine elements into partial vectors before shipping them off to the
remote vertex partition. The reasoning apparently is that since elements
are combined, there is less I/O. Well, perhaps not so much in this case,
since we are not really doing any sort of information aggregation. On a
single-node Spark setup I of course don't have actual I/O, so it should
approach the speed of in-core copy-by-serialization.

What I am seeing is that elementwise messages work almost two times faster
in CPU-bound behavior than the version with combiners. It would seem the
culprit is that VectorWritable serialization and then deserialization of
the vectorized fragments is considerably slower than serialization of the
elementwise messages containing only primitive types (target row, index,
value), even though the latter amount to a significantly larger number of
objects as well as data.

Still, though, I am trying to convince myself that even using combiners
should be OK compared to the shuffle-and-sort overhead. But I think in
reality it still looks a bit slower than I expected. Well, I guess I
should not be lazy and should benchmark it against the Mahout MR-based
transpose as well as Spark's version of RDD shuffle-and-sort.

Anyway, map-only tasks on Spark distributed matrices are lightning fast,
but Bagel serialize/deserialize scatter/gather seems to be much slower than
just map-only processing. Perhaps I am doing it wrong somehow.
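
For reference, the elementwise strategy boils down to something like this
when expressed as a plain Spark shuffle instead of Bagel (a sketch, with
current Spark package names, not the actual code):

import scala.collection.JavaConverters._
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// rows arrive as (rowIndex, vector); emit one message per non-zero,
// then regroup by column index to form the rows of the transpose
def transpose(drm: RDD[(Int, Vector)], nrow: Int): RDD[(Int, Vector)] =
  drm.flatMap { case (row, v) =>
    v.iterateNonZero().asScala.map(e => (e.index, (row, e.get)))
  }.groupByKey().map { case (col, cells) =>
    val acc = new RandomAccessSparseVector(nrow)
    cells.foreach { case (r, x) => acc.setQuick(r, x) }
    (col, acc: Vector)
  }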


On Mon, Jul 8, 2013 at 10:22 PM, Ted Dunning <te...@gmail.com> wrote:

> Transpose of that small a matrix should happen in memory.
>
> Sent from my iPhone

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
Transpose of that small a matrix should happen in memory. 

Sent from my iPhone

On Jul 8, 2013, at 17:26, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Does anybody know how good (or bad) our performance on matrix transpose
> is? How long would it take to transpose 10M non-zeros with Mahout (if I
> wanted to set up a fully distributed but single-node MR cluster)?
>
> Trying to figure out whether the numbers I see with Bagel-based Mahout
> matrix transposition are any good.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Does anybody know how good (or bad) our performance on matrix transpose is?
How long would it take to transpose 10M non-zeros with Mahout (if I wanted
to set up a fully distributed but single-node MR cluster)?

Trying to figure out whether the numbers I see with Bagel-based Mahout
matrix transposition are any good.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Fri, Jul 5, 2013 at 1:40 AM, Nick Pentreath <ni...@gmail.com> wrote:

> Hi Dmitry
>
>
> You can take a look at using the update "magic" method which is similar
> to apply but handles assignment.
>
>
>
> If you want to keep the := as assignment I think you could do
>
>
> def :=(value: Double) = update ...
>


>
> (I don't have my laptop around at the moment so can't check this works).
>
>
> Also you can take a dig into the Breeze code which has a pretty similar
> DSL to what you're trying to put together, for examples of how David has
> done it - ignoring the code that is related to all the


As far as I can tell from either the documentation or the code, Breeze does
not implement single-element assignment through the := operator (as in
A(5,2) := 3.0), but through the update(row,col) method only. (I studied and
tried Breeze's linalg package quite a bit.)

The update() method in Mahout is actually replaced by the safe set(r,c) and
the unsafe setQuick(r,c), so I don't see a point in implementing yet another
method (although update() is more in line with Scala conventions, much as
set() is with Java's).


> operators and specializing for Int, Double and Float, the core DSL is
> fairly compact I think.
>
>
> N
>
> —
> Sent from Mailbox for iPhone

Re: Mahout vectors/matrices/solvers on spark

Posted by Nick Pentreath <ni...@gmail.com>.
Hi Dmitry


You can take a look at using the update "magic" method, which is similar to apply but handles assignment.



If you want to keep the := as assignment, I think you could do


def :=(value: Double) = update ...


(I don't have my laptop around at the moment so can't check this works).
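
Untested sketch of what I mean, with a plain implicit wrapper (MatrixOps and
m2ops are made-up names):

import scala.language.implicitConversions
import org.apache.mahout.math.Matrix

// apply/update "magic" methods: a(r, c) reads, a(r, c) = v writes;
// neither path needs an element-view object
class MatrixOps(m: Matrix) {
  def apply(row: Int, col: Int): Double = m.getQuick(row, col)
  def update(row: Int, col: Int, v: Double): Unit = m.setQuick(row, col, v)
}
object MatrixOps {
  implicit def m2ops(m: Matrix): MatrixOps = new MatrixOps(m)
}

// with MatrixOps.m2ops in scope: val k = a(5, 5); a(5, 5) = 2.0

The trade-off is that the write is spelled a(5,5) = 2.0 rather than
a(5,5) := 2.0, but there is no view-object allocation on either path.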


Also, you can take a dig into the Breeze code, which has a pretty similar DSL to what you're trying to put together, for examples of how David has done it. Ignoring the code that is related to all the operators and the specialization for Int, Double and Float, the core DSL is fairly compact, I think.


N 

—
Sent from Mailbox for iPhone

On Fri, Jul 5, 2013 at 10:16 AM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> For anyone good at Scala DSLs, the following is the puzzle I can't seem to
> figure out at the moment.
> I mentioned before that I implemented assignment notations for a row or a
> block, e.g. for a row vector: A(5,::) := (1,2,3)
> What it really translates into in this particular case is
> A.viewRow(5).assign(new Vector(new double[]{1,2,3}))
> One thing I can't quite figure out for the in-core matrix DSL is how to
> translate element assignments such as
> A(5,5) := 2.0
> into A.setQuick(5,5,2.0)
> while still having
> val k = A(5,5)
> translating into
> val k = A.getQuick(5,5).
> It could be implemented with an "elementView" analogue, but that would
> require creating an element view object -- which is, first, a big no-no
> (too expensive) for a simple solitary element assignment (or read-out)
> operation, and secondly, reading an element such as A(5,5) * 2.0 would also
> involve view object creation with an implicit conversion to Double, whereas
> it is not even needed at all in this case.
> At this point I have only a very obvious apply(Int,Int): Double =
> m.getQuick(...), i.e. only element reads are supported with that syntax.
> I am guessing Jake, if anyone, might have an idea here... thanks.

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Jul 5, 2013 at 1:25 AM, Jake Mannix <ja...@gmail.com> wrote:

> > At this point I have only a very obvious apply(Int,Int): Double =
> > m.getQuick(...), i.e. only element reads are supported with that syntax.
> >
> > I am guessing Jake, if anyone, might have an idea here... thanks.
> >
>
> Hmmm... right off the top of my head, no.  But my play-DSL is pretty
> free of operators, as that level of Scala cleverness worries me a little,
> when it comes to making sure we're sticking to efficient operations
> under the hood.


I don't know how to make something like that work without some pretty
clever lazy evaluation juju.

Re: Mahout vectors/matrices/solvers on spark

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Jul 5, 2013 at 1:15 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> For anyone good at Scala DSLs, the following is the puzzle I can't seem to
> figure out at the moment.
>
> I mentioned before that I implemented assignment notations for a row or a
> block, e.g. for a row vector: A(5,::) := (1,2,3)
>
> What it really translates into in this particular case is
>
> A.viewRow(5).assign(new Vector(new double[]{1,2,3}))
>
>
> One thing I can't quite figure out for the in-core matrix DSL is how to
> translate element assignments such as
>
> A(5,5) := 2.0
>
> into A.setQuick(5,5,2.0)
>
> while still having
>
> val k = A(5,5)
>
> translating into
> val k = A.getQuick(5,5).
>
>
> It could be implemented with an "elementView" analogue, but that would
> require creating an element view object -- which is, first, a big no-no
> (too expensive) for a simple solitary element assignment (or read-out)
> operation, and secondly, reading an element such as A(5,5) * 2.0 would also
> involve view object creation with an implicit conversion to Double, whereas
> it is not even needed at all in this case.
>
> At this point I have only a very obvious apply(Int,Int): Double =
> m.getQuick(...), i.e. only element reads are supported with that syntax.
>
> I am guessing Jake, if anyone, might have an idea here... thanks.
>

Hmmm... right off the top of my head, no.  But my play-DSL is pretty
free of operators, as that level of Scala cleverness worries me a little,
when it comes to making sure we're sticking to efficient operations
under the hood.





-- 

  -jake

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
For anyone good at Scala DSLs, the following is the puzzle I can't seem to
figure out at the moment.

I mentioned before that I implemented assignment notations for a row or a
block, e.g. for a row vector: A(5,::) := (1,2,3)

What it really translates into in this particular case is

A.viewRow(5).assign(new Vector(new double[]{1,2,3}))


One thing I can't quite figure out for the in-core matrix DSL is how to
translate element assignments such as

A(5,5) := 2.0

into A.setQuick(5,5,2.0)

while still having

val k = A(5,5)

translating into
val k = A.getQuick(5,5).


It could be implemented with an "elementView" analogue, but that would
require creating an element view object -- which is, first, a big no-no
(too expensive) for a simple solitary element assignment (or read-out)
operation, and secondly, reading an element such as A(5,5) * 2.0 would also
involve view object creation with an implicit conversion to Double, whereas
it is not even needed at all in this case.

At this point I have only a very obvious apply(Int,Int): Double =
m.getQuick(...), i.e. only element reads are supported with that syntax.

I am guessing Jake, if anyone, might have an idea here... thanks.



On Thu, Jul 4, 2013 at 11:23 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> FWIW, Givens streaming QR will be a bit more economical on memory than
> Householder's, since it doesn't need the full buffer to compute R and
> doesn't need to keep the entire original matrix around.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
FWIW, Givens streaming QR will be a bit more economical on memory than
Householder's, since it doesn't need the full buffer to compute R and
doesn't need to keep the entire original matrix around.


On Thu, Jul 4, 2013 at 11:15 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Ted,
>
> would it make sense to port parts of the QR in-core row-wise Givens solver
> out of SSVD to work on any Matrix? I know the Givens method is advertised
> as stable, but I am not sure it is the fastest accepted one. I guess they
> are all about the same.
>
> If yes, I will also need to port the UpperTriangular matrix to adhere to
> all the bells and whistles, and also have some sort of RowShift matrix (a
> much simpler analogue of Pivoted matrix for row-rolled buffers). Would that
> make sense?
>
> thanks.
> -D
>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ted,

would it make sense to port parts of the QR in-core row-wise Givens solver
out of SSVD to work on any Matrix? I know the Givens method is advertised as
stable, but I am not sure it is the fastest accepted one. I guess they are
all about the same.

If yes, I will also need to port the UpperTriangular matrix to adhere to all
the bells and whistles, and also have some sort of RowShift matrix (a much
simpler analogue of Pivoted matrix for row-rolled buffers). Would that make
sense?

thanks.
-D
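
(For context, the primitive underneath such a streaming Givens QR, in a
textbook sketch rather than the SSVD code: each rotation touches only two
rows at a time, which is why no full buffer or copy of the original matrix
is needed.)

// choose (c, s) so that -s*a + c*b = 0; applying the rotation to a row
// pair then zeroes the leading element of the bottom row
def givens(a: Double, b: Double): (Double, Double) = {
  val r = math.hypot(a, b)
  if (r == 0.0) (1.0, 0.0) else (a / r, b / r)
}

def rotate(top: Array[Double], bottom: Array[Double], c: Double, s: Double): Unit =
  for (i <- top.indices) {
    val t = c * top(i) + s * bottom(i)
    bottom(i) = -s * top(i) + c * bottom(i)
    top(i) = t
  }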

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
This is pretty exciting!

Thanks Dmitriy.


On Wed, Jul 3, 2013 at 10:12 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Excellent!
>
> so I guess SSVD can be divorced from apache-math solver then.
>
> Actually it is all shaping up surprisingly well, with a Scala DSL for both
> in-core and Mahout DRMs, and Spark solvers. I haven't been able to pay as
> much attention to this as I hoped due to being pretty sick last month. But
> even with very little time, I think the DRM+DSL drivers and the in-core
> Scala DSL for this might earn much easier acceptance for in-core and
> distributed linear algebra in Mahout. Not to mention a memory-cached DRM
> Spark representation is a door to iterative solvers. It's been coming
> together quite nicely, and in-core eigendecomposition makes it a really
> rounded offer. (I was of course after eigen for the Spark version of
> SSVD/PCA.)
>
> I guess I will report back when I get basic Bagel-based primitives working
> for DRMs.

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Excellent!

so I guess SSVD can be divorced from apache-math solver then.

Actually it is all shaping up surprisingly well, with a Scala DSL for both
in-core and Mahout DRMs, and Spark solvers. I haven't been able to pay as
much attention to this as I hoped due to being pretty sick last month. But
even with very little time, I think the DRM+DSL drivers and the in-core
Scala DSL for this might earn much easier acceptance for in-core and
distributed linear algebra in Mahout. Not to mention a memory-cached DRM
Spark representation is a door to iterative solvers. It's been coming
together quite nicely, and in-core eigendecomposition makes it a really
rounded offer. (I was of course after eigen for the Spark version of
SSVD/PCA.)

I guess I will report back when I get basic Bagel-based primitives working
for DRMs.


On Wed, Jul 3, 2013 at 8:53 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Jul 3, 2013 at 6:25 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Wed, Jun 19, 2013 at 12:20 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > >
> > > As far as in-memory solvers, we have:
> > >
> > > 1) LR decomposition (tested and kinda fast)
> > >
> > > 2) Cholesky decomposition (tested)
> > >
> > > 3) SVD (tested)
> > >
> >
> > Ted,
> > so we don't have an eigensolver for the in-core Matrix?
> >
>
> Yes.  We do.
>
> See org.apache.mahout.math.solver.EigenDecomposition
>
> Looking at the history, I am slightly surprised to see that I was the one
> who copied it from JAMA, replacing the Colt version and adding tests.
>
>
> > I understand that the SVD can be obtained with an eigendecomposition but
> > not the other way around, right?
> >
>
> Well, the eigen decomposition of the normal matrix can give the SVD, but
> this is often not recommended due to poor conditioning.  In fact, the eigen
> decomposition of any positive definite matrix is the same as the SVD.
>
> Where eigenvalues are complex, it is common to decompose to a block
> diagonal form where real values are on the diagonal and complex
> eigenvalues are represented as 2x2 blocks.  Our decomposition does this.
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jul 3, 2013 at 6:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Wed, Jun 19, 2013 at 12:20 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> >
> > As far as in-memory solvers, we have:
> >
> > 1) LR decomposition (tested and kinda fast)
> >
> > 2) Cholesky decomposition (tested)
> >
> > 3) SVD (tested)
> >
>
> Ted,
> so we don't have an eigensolver for the in-core Matrix?
>

Yes.  We do.

See org.apache.mahout.math.solver.EigenDecomposition

Looking at the history, I am slightly surprised to see that I was the one
who copied it from JAMA, replacing the Colt version and adding tests.


> I understand that the SVD can be obtained with an eigendecomposition but
> not the other way around, right?
>

Well, the eigen decomposition of the normal matrix can give the SVD, but
this is often not recommended due to poor conditioning.  In fact, the eigen
decomposition of any positive definite matrix is the same as the SVD.

Where eigenvalues are complex, it is common to decompose to a block
diagonal form where real values are on the diagonal and complex
eigenvalues are represented as 2x2 blocks.  Our decomposition does this.
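
A quick usage sketch (method names follow the JAMA-style API this was copied
from; worth double-checking against the class itself):

import org.apache.mahout.math.solver.EigenDecomposition
import org.apache.mahout.math.{DenseMatrix, Matrix}

// symmetric positive definite example, where eigen and SVD coincide: a = V D V'
val a: Matrix = new DenseMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))
val eig = new EigenDecomposition(a)
val v = eig.getV // eigenvectors as columns
val d = eig.getD // block diagonal: real eigenvalues on the diagonal; complex
                 // pairs, when they occur, appear as the 2x2 blocks above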

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Wed, Jun 19, 2013 at 12:20 AM, Ted Dunning <te...@gmail.com> wrote:

>
> As far as in-memory solvers, we have:
>
> 1) LR decomposition (tested and kinda fast)
>
> 2) Cholesky decomposition (tested)
>
> 3) SVD (tested)
>

Ted,
so we don't have an eigensolver for the in-core Matrix?

I understand that the SVD can be obtained with an eigendecomposition but not
the other way around, right?

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Well, one fundamental step to get there in the Mahout realm, the way I see
it, is to create DSLs for Mahout's DRMs in Spark. That's actually one of the
other reasons I chose not to follow Breeze. When we unwind Mahout DRMs, we
may see sparse or dense slices there with named vectors. Translating that
into Breeze blocks would be a problem (and annotation/named-vector treatment
is yet another problem, I guess).


On Mon, Jun 24, 2013 at 2:08 PM, Nick Pentreath <ni...@gmail.com> wrote:

> You're right on that - so far doubles is all I've needed and all I can
> currently see needing.
>
>
> I'll take a look at your project and see how easy it is to integrate with
> my Spark ALS and other code - syntax-wise it looks almost the same, so
> swapping out the linear algebra backend would be quite trivial in theory.
>
>
> So far I have a working implementation of both the implicit and explicit
> ALS versions that matches Mahout in RMSE given the same parameters on the 3
> MovieLens data sets. Still some work to do and more testing at scale, plus
> framework stuff. But hopefully I'd like to open source this at some point
> (but the Spark guys have a few projects upcoming so I'm also waiting a bit
> to see what happens there as it may end up duplicating a lot of what
> they're doing).
>
> —
> Sent from Mailbox for iPhone
>
> On Mon, Jun 24, 2013 at 10:55 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
> >> That looks great Dmitry!
> >>
> >>
> >> The thing about Breeze that drives the complexity in it is partly
> >> specialization for Float, Double and Int matrices, and partly getting
> >> the syntax to "just work" for all combinations of matrix types and
> >> operands etc. Mostly it does "just work", but occasionally not.
> > Yes, I noticed that, but since I am wrapping Mahout matrices, there's
> > only a choice of double-filled matrices and vectors. Actually, I would
> > argue that's the way it is supposed to be, in the interest of the KISS
> > principle. I am not sure I see a value in "int" matrices for any problem
> > I ever worked on, and skipping on precision to save space is an even more
> > far-fetched notion, as in real life the numbers don't take as much space
> > as their pre-vectorized features and annotations. In fact, model training
> > parts and linear algebra are not where the memory bottleneck seems to
> > fatten up at all, in my experience. There's often exponentially growing
> > CPU-bound behavior, yes, but not RAM.
> >>
> >>
> >> I am surprised that dense * sparse matrix doesn't work but I guess as I
> >> previously mentioned the sparse matrix support is a bit shaky.
> >>
> > This is solely based on eye-balling the trait architecture. I did not
> > actually attempt it. But there's no single unifying trait for sure.
> >>
> >>
> >> David Hall is pretty happy to both look into enhancements and help out
> >> for contributions (e.g. I'm hoping to find time to look into a proper
> >> Diagonal matrix implementation, and he was very helpful with pointers
> >> etc.), so please do drop things into the Google group mailing list.
> >> Hopefully wider adoption, especially by this type of community, will
> >> drive Breeze development.
> >>
> >>
> >> On another note, I also really like Scalding's matrix API, so Scala-ish
> >> wrappers for Mahout would be cool - another pet project of mine is a
> >> port of that API to Spark too :)
> >>
> >>
> >> N
> >>
> >>
> >>
> >> —
> >> Sent from Mailbox for iPhone
> >>
> >> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <ja...@gmail.com>
> >> wrote:
> >>
> >> > Yeah, I'm totally on board with a pretty Scala DSL on top of some of
> >> > our stuff.
> >> > In particular, I've been experimenting with wrapping the
> >> > DistributedRowMatrix in a Scalding wrapper, so we can do things like
> >> > val matrixAsTypedPipe =
> >> >   DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))
> >> > // e.g. L1 normalize:
> >> > matrixAsTypedPipe.map((idx, v): (Int, Vector) => (idx, v.normalize(1)))
> >> >   .write(new DistributedRowMatrixPipe(outputPath, conf))
> >> > // and anything else you would want to do with a scalding TypedPipe[Int, Vector]
> >> > Currently I've been doing this with a package structure directly in
> >> > Mahout, in:
> >> >    mahout/contrib/scalding
> >> > What do people think about having this be something real, after 0.8
> >> > goes out?  Are we ready for contrib modules which fold in diverse
> >> > external projects in new ways?
> >> > Integrating directly with Pig and Scalding is a bit too wide of a tent
> >> > for Mahout core, but putting these integrations in entirely new
> >> > projects is maybe a bit too far away.
> >> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> >> Dmitriy,
> >> >>
> >> >> This is very pretty.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >>
> >> >> > OK, so I was fairly easily able to build some DSL for our matrix
> >> >> > manipulation (similar to Breeze) in Scala:
> >> >> >
> >> >> > inline matrix or vector:
> >> >> >
> >> >> > val  a = dense((1, 2, 3), (3, 4, 5))
> >> >> >
> >> >> > val b:Vector = (1,2,3)
> >> >> >
> >> >> > block views and assignments (element/row/vector/block/block of row
> >> >> > or vector)
> >> >> >
> >> >> >
> >> >> > a(::, 0)
> >> >> > a(1, ::)
> >> >> > a(0 to 1, 1 to 2)
> >> >> >
> >> >> > assignments
> >> >> >
> >> >> > a(0, ::) :=(3, 5, 7)
> >> >> > a(0, 0 to 1) :=(3, 5)
> >> >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >> >> >
> >> >> > operators
> >> >> >
> >> >> > // hadamard
> >> >> > val c = a * b
> >> >> >  a *= b
> >> >> >
> >> >> > // matrix mul
> >> >> >  val m = a %*% b
> >> >> >
> >> >> > and a bunch of other little things like sum, mean, colMeans etc.
> >> >> > That much is easy.
> >> >> >
> >> >> > Also stuff like the ones found in Breeze, along the lines of
> >> >> >
> >> >> > val (u,v,s) = svd(a)
> >> >> >
> >> >> > diag ((1,2,3))
> >> >> >
> >> >> > and Cholesky in similar ways.
> >> >> >
> >> >> > I don't have "inline" initialization for sparse things (yet) simply
> >> >> > because I don't need them, but of course all the regular Java
> >> >> > constructors and methods are retained; all this is just syntactic
> >> >> > sugar in the spirit of DSLs, in the hope of making things a bit more
> >> >> > readable.
> >> >> >
> >> >> > my (very little, and very insignificantly opinionated, really)
> >> >> > criticism of Breeze in this context is its inconsistency between
> >> >> > dense and sparse representations, namely the lack of consistent
> >> >> > overarching trait(s), so that building structure-agnostic solvers
> >> >> > like Mahout's Cholesky solver is impossible, as is cross-type matrix
> >> >> > use (say, the way I understand it, it is pretty much impossible to
> >> >> > multiply a sparse matrix by a dense matrix).
> >> >> >
> >> >> > I suspect these problems stem from the fact that the authors, for
> >> >> > whatever reason, decided to hardwire dense things with JBlas solvers,
> >> >> > whereas I don't believe matrix storage structures must be. But these
> >> >> > problems do appear to be serious enough for me to ignore Breeze for
> >> >> > now. If I decide to plug in the JBlas dense solvers, I guess I will
> >> >> > just have them as yet another top-level routine interface taking any
> >> >> > Matrix, e.g.
> >> >> >
> >> >> > val (u,v,s) = svd(m, jblas=true)
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> >> >> >
> >> >> > > Thank you.
> >> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com>
> >> wrote:
> >> >> > >
> >> >> > >> I think that this contract has migrated a bit from the first
> >> >> > >> starting point.
> >> >> > >>
> >> >> > >> My feeling is that there is a de facto contract now that the
> >> >> > >> matrix slice is a single row.
> >> >> > >>
> >> >> > >> Sent from my iPhone
> >> >> > >>
> >> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
> >> >> wrote:
> >> >> > >>
> >> >> > >> > What does Matrix.iterateAll() contractually do? Practically it
> >> >> > >> > seems to be row-wise iteration for some implementations, but it
> >> >> > >> > doesn't seem to contractually state so in the javadoc. What is
> >> >> > >> > MatrixSlice if it is neither a row nor a column? How can I tell
> >> >> > >> > what exactly it is I am iterating over?
> >> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
> >> >> > >> >
> >> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <jake.mannix@gmail.com> wrote:
> >> >> > >> >>
> >> >> > >> >>>> Question #2: which in-core solvers are available for Mahout
> >> >> > >> >>>> matrices? I know there's SSVD, probably Cholesky; is there
> >> >> > >> >>>> something else? In particular, I need to solve linear
> >> >> > >> >>>> systems; I guess Cholesky should be equipped to do just that?
> >> >> > >> >>>>
> >> >> > >> >>>> Question #3: why did we try to import Colt solvers rather
> >> >> > >> >>>> than actually depend on Colt in the first place? Why did we
> >> >> > >> >>>> not accept Colt's sparse matrices and create native ones
> >> >> > >> >>>> instead?
> >> >> > >> >>>>
> >> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too
> >> >> > >> >>>> and seems like a well-rounded solution. However, it does not
> >> >> > >> >>>> seem to be actively supported, whereas I know Mahout
> >> >> > >> >>>> experienced continued enhancements to the in-core matrix
> >> >> > >> >>>> support.
> >> >> > >> >>>>
> >> >> > >> >>>
> >> >> > >> >>> Colt was totally abandoned, and I talked to the original
> >> >> > >> >>> author, and he blessed its adoption.  When we pulled it in, we
> >> >> > >> >>> found it was woefully undertested, and tried our best to hook
> >> >> > >> >>> it in with proper tests and use APIs that fit with the use
> >> >> > >> >>> cases we had.  Plus, we already had the start of some linear
> >> >> > >> >>> APIs (i.e. the Vector interface), and dropping the API
> >> >> > >> >>> completely seemed not terribly worth it at the time.
> >> >> > >> >>>
> >> >> > >> >>
> >> >> > >> >> There was even more to it than that.
> >> >> > >> >>
> >> >> > >> >> Colt was under-tested and there have been warts that had to
> >> >> > >> >> be pulled out in much of the code.
> >> >> > >> >>
> >> >> > >> >> But, worse than that, Colt's matrix and vector structure was a
> >> >> > >> >> real bugger to extend or change.  It also had all kinds of
> >> >> > >> >> cruft where it pretended to support matrices of things, but in
> >> >> > >> >> fact only supported matrices of doubles and floats.
> >> >> > >> >>
> >> >> > >> >> So using Colt as it was (and is, since it is largely abandoned)
> >> >> > >> >> was a non-starter.
> >> >> > >> >>
> >> >> > >> >> As far as in-memory solvers, we have:
> >> >> > >> >>
> >> >> > >> >> 1) LR decomposition (tested and kinda fast)
> >> >> > >> >>
> >> >> > >> >> 2) Cholesky decomposition (tested)
> >> >> > >> >>
> >> >> > >> >> 3) SVD (tested)
> >> >> > >> >>
> >> >> > >>
> >> >> > >
> >> >> >
> >> >>
> >> > --
> >> >   -jake
> >>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Nick Pentreath <ni...@gmail.com>.
You're right on that - so far doubles is all I've needed and all I can currently see needing. 


I'll take a look at your project and see how easy it is to integrate with my Spark ALS and other code - syntax-wise it looks almost the same, so swapping out the linear algebra backend would be quite trivial in theory.


So far I have a working implementation of both the implicit and explicit ALS versions that matches Mahout in RMSE given the same parameters on the 3 MovieLens data sets. Still some work to do and more testing at scale, plus framework stuff. But hopefully I can open-source this at some point (the Spark guys have a few projects upcoming, so I'm also waiting a bit to see what happens there, as it may end up duplicating a lot of what they're doing).

—
Sent from Mailbox for iPhone

On Mon, Jun 24, 2013 at 10:55 PM, Dmitriy Lyubimov <dl...@gmail.com>
wrote:

> On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <ni...@gmail.com> wrote:
>> That looks great Dmitry!
>>
>>
>> The thing about Breeze that drives the complexity in it is partly
>> specialization for Float, Double and Int matrices, and partly getting the
>> syntax to "just work" for all combinations of matrix types and operands
>> etc. Mostly it does "just work", but occasionally not.
> Yes, I noticed that, but since I am wrapping Mahout matrices, there's only
> a choice of double-filled matrices and vectors. Actually, I would argue
> that's the way it is supposed to be, in the interest of the KISS principle.
> I am not sure I see a value in "int" matrices for any problem I ever worked
> on, and skipping on precision to save space is an even more far-fetched
> notion, as in real life the numbers don't take as much space as their
> pre-vectorized features and annotations. In fact, model training parts and
> linear algebra are not where the memory bottleneck seems to fatten up at
> all, in my experience. There's often exponentially growing CPU-bound
> behavior, yes, but not RAM.
>>
>>
>> I am surprised that dense * sparse matrix doesn't work but I guess as I
>> previously mentioned the sparse matrix support is a bit shaky.
>>
> This is solely based on eye-balling the trait architecture. I did not
> actually attempt it. But there's no single unifying trait for sure.
>>
>>
>> David Hall is pretty happy to both look into enhancements and help out for
>> contributions (e.g. I'm hoping to find time to look into a proper Diagonal
>> matrix implementation, and he was very helpful with pointers etc.), so
>> please do drop things into the Google group mailing list. Hopefully wider
>> adoption, especially by this type of community, will drive Breeze
>> development.
>>
>>
>> On another note, I also really like Scalding's matrix API, so Scala-ish
>> wrappers for Mahout would be cool - another pet project of mine is a port
>> of that API to Spark too :)
>>
>>
>> N
>>
>>
>>
>> —
>> Sent from Mailbox for iPhone
>>
>> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <ja...@gmail.com>
>> wrote:
>>
>> > Yeah, I'm totally on board with a pretty scala DSL on top of some of our
>> > stuff.
>> > In particular, I've been experimenting with with wrapping the
>> > DistributedRowMatrix
>> > in a scalding wrapper, so we can do things like
>> > val matrixAsTypedPipe =
>> >    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
>> > path, conf))
>> > // e.g. L1 normalize:
>> >   matrixAsTypedPipe.map((idx, v) : (Int, Vector) => (idx,
>> v.normalize(1)) )
>> >                                  .write(new
>> > DistributedRowMatrixPipe(outputPath, conf))
>> > // and anything else you would want to do with a scalding TypedPipe[Int,
>> > Vector]
>> > Currently I've been doing this with a package structure directly in
>> Mahout,
>> > in:
>> >    mahout/contrib/scalding
>> > What do people think about having this be something real, after 0.8 goes
>> > out?  Are
>> > we ready for contrib modules which fold in diverse external projects in
>> new
>> > ways?
>> > Integrating directly with pig and scalding is a bit too wide of a tent
>> for
>> > Mahout core,
>> > but putting these integrations in entirely new projects is maybe a bit
>> too
>> > far away.
>> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>> >> Dmitriy,
>> >>
>> >> This is very pretty.
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >>
>> >> > Ok, so i was fairly easily able to build some DSL for our matrix
>> >> > manipulation (similar to breeze) in scala:
>> >> >
>> >> > inline matrix or vector:
>> >> >
>> >> > val  a = dense((1, 2, 3), (3, 4, 5))
>> >> >
>> >> > val b:Vector = (1,2,3)
>> >> >
>> >> > block views and assignments (element/row/vector/block/block of row or
>> >> > vector)
>> >> >
>> >> >
>> >> > a(::, 0)
>> >> > a(1, ::)
>> >> > a(0 to 1, 1 to 2)
>> >> >
>> >> > assignments
>> >> >
>> >> > a(0, ::) :=(3, 5, 7)
>> >> > a(0, 0 to 1) :=(3, 5)
>> >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
>> >> >
>> >> > operators
>> >> >
>> >> > // hadamard
>> >> > val c = a * b
>> >> >  a *= b
>> >> >
>> >> > // matrix mul
>> >> >  val m = a %*% b
>> >> >
>> >> > and bunch of other little things like sum, mean, colMeans etc. That
>> much
>> >> is
>> >> > easy.
>> >> >
>> >> > Also stuff like the ones found in breeze along the lines
>> >> >
>> >> > val (u,v,s) = svd(a)
>> >> >
>> >> > diag ((1,2,3))
>> >> >
>> >> > and Cholesky in similar ways.
>> >> >
>> >> > I don't have "inline" initialization for sparse things (yet) simply
>> >> because
>> >> > i don't need them, but of course all regular java constructors and
>> >> methods
>> >> > are retained, all that is just a syntactic sugar in the spirit of
>> DSLs in
>> >> > hope to make things a bit mroe readable.
>> >> >
>> >> > my (very little, and very insignificantly opinionated, really)
>> criticism
>> >> of
>> >> > Breeze in this context is its inconsistency between dense and sparse
>> >> > representations, namely, lack of consistent overarching trait(s), so
>> that
>> >> > building structure-agnostic solvers like Mahout's Cholesky solver is
>> >> > impossible, as well as cross-type matrix use (say, the way i
>> understand
>> >> it,
>> >> > it is pretty much impossible to multiply a sparse matrix by a dense
>> >> matrix).
>> >> >
>> >> > I suspect these problems stem from the fact that the authors for
>> whatever
>> >> > reason decided to hardwire dense things with JBlas solvers whereas i
>> dont
>> >> > believe matrix storage structures must be. But these problems do
>> appear
>> >> to
>> >> > be serious enough  for me to ignore Breeze for now. If i decide to
>> plug
>> >> in
>> >> > jblas dense solvers, i guess i will just have them as yet another
>> >> top-level
>> >> > routine interface taking any Matrix, e.g.
>> >> >
>> >> > val (u,v,s) = svd(m, jblas=true)
>> >> >
>> >> >
>> >> >
>> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Thank you.
>> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com>
>> wrote:
>> >> > >
>> >> > >> I think that this contract has migrated a bit from the first
>> starting
>> >> > >> point.
>> >> > >>
>> >> > >> My feeling is that there is a de facto contract now that the matrix
>> >> > slice
>> >> > >> is a single row.
>> >> > >>
>> >> > >> Sent from my iPhone
>> >> > >>
>> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
>> >> wrote:
>> >> > >>
>> >> > >> > What does Matrix.iterateAll() contractually do? Practically it
>> >> seems
>> >> > >> to be
>> >> > >> > row-wise iteration for some implementations but it doesn't seem to
>> >> > >> > contractually state so in the javadoc. What is MatrixSlice if it
>> is
>> >> > >> neither
>> >> > >> > a row nor a column? How can i tell what exactly it is i am
>> iterating
>> >> > >> over?
>> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
>> >> > wrote:
>> >> > >> >
>> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
>> >> jake.mannix@gmail.com>
>> >> > >> >> wrote:
>> >> > >> >>
>> >> > >> >>>> Question #2: which in-core solvers are available for Mahout
>> >> > >> matrices? I
>> >> > >> >>>> know there's SSVD, probably Cholesky, is there something
>> else? In
>> >> > >> >>>> particular, i need to be solving linear systems, I guess
>> Cholesky
>> >> > >> should
>> >> > >> >>> be
>> >> > >> >>>> equipped enough to do just that?
>> >> > >> >>>>
>> >> > >> >>>> Question #3: why did we try to import Colt solvers rather than
>> >> > >> actually
>> >> > >> >>>> depend on Colt in the first place? Why did we not accept
>> Colt's
>> >> > >> sparse
>> >> > >> >>>> matrices and created native ones instead?
>> >> > >> >>>>
>> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too and
>> >> seems
>> >> > >> >> like
>> >> > >> >>> a
>> >> > >> >>>> well-rounded solution. However, it doesn't seem like being
>> >> actively
>> >> > >> >>>> supported, whereas I know Mahout experienced continued
>> >> enhancements
>> >> > >> to
>> >> > >> >>> the
>> >> > >> >>>> in-core matrix support.
>> >> > >> >>>>
>> >> > >> >>>
>> >> > >> >>> Colt was totally abandoned, and I talked to the original author
>> >> and
>> >> > he
>> >> > >> >>> blessed its adoption.  When we pulled it in, we found it was
>> >> > woefully
>> >> > >> >>> undertested,
>> >> > >> >>> and tried our best to hook it in with proper tests and use APIs
>> >> that
>> >> > >> fit
>> >> > >> >>> with
>> >> > >> >>> the use cases we had.  Plus, we already had the start of some
>> >> linear
>> >> > >> apis
>> >> > >> >>> (i.e.
>> >> > >> >>> the Vector interface) and dropping the API completely seemed
>> not
>> >> > >> terribly
>> >> > >> >>> worth it at the time.
>> >> > >> >>>
>> >> > >> >>
>> >> > >> >> There was even more to it than that.
>> >> > >> >>
>> >> > >> >> Colt was under-tested and there have been warts that had to be
>> >> pulled
>> >> > >> out
>> >> > >> >> in much of the code.
>> >> > >> >>
>> >> > >> >> But, worse than that, Colt's matrix and vector structure was a
>> real
>> >> > >> bugger
>> >> > >> >> to extend or change.  It also had all kinds of cruft where it
>> >> > >> pretended to
>> >> > >> >> support matrices of things, but in fact only supported matrices
>> of
>> >> > >> doubles
>> >> > >> >> and floats.
>> >> > >> >>
>> >> > >> >> So using Colt as it was (and is since it is largely abandoned)
>> was
>> >> a
>> >> > >> >> non-starter.
>> >> > >> >>
>> >> > >> >> As far as in-memory solvers, we have:
>> >> > >> >>
>> >> > >> >> 1) LR decomposition (tested and kinda fast)
>> >> > >> >>
>> >> > >> >> 2) Cholesky decomposition (tested)
>> >> > >> >>
>> >> > >> >> 3) SVD (tested)
>> >> > >> >>
>> >> > >>
>> >> > >
>> >> >
>> >>
>> > --
>> >   -jake
>>

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
I think that contrib modules would be very interesting.  Specifically, good
Scala DSL, pig integration and so on.


On Mon, Jun 24, 2013 at 9:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <nick.pentreath@gmail.com
> >wrote:
>
> > That looks great Dmitry!
> >
> >
> > The thing about Breeze that drives the complexity in it is partly
> > specialization for Float, Double and Int matrices, and partly getting the
> > syntax to "just work" for all combinations of matrix types and operands
> > etc. mostly it does "just work" but occasionally not.
>
> yes i noticed that, but since i am wrapping Mahout matrices, there's only a
> choice of double-filled matrices and vectors. Actually, i would argue
> that's the way it is supposed to be in the interest of KISS principle. I am
> not sure i see a value in "int" matrices for any problem i ever worked on,
> and skipping on precision to save the space is even more far-fetched notion
> as in real life numbers don't take as much space as their pre-vectorized
> features and annotations. In fact, model training parts and linear algebra
> are not where memory bottleneck seems to fat-up at all in my experience.
> There's often exponentially growing cpu-bound behavior, yes, but not RAM.
>
>
>
> >
> >
> > I am surprised that dense * sparse matrix doesn't work but I guess as I
> > previously mentioned the sparse matrix support is a bit shaky.
> >
> This is solely based on eye-balling the trait architecture. I did not
> actually attempt it. But there's no single unifying trait for sure.
>
> >
> >
> > David Hall is pretty happy to both look into enhancements and help out
> for
> > contributions (eg I'm hoping to find time to look into a proper Diagonal
> > matrix implementation and he was very helpful with pointers etc), so
> please
> > do drop things into the google group mailing list. Hopefully wider
> adoption
> > especially by this type of community will drive Breeze development.
> >
> >
> > In another note I also really like Scalding's matrix API so Scala-ish
> > wrappers for mahout would be cool - another pet project of mine is a port
> > of that API to spark too :)
> >
> >
> > N
> >
> >
> >
> > —
> > Sent from Mailbox for iPhone
> >
> > On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <ja...@gmail.com>
> > wrote:
> >
> > > Yeah, I'm totally on board with a pretty scala DSL on top of some of
> our
> > > stuff.
> > > In particular, I've been experimenting with wrapping the
> > > DistributedRowMatrix
> > > in a scalding wrapper, so we can do things like
> > > val matrixAsTypedPipe =
> > >    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
> > > path, conf))
> > > // e.g. L1 normalize:
> > >   matrixAsTypedPipe.map((idx, v) : (Int, Vector) => (idx,
> > v.normalize(1)) )
> > >                                  .write(new
> > > DistributedRowMatrixPipe(outputPath, conf))
> > > // and anything else you would want to do with a scalding
> TypedPipe[Int,
> > > Vector]
> > > Currently I've been doing this with a package structure directly in
> > Mahout,
> > > in:
> > >    mahout/contrib/scalding
> > > What do people think about having this be something real, after 0.8
> goes
> > > out?  Are
> > > we ready for contrib modules which fold in diverse external projects in
> > new
> > > ways?
> > > Integrating directly with pig and scalding is a bit too wide of a tent
> > for
> > > Mahout core,
> > > but putting these integrations in entirely new projects is maybe a bit
> > too
> > > far away.
> > > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >> Dmitriy,
> > >>
> > >> This is very pretty.
> > >>
> > >>
> > >>
> > >>
> > >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > >> wrote:
> > >>
> > >> > Ok, so i was fairly easily able to build some DSL for our matrix
> > >> > manipulation (similar to breeze) in scala:
> > >> >
> > >> > inline matrix or vector:
> > >> >
> > >> > val  a = dense((1, 2, 3), (3, 4, 5))
> > >> >
> > >> > val b:Vector = (1,2,3)
> > >> >
> > >> > block views and assignments (element/row/vector/block/block of row
> or
> > >> > vector)
> > >> >
> > >> >
> > >> > a(::, 0)
> > >> > a(1, ::)
> > >> > a(0 to 1, 1 to 2)
> > >> >
> > >> > assignments
> > >> >
> > >> > a(0, ::) :=(3, 5, 7)
> > >> > a(0, 0 to 1) :=(3, 5)
> > >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> > >> >
> > >> > operators
> > >> >
> > >> > // hadamard
> > >> > val c = a * b
> > >> >  a *= b
> > >> >
> > >> > // matrix mul
> > >> >  val m = a %*% b
> > >> >
> > >> > and bunch of other little things like sum, mean, colMeans etc. That
> > much
> > >> is
> > >> > easy.
> > >> >
> > >> > Also stuff like the ones found in breeze along the lines
> > >> >
> > >> > val (u,v,s) = svd(a)
> > >> >
> > >> > diag ((1,2,3))
> > >> >
> > >> > and Cholesky in similar ways.
> > >> >
> > >> > I don't have "inline" initialization for sparse things (yet) simply
> > >> because
> > >> > i don't need them, but of course all regular java constructors and
> > >> methods
> > >> > are retained, all that is just a syntactic sugar in the spirit of
> > DSLs in
> > >> > hope to make things a bit more readable.
> > >> >
> > >> > my (very little, and very insignificantly opinionated, really)
> > criticism
> > >> of
> > >> > Breeze in this context is its inconsistency between dense and sparse
> > >> > representations, namely, lack of consistent overarching trait(s), so
> > that
> > >> > building structure-agnostic solvers like Mahout's Cholesky solver is
> > >> > impossible, as well as cross-type matrix use (say, the way i
> > understand
> > >> it,
> > >> > it is pretty much impossible to multiply a sparse matrix by a dense
> > >> matrix).
> > >> >
> > >> > I suspect these problems stem from the fact that the authors for
> > whatever
> > >> > reason decided to hardwire dense things with JBlas solvers whereas i
> > dont
> > >> > believe matrix storage structures must be. But these problems do
> > appear
> > >> to
> > >> > be serious enough  for me to ignore Breeze for now. If i decide to
> > plug
> > >> in
> > >> > jblas dense solvers, i guess i will just have them as yet another
> > >> top-level
> > >> > routine interface taking any Matrix, e.g.
> > >> >
> > >> > val (u,v,s) = svd(m, jblas=true)
> > >> >
> > >> >
> > >> >
> > >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Thank you.
> > >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com>
> > wrote:
> > >> > >
> > >> > >> I think that this contract has migrated a bit from the first
> > starting
> > >> > >> point.
> > >> > >>
> > >> > >> My feeling is that there is a de facto contract now that the
> matrix
> > >> > slice
> > >> > >> is a single row.
> > >> > >>
> > >> > >> Sent from my iPhone
> > >> > >>
> > >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
> > >> wrote:
> > >> > >>
> > >> > >> > What does Matrix.iterateAll() contractually do? Practically it
> > >> seems
> > >> > >> to be
> > >> > >> > row-wise iteration for some implementations but it doesn't seem to
> > >> > >> > contractually state so in the javadoc. What is MatrixSlice if
> it
> > is
> > >> > >> neither
> > >> > >> > a row nor a column? How can i tell what exactly it is i am
> > iterating
> > >> > >> over?
> > >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <ted.dunning@gmail.com
> >
> > >> > wrote:
> > >> > >> >
> > >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
> > >> jake.mannix@gmail.com>
> > >> > >> >> wrote:
> > >> > >> >>
> > >> > >> >>>> Question #2: which in-core solvers are available for Mahout
> > >> > >> matrices? I
> > >> > >> >>>> know there's SSVD, probably Cholesky, is there something
> > else? In
> > >> > >> >>>> particular, i need to be solving linear systems, I guess
> > Cholesky
> > >> > >> should
> > >> > >> >>> be
> > >> > >> >>>> equipped enough to do just that?
> > >> > >> >>>>
> > >> > >> >>>> Question #3: why did we try to import Colt solvers rather
> than
> > >> > >> actually
> > >> > >> >>>> depend on Colt in the first place? Why did we not accept
> > Colt's
> > >> > >> sparse
> > >> > >> >>>> matrices and created native ones instead?
> > >> > >> >>>>
> > >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too
> and
> > >> seems
> > >> > >> >> like
> > >> > >> >>> a
> > >> > >> >>>> well-rounded solution. However, it doesn't seem like being
> > >> actively
> > >> > >> >>>> supported, whereas I know Mahout experienced continued
> > >> enhancements
> > >> > >> to
> > >> > >> >>> the
> > >> > >> >>>> in-core matrix support.
> > >> > >> >>>>
> > >> > >> >>>
> > >> > >> >>> Colt was totally abandoned, and I talked to the original
> author
> > >> and
> > >> > he
> > >> > >> >>> blessed its adoption.  When we pulled it in, we found it was
> > >> > woefully
> > >> > >> >>> undertested,
> > >> > >> >>> and tried our best to hook it in with proper tests and use
> APIs
> > >> that
> > >> > >> fit
> > >> > >> >>> with
> > >> > >> >>> the use cases we had.  Plus, we already had the start of some
> > >> linear
> > >> > >> apis
> > >> > >> >>> (i.e.
> > >> > >> >>> the Vector interface) and dropping the API completely seemed
> > not
> > >> > >> terribly
> > >> > >> >>> worth it at the time.
> > >> > >> >>>
> > >> > >> >>
> > >> > >> >> There was even more to it than that.
> > >> > >> >>
> > >> > >> >> Colt was under-tested and there have been warts that had to be
> > >> pulled
> > >> > >> out
> > >> > >> >> in much of the code.
> > >> > >> >>
> > >> > >> >> But, worse than that, Colt's matrix and vector structure was a
> > real
> > >> > >> bugger
> > >> > >> >> to extend or change.  It also had all kinds of cruft where it
> > >> > >> pretended to
> > >> > >> >> support matrices of things, but in fact only supported
> matrices
> > of
> > >> > >> doubles
> > >> > >> >> and floats.
> > >> > >> >>
> > >> > >> >> So using Colt as it was (and is since it is largely abandoned)
> > was
> > >> a
> > >> > >> >> non-starter.
> > >> > >> >>
> > >> > >> >> As far as in-memory solvers, we have:
> > >> > >> >>
> > >> > >> >> 1) LR decomposition (tested and kinda fast)
> > >> > >> >>
> > >> > >> >> 2) Cholesky decomposition (tested)
> > >> > >> >>
> > >> > >> >> 3) SVD (tested)
> > >> > >> >>
> > >> > >>
> > >> > >
> > >> >
> > >>
> > > --
> > >   -jake
> >
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Jun 24, 2013 at 1:46 PM, Nick Pentreath <ni...@gmail.com> wrote:

> That looks great Dmitry!
>
>
> The thing about Breeze that drives the complexity in it is partly
> specialization for Float, Double and Int matrices, and partly getting the
> syntax to "just work" for all combinations of matrix types and operands
> etc. mostly it does "just work" but occasionally not.

yes I noticed that, but since I am wrapping Mahout matrices, there's only a
choice of double-filled matrices and vectors. Actually, I would argue that's
the way it is supposed to be, in the interest of the KISS principle. I am not
sure I see value in "int" matrices for any problem I ever worked on, and
skimping on precision to save space is an even more far-fetched notion, since
in real life the numbers don't take as much space as their pre-vectorized
features and annotations. In fact, model training and linear algebra are not
where the memory bottleneck builds up at all, in my experience. There's often
exponentially growing cpu-bound behavior, yes, but not RAM.



>
>
> I am surprised that dense * sparse matrix doesn't work but I guess as I
> previously mentioned the sparse matrix support is a bit shaky.
>
This is solely based on eye-balling the trait architecture. I did not
actually attempt it. But there's no single unifying trait for sure.

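What I mean by an overarching trait is something like this (purely
hypothetical, just to illustrate the idea; nothing like it exists in Breeze):

trait MatrixLike {
  def apply(row: Int, col: Int): Double
  def %*%(that: MatrixLike): MatrixLike // would work across dense/sparse impls
}

i.e. something that both the dense and sparse types extend, which is roughly
what Mahout's Matrix interface already gives us on the java side.
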
>
>
> David Hall is pretty happy to both look into enhancements and help out for
> contributions (eg I'm hoping to find time to look into a proper Diagonal
> matrix implementation and he was very helpful with pointers etc), so please
> do drop things into the google group mailing list. Hopefully wider adoption
> especially by this type of community will drive Breeze development.
>
>
> In another note I also really like Scalding's matrix API so Scala-ish
> wrappers for mahout would be cool - another pet project of mine is a port
> of that API to spark too :)
>
>
> N
>
>
>
> —
> Sent from Mailbox for iPhone
>
> On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Yeah, I'm totally on board with a pretty scala DSL on top of some of our
> > stuff.
> > In particular, I've been experimenting with wrapping the
> > DistributedRowMatrix
> > in a scalding wrapper, so we can do things like
> > val matrixAsTypedPipe =
> >    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
> > path, conf))
> > // e.g. L1 normalize:
> >   matrixAsTypedPipe.map((idx, v) : (Int, Vector) => (idx,
> v.normalize(1)) )
> >                                  .write(new
> > DistributedRowMatrixPipe(outputPath, conf))
> > // and anything else you would want to do with a scalding TypedPipe[Int,
> > Vector]
> > Currently I've been doing this with a package structure directly in
> Mahout,
> > in:
> >    mahout/contrib/scalding
> > What do people think about having this be something real, after 0.8 goes
> > out?  Are
> > we ready for contrib modules which fold in diverse external projects in
> new
> > ways?
> > Integrating directly with pig and scalding is a bit too wide of a tent
> for
> > Mahout core,
> > but putting these integrations in entirely new projects is maybe a bit
> too
> > far away.
> > On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >> Dmitriy,
> >>
> >> This is very pretty.
> >>
> >>
> >>
> >>
> >> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >>
> >> > Ok, so i was fairly easily able to build some DSL for our matrix
> >> > manipulation (similar to breeze) in scala:
> >> >
> >> > inline matrix or vector:
> >> >
> >> > val  a = dense((1, 2, 3), (3, 4, 5))
> >> >
> >> > val b:Vector = (1,2,3)
> >> >
> >> > block views and assignments (element/row/vector/block/block of row or
> >> > vector)
> >> >
> >> >
> >> > a(::, 0)
> >> > a(1, ::)
> >> > a(0 to 1, 1 to 2)
> >> >
> >> > assignments
> >> >
> >> > a(0, ::) :=(3, 5, 7)
> >> > a(0, 0 to 1) :=(3, 5)
> >> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >> >
> >> > operators
> >> >
> >> > // hadamard
> >> > val c = a * b
> >> >  a *= b
> >> >
> >> > // matrix mul
> >> >  val m = a %*% b
> >> >
> >> > and bunch of other little things like sum, mean, colMeans etc. That
> much
> >> is
> >> > easy.
> >> >
> >> > Also stuff like the ones found in breeze along the lines
> >> >
> >> > val (u,v,s) = svd(a)
> >> >
> >> > diag ((1,2,3))
> >> >
> >> > and Cholesky in similar ways.
> >> >
> >> > I don't have "inline" initialization for sparse things (yet) simply
> >> because
> >> > i don't need them, but of course all regular java constructors and
> >> methods
> >> > are retained, all that is just a syntactic sugar in the spirit of
> DSLs in
> >> > hope to make things a bit more readable.
> >> >
> >> > my (very little, and very insignificantly opinionated, really)
> criticism
> >> of
> >> > Breeze in this context is its inconsistency between dense and sparse
> >> > representations, namely, lack of consistent overarching trait(s), so
> that
> >> > building structure-agnostic solvers like Mahout's Cholesky solver is
> >> > impossible, as well as cross-type matrix use (say, the way i
> understand
> >> it,
> >> > it is pretty much impossible to multiply a sparse matrix by a dense
> >> matrix).
> >> >
> >> > I suspect these problems stem from the fact that the authors for
> whatever
> >> > reason decided to hardwire dense things with JBlas solvers whereas i
> dont
> >> > believe matrix storage structures must be. But these problems do
> appear
> >> to
> >> > be serious enough  for me to ignore Breeze for now. If i decide to
> plug
> >> in
> >> > jblas dense solvers, i guess i will just have them as yet another
> >> top-level
> >> > routine interface taking any Matrix, e.g.
> >> >
> >> > val (u,v,s) = svd(m, jblas=true)
> >> >
> >> >
> >> >
> >> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> >> > wrote:
> >> >
> >> > > Thank you.
> >> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com>
> wrote:
> >> > >
> >> > >> I think that this contract has migrated a bit from the first
> starting
> >> > >> point.
> >> > >>
> >> > >> My feeling is that there is a de facto contract now that the matrix
> >> > slice
> >> > >> is a single row.
> >> > >>
> >> > >> Sent from my iPhone
> >> > >>
> >> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
> >> wrote:
> >> > >>
> >> > >> > What does Matrix.iterateAll() contractually do? Practically it
> >> seems
> >> > >> to be
> >> > >> > row-wise iteration for some implementations but it doesn't seem to
> >> > >> > contractually state so in the javadoc. What is MatrixSlice if it
> is
> >> > >> neither
> >> > >> > a row nor a column? How can i tell what exactly it is i am
> iterating
> >> > >> over?
> >> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
> >> > wrote:
> >> > >> >
> >> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
> >> jake.mannix@gmail.com>
> >> > >> >> wrote:
> >> > >> >>
> >> > >> >>>> Question #2: which in-core solvers are available for Mahout
> >> > >> matrices? I
> >> > >> >>>> know there's SSVD, probably Cholesky, is there something
> else? In
> >> > >> >>>> particular, i need to be solving linear systems, I guess
> Cholesky
> >> > >> should
> >> > >> >>> be
> >> > >> >>>> equipped enough to do just that?
> >> > >> >>>>
> >> > >> >>>> Question #3: why did we try to import Colt solvers rather than
> >> > >> actually
> >> > >> >>>> depend on Colt in the first place? Why did we not accept
> Colt's
> >> > >> sparse
> >> > >> >>>> matrices and created native ones instead?
> >> > >> >>>>
> >> > >> >>>> Colt seems to have a notion of sparse in-core matrices too and
> >> seems
> >> > >> >> like
> >> > >> >>> a
> >> > >> >>>> well-rounded solution. However, it doesn't seem like being
> >> actively
> >> > >> >>>> supported, whereas I know Mahout experienced continued
> >> enhancements
> >> > >> to
> >> > >> >>> the
> >> > >> >>>> in-core matrix support.
> >> > >> >>>>
> >> > >> >>>
> >> > >> >>> Colt was totally abandoned, and I talked to the original author
> >> and
> >> > he
> >> > >> >>> blessed its adoption.  When we pulled it in, we found it was
> >> > woefully
> >> > >> >>> undertested,
> >> > >> >>> and tried our best to hook it in with proper tests and use APIs
> >> that
> >> > >> fit
> >> > >> >>> with
> >> > >> >>> the use cases we had.  Plus, we already had the start of some
> >> linear
> >> > >> apis
> >> > >> >>> (i.e.
> >> > >> >>> the Vector interface) and dropping the API completely seemed
> not
> >> > >> terribly
> >> > >> >>> worth it at the time.
> >> > >> >>>
> >> > >> >>
> >> > >> >> There was even more to it than that.
> >> > >> >>
> >> > >> >> Colt was under-tested and there have been warts that had to be
> >> pulled
> >> > >> out
> >> > >> >> in much of the code.
> >> > >> >>
> >> > >> >> But, worse than that, Colt's matrix and vector structure was a
> real
> >> > >> bugger
> >> > >> >> to extend or change.  It also had all kinds of cruft where it
> >> > >> pretended to
> >> > >> >> support matrices of things, but in fact only supported matrices
> of
> >> > >> doubles
> >> > >> >> and floats.
> >> > >> >>
> >> > >> >> So using Colt as it was (and is since it is largely abandoned)
> was
> >> a
> >> > >> >> non-starter.
> >> > >> >>
> >> > >> >> As far as in-memory solvers, we have:
> >> > >> >>
> >> > >> >> 1) LR decomposition (tested and kinda fast)
> >> > >> >>
> >> > >> >> 2) Cholesky decomposition (tested)
> >> > >> >>
> >> > >> >> 3) SVD (tested)
> >> > >> >>
> >> > >>
> >> > >
> >> >
> >>
> > --
> >   -jake
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Nick Pentreath <ni...@gmail.com>.
That looks great, Dmitriy!


The thing about Breeze that drives the complexity in it is partly specialization for Float, Double and Int matrices, and partly getting the syntax to "just work" for all combinations of matrix types and operands etc. Mostly it does "just work", but occasionally not.



I am surprised that dense * sparse matrix doesn't work, but I guess as I previously mentioned the sparse matrix support is a bit shaky.


David Hall is pretty happy to both look into enhancements and help out for contributions (e.g. I'm hoping to find time to look into a proper Diagonal matrix implementation, and he was very helpful with pointers etc), so please do drop things into the google group mailing list. Hopefully wider adoption especially by this type of community will drive Breeze development.


In another note I also really like Scalding's matrix API, so Scala-ish wrappers for Mahout would be cool - another pet project of mine is a port of that API to Spark too :)


N



—
Sent from Mailbox for iPhone

On Mon, Jun 24, 2013 at 10:25 PM, Jake Mannix <ja...@gmail.com>
wrote:

> Yeah, I'm totally on board with a pretty scala DSL on top of some of our
> stuff.
> In particular, I've been experimenting with wrapping the
> DistributedRowMatrix
> in a scalding wrapper, so we can do things like
> val matrixAsTypedPipe =
>    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
> path, conf))
> // e.g. L1 normalize:
>   matrixAsTypedPipe.map((idx, v) : (Int, Vector) => (idx, v.normalize(1)) )
>                                  .write(new
> DistributedRowMatrixPipe(outputPath, conf))
> // and anything else you would want to do with a scalding TypedPipe[Int,
> Vector]
> Currently I've been doing this with a package structure directly in Mahout,
> in:
>    mahout/contrib/scalding
> What do people think about having this be something real, after 0.8 goes
> out?  Are
> we ready for contrib modules which fold in diverse external projects in new
> ways?
> Integrating directly with pig and scalding is a bit too wide of a tent for
> Mahout core,
> but putting these integrations in entirely new projects is maybe a bit too
> far away.
> On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com> wrote:
>> Dmitriy,
>>
>> This is very pretty.
>>
>>
>>
>>
>> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>>
>> > Ok, so i was fairly easily able to build some DSL for our matrix
>> > manipulation (similar to breeze) in scala:
>> >
>> > inline matrix or vector:
>> >
>> > val  a = dense((1, 2, 3), (3, 4, 5))
>> >
>> > val b:Vector = (1,2,3)
>> >
>> > block views and assignments (element/row/vector/block/block of row or
>> > vector)
>> >
>> >
>> > a(::, 0)
>> > a(1, ::)
>> > a(0 to 1, 1 to 2)
>> >
>> > assignments
>> >
>> > a(0, ::) :=(3, 5, 7)
>> > a(0, 0 to 1) :=(3, 5)
>> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
>> >
>> > operators
>> >
>> > // hadamard
>> > val c = a * b
>> >  a *= b
>> >
>> > // matrix mul
>> >  val m = a %*% b
>> >
>> > and bunch of other little things like sum, mean, colMeans etc. That much
>> is
>> > easy.
>> >
>> > Also stuff like the ones found in breeze along the lines
>> >
>> > val (u,v,s) = svd(a)
>> >
>> > diag ((1,2,3))
>> >
>> > and Cholesky in similar ways.
>> >
>> > I don't have "inline" initialization for sparse things (yet) simply
>> because
>> > i don't need them, but of course all regular java constructors and
>> methods
>> > are retained, all that is just a syntactic sugar in the spirit of DSLs in
>> > hope to make things a bit more readable.
>> >
>> > my (very little, and very insignificantly opinionated, really) criticism
>> of
>> > Breeze in this context is its inconsistency between dense and sparse
>> > representations, namely, lack of consistent overarching trait(s), so that
>> > building structure-agnostic solvers like Mahout's Cholesky solver is
>> > impossible, as well as cross-type matrix use (say, the way i understand
>> it,
>> > it is pretty much impossible to multiply a sparse matrix by a dense
>> matrix).
>> >
>> > I suspect these problems stem from the fact that the authors for whatever
>> > reason decided to hardwire dense things with JBlas solvers whereas i dont
>> > believe matrix storage structures must be. But these problems do appear
>> to
>> > be serious enough  for me to ignore Breeze for now. If i decide to plug
>> in
>> > jblas dense solvers, i guess i will just have them as yet another
>> top-level
>> > routine interface taking any Matrix, e.g.
>> >
>> > val (u,v,s) = svd(m, jblas=true)
>> >
>> >
>> >
>> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
>> > wrote:
>> >
>> > > Thank you.
>> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com> wrote:
>> > >
>> > >> I think that this contract has migrated a bit from the first starting
>> > >> point.
>> > >>
>> > >> My feeling is that there is a de facto contract now that the matrix
>> > slice
>> > >> is a single row.
>> > >>
>> > >> Sent from my iPhone
>> > >>
>> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
>> wrote:
>> > >>
>> > >> > What does Matrix.iterateAll() contractually do? Practically it
>> seems
>> > >> to be
>> > >> > row-wise iteration for some implementations but it doesn't seem to
>> > >> > contractually state so in the javadoc. What is MatrixSlice if it is
>> > >> neither
>> > >> > a row nor a column? How can i tell what exactly it is i am iterating
>> > >> over?
>> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
>> > wrote:
>> > >> >
>> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
>> jake.mannix@gmail.com>
>> > >> >> wrote:
>> > >> >>
>> > >> >>>> Question #2: which in-core solvers are available for Mahout
>> > >> matrices? I
>> > >> >>>> know there's SSVD, probably Cholesky, is there something else? In
>> > >> >>>> particular, i need to be solving linear systems, I guess Cholesky
>> > >> should
>> > >> >>> be
>> > >> >>>> equipped enough to do just that?
>> > >> >>>>
>> > >> >>>> Question #3: why did we try to import Colt solvers rather than
>> > >> actually
>> > >> >>>> depend on Colt in the first place? Why did we not accept Colt's
>> > >> sparse
>> > >> >>>> matrices and created native ones instead?
>> > >> >>>>
>> > >> >>>> Colt seems to have a notion of sparse in-core matrices too and
>> seems
>> > >> >> like
>> > >> >>> a
>> > >> >>>> well-rounded solution. However, it doesn't seem like being
>> actively
>> > >> >>>> supported, whereas I know Mahout experienced continued
>> enhancements
>> > >> to
>> > >> >>> the
>> > >> >>>> in-core matrix support.
>> > >> >>>>
>> > >> >>>
>> > >> >>> Colt was totally abandoned, and I talked to the original author
>> and
>> > he
>> > >> >>> blessed its adoption.  When we pulled it in, we found it was
>> > woefully
>> > >> >>> undertested,
>> > >> >>> and tried our best to hook it in with proper tests and use APIs
>> that
>> > >> fit
>> > >> >>> with
>> > >> >>> the use cases we had.  Plus, we already had the start of some
>> linear
>> > >> apis
>> > >> >>> (i.e.
>> > >> >>> the Vector interface) and dropping the API completely seemed not
>> > >> terribly
>> > >> >>> worth it at the time.
>> > >> >>>
>> > >> >>
>> > >> >> There was even more to it than that.
>> > >> >>
>> > >> >> Colt was under-tested and there have been warts that had to be
>> pulled
>> > >> out
>> > >> >> in much of the code.
>> > >> >>
>> > >> >> But, worse than that, Colt's matrix and vector structure was a real
>> > >> bugger
>> > >> >> to extend or change.  It also had all kinds of cruft where it
>> > >> pretended to
>> > >> >> support matrices of things, but in fact only supported matrices of
>> > >> doubles
>> > >> >> and floats.
>> > >> >>
>> > >> >> So using Colt as it was (and is since it is largely abandoned) was
>> a
>> > >> >> non-starter.
>> > >> >>
>> > >> >> As far as in-memory solvers, we have:
>> > >> >>
>> > >> >> 1) LR decomposition (tested and kinda fast)
>> > >> >>
>> > >> >> 2) Cholesky decomposition (tested)
>> > >> >>
>> > >> >> 3) SVD (tested)
>> > >> >>
>> > >>
>> > >
>> >
>>
> -- 
>   -jake

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Mon, Jun 24, 2013 at 1:24 PM, Jake Mannix <ja...@gmail.com> wrote:

> Yeah, I'm totally on board with a pretty scala DSL on top of some of our
> stuff.
>
> In particular, I've been experimenting with wrapping the
> DistributedRowMatrix
> in a scalding wrapper, so we can do things like
>
> val matrixAsTypedPipe =
>    DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols,
> path, conf))
>
> // e.g. L1 normalize:
>   matrixAsTypedPipe.map((idx, v) : (Int, Vector) => (idx, v.normalize(1)) )
>                                  .write(new
> DistributedRowMatrixPipe(outputPath, conf))
>
> // and anything else you would want to do with a scalding TypedPipe[Int,
> Vector]
>
> Currently I've been doing this with a package structure directly in Mahout,
> in:
>
>    mahout/contrib/scalding
>
> What do people think about having this be something real, after 0.8 goes
> out?  Are
> we ready for contrib modules which fold in diverse external projects in new
> ways?
> Integrating directly with pig and scalding is a bit too wide of a tent for
> Mahout core,
> but putting these integrations in entirely new projects is maybe a bit too
> far away.
>
+1,

I've been putting this into a mahout-math-scala module for the past couple of
days on Mahout itself and keep merging with trunk (here:
https://github.com/dlyubimov/mahout-commits/tree/dev-0.8.x-scala/math-scala),
in case anyone wants to look. Not much to look at at the moment though, I
guess.

Since it is seamlessly compiled by maven and all scala stuff is readily
available in the maven repo, I don't see any operational reason not to include
it in the post-0.8 tree.

However, this on its own is probably not terribly useful until I get a
spark-based distributed solver collection rolled into another module that
depends on it (well, it may turn out to be an ugly battle with our bosses
to contribute it). But I will probably have some straightforward
Hu-Koren-Volinsky spark-based stuff on top of it in a couple of days.

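To give a taste, the per-user update I have in mind looks roughly like this,
in-core, on plain mahout-math types (just a sketch: the names userFactor and
gram, and the shapes, are made up for illustration; nothing like this is in
the module yet):

import org.apache.mahout.math.{DenseMatrix, Matrix, QRDecomposition, Vector}
import scala.collection.JavaConversions._

// y: n x k item factors, gram = y' y precomputed once per sweep,
// r: one user's sparse preference row, confidence c_ui = 1 + alpha * r_ui
def userFactor(y: Matrix, gram: Matrix, r: Vector,
               alpha: Double, lambda: Double): Matrix = {
  val k = y.columnSize
  val a = new DenseMatrix(k, k).assign(gram) // Y'Y + Y'(Cu - I)Y + lambda*I
  val b = new DenseMatrix(k, 1)              // Y' Cu p(u)
  for (e <- r.iterateNonZero) {
    val c = 1.0 + alpha * e.get
    val yi = y.viewRow(e.index)
    for (p <- 0 until k) {
      b.setQuick(p, 0, b.getQuick(p, 0) + c * yi.getQuick(p))
      for (q <- 0 until k)
        a.setQuick(p, q, a.getQuick(p, q) + (c - 1.0) * yi.getQuick(p) * yi.getQuick(q))
    }
  }
  for (p <- 0 until k) a.setQuick(p, p, a.getQuick(p, p) + lambda)
  new QRDecomposition(a).solve(b)            // the k x 1 user factor
}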



>
> On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Dmitriy,
> >
> > This is very pretty.
> >
> >
> >
> >
> > On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Ok, so i was fairly easily able to build some DSL for our matrix
> > > manipulation (similar to breeze) in scala:
> > >
> > > inline matrix or vector:
> > >
> > > val  a = dense((1, 2, 3), (3, 4, 5))
> > >
> > > val b:Vector = (1,2,3)
> > >
> > > block views and assignments (element/row/vector/block/block of row or
> > > vector)
> > >
> > >
> > > a(::, 0)
> > > a(1, ::)
> > > a(0 to 1, 1 to 2)
> > >
> > > assignments
> > >
> > > a(0, ::) :=(3, 5, 7)
> > > a(0, 0 to 1) :=(3, 5)
> > > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> > >
> > > operators
> > >
> > > // hadamard
> > > val c = a * b
> > >  a *= b
> > >
> > > // matrix mul
> > >  val m = a %*% b
> > >
> > > and bunch of other little things like sum, mean, colMeans etc. That
> much
> > is
> > > easy.
> > >
> > > Also stuff like the ones found in breeze along the lines
> > >
> > > val (u,v,s) = svd(a)
> > >
> > > diag ((1,2,3))
> > >
> > > and Cholesky in similar ways.
> > >
> > > I don't have "inline" initialization for sparse things (yet) simply
> > because
> > > i don't need them, but of course all regular java constructors and
> > methods
> > > are retained, all that is just a syntactic sugar in the spirit of DSLs
> in
> > > hope to make things a bit more readable.
> > >
> > > my (very little, and very insignificantly opinionated, really)
> criticism
> > of
> > > Breeze in this context is its inconsistency between dense and sparse
> > > representations, namely, lack of consistent overarching trait(s), so
> that
> > > building structure-agnostic solvers like Mahout's Cholesky solver is
> > > impossible, as well as cross-type matrix use (say, the way i understand
> > it,
> > > it is pretty much impossible to multiply a sparse matrix by a dense
> > matrix).
> > >
> > > I suspect these problems stem from the fact that the authors for
> whatever
> > > reason decided to hardwire dense things with JBlas solvers whereas i
> dont
> > > believe matrix storage structures must be. But these problems do appear
> > to
> > > be serious enough  for me to ignore Breeze for now. If i decide to plug
> > in
> > > jblas dense solvers, i guess i will just have them as yet another
> > top-level
> > > routine interface taking any Matrix, e.g.
> > >
> > > val (u,v,s) = svd(m, jblas=true)
> > >
> > >
> > >
> > > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > > wrote:
> > >
> > > > Thank you.
> > > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com>
> wrote:
> > > >
> > > >> I think that this contract has migrated a bit from the first
> starting
> > > >> point.
> > > >>
> > > >> My feeling is that there is a de facto contract now that the matrix
> > > slice
> > > >> is a single row.
> > > >>
> > > >> Sent from my iPhone
> > > >>
> > > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> > > >>
> > > >> > What does Matrix.iterateAll() contractually do? Practically it
> > seems
> > > >> to be
> > > >> > row-wise iteration for some implementations but it doesn't seem to
> > > >> > contractually state so in the javadoc. What is MatrixSlice if it
> is
> > > >> neither
> > > >> > a row nor a column? How can i tell what exactly it is i am
> iterating
> > > >> over?
> > > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
> > > wrote:
> > > >> >
> > > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
> > jake.mannix@gmail.com>
> > > >> >> wrote:
> > > >> >>
> > > >> >>>> Question #2: which in-core solvers are available for Mahout
> > > >> matrices? I
> > > >> >>>> know there's SSVD, probably Cholesky, is there something else?
> In
> > > >> >>>> particular, i need to be solving linear systems, I guess
> Cholesky
> > > >> should
> > > >> >>> be
> > > >> >>>> equipped enough to do just that?
> > > >> >>>>
> > > >> >>>> Question #3: why did we try to import Colt solvers rather than
> > > >> actually
> > > >> >>>> depend on Colt in the first place? Why did we not accept Colt's
> > > >> sparse
> > > >> >>>> matrices and created native ones instead?
> > > >> >>>>
> > > >> >>>> Colt seems to have a notion of sparse in-core matrices too and
> > seems
> > > >> >> like
> > > >> >>> a
> > > >> >>>> well-rounded solution. However, it doesn't seem like being
> > actively
> > > >> >>>> supported, whereas I know Mahout experienced continued
> > enhancements
> > > >> to
> > > >> >>> the
> > > >> >>>> in-core matrix support.
> > > >> >>>>
> > > >> >>>
> > > >> >>> Colt was totally abandoned, and I talked to the original author
> > and
> > > he
> > > >> >>> blessed its adoption.  When we pulled it in, we found it was
> > > woefully
> > > >> >>> undertested,
> > > >> >>> and tried our best to hook it in with proper tests and use APIs
> > that
> > > >> fit
> > > >> >>> with
> > > >> >>> the use cases we had.  Plus, we already had the start of some
> > linear
> > > >> apis
> > > >> >>> (i.e.
> > > >> >>> the Vector interface) and dropping the API completely seemed not
> > > >> terribly
> > > >> >>> worth it at the time.
> > > >> >>>
> > > >> >>
> > > >> >> There was even more to it than that.
> > > >> >>
> > > >> >> Colt was under-tested and there have been warts that had to be
> > pulled
> > > >> out
> > > >> >> in much of the code.
> > > >> >>
> > > >> >> But, worse than that, Colt's matrix and vector structure was a
> real
> > > >> bugger
> > > >> >> to extend or change.  It also had all kinds of cruft where it
> > > >> pretended to
> > > >> >> support matrices of things, but in fact only supported matrices
> of
> > > >> doubles
> > > >> >> and floats.
> > > >> >>
> > > >> >> So using Colt as it was (and is since it is largely abandoned)
> was
> > a
> > > >> >> non-starter.
> > > >> >>
> > > >> >> As far as in-memory solvers, we have:
> > > >> >>
> > > >> >> 1) LR decomposition (tested and kinda fast)
> > > >> >>
> > > >> >> 2) Cholesky decomposition (tested)
> > > >> >>
> > > >> >> 3) SVD (tested)
> > > >> >>
> > > >>
> > > >
> > >
> >
>
>
>
> --
>
>   -jake
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Jake Mannix <ja...@gmail.com>.
Yeah, I'm totally on board with a pretty scala DSL on top of some of our
stuff.

In particular, I've been experimenting with wrapping the
DistributedRowMatrix
in a scalding wrapper, so we can do things like

val matrixAsTypedPipe =
  DistributedRowMatrixPipe(new DistributedRowMatrix(numRows, numCols, path, conf))

// e.g. L1 normalize:
matrixAsTypedPipe.map { case (idx, v) => (idx, v.normalize(1)) }
                 .write(new DistributedRowMatrixPipe(outputPath, conf))

// and anything else you would want to do with a scalding TypedPipe[(Int, Vector)]

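Any other row-wise op reads the same way, e.g. scaling all rows (scaledPath
here is made up; Vector.times is stock mahout-math):

matrixAsTypedPipe.map { case (idx, v) => (idx, v.times(0.5)) }
                 .write(new DistributedRowMatrixPipe(scaledPath, conf))
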
Currently I've been doing this with a package structure directly in Mahout,
in:

   mahout/contrib/scalding

What do people think about having this be something real, after 0.8 goes
out?  Are
we ready for contrib modules which fold in diverse external projects in new
ways?
Integrating directly with pig and scalding is a bit too wide of a tent for
Mahout core,
but putting these integrations in entirely new projects is maybe a bit too
far away.


On Mon, Jun 24, 2013 at 11:30 AM, Ted Dunning <te...@gmail.com> wrote:

> Dmitriy,
>
> This is very pretty.
>
>
>
>
> On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Ok, so i was fairly easily able to build some DSL for our matrix
> > manipulation (similar to breeze) in scala:
> >
> > inline matrix or vector:
> >
> > val  a = dense((1, 2, 3), (3, 4, 5))
> >
> > val b:Vector = (1,2,3)
> >
> > block views and assignments (element/row/vector/block/block of row or
> > vector)
> >
> >
> > a(::, 0)
> > a(1, ::)
> > a(0 to 1, 1 to 2)
> >
> > assignments
> >
> > a(0, ::) :=(3, 5, 7)
> > a(0, 0 to 1) :=(3, 5)
> > a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
> >
> > operators
> >
> > // hadamard
> > val c = a * b
> >  a *= b
> >
> > // matrix mul
> >  val m = a %*% b
> >
> > and bunch of other little things like sum, mean, colMeans etc. That much
> is
> > easy.
> >
> > Also stuff like the ones found in breeze along the lines
> >
> > val (u,v,s) = svd(a)
> >
> > diag ((1,2,3))
> >
> > and Cholesky in similar ways.
> >
> > I don't have "inline" initialization for sparse things (yet) simply
> because
> > i don't need them, but of course all regular java constructors and
> methods
> > are retained, all that is just a syntactic sugar in the spirit of DSLs in
> > hope to make things a bit more readable.
> >
> > my (very little, and very insignificantly opinionated, really) criticism
> of
> > Breeze in this context is its inconsistency between dense and sparse
> > representations, namely, lack of consistent overarching trait(s), so that
> > building structure-agnostic solvers like Mahout's Cholesky solver is
> > impossible, as well as cross-type matrix use (say, the way i understand
> it,
> > it is pretty much impossible to multiply a sparse matrix by a dense
> matrix).
> >
> > I suspect these problems stem from the fact that the authors for whatever
> > reason decided to hardwire dense things with JBlas solvers whereas i dont
> > believe matrix storage structures must be. But these problems do appear
> to
> > be serious enough  for me to ignore Breeze for now. If i decide to plug
> in
> > jblas dense solvers, i guess i will just have them as yet another
> top-level
> > routine interface taking any Matrix, e.g.
> >
> > val (u,v,s) = svd(m, jblas=true)
> >
> >
> >
> > On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> > wrote:
> >
> > > Thank you.
> > > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com> wrote:
> > >
> > >> I think that this contract has migrated a bit from the first starting
> > >> point.
> > >>
> > >> My feeling is that there is a de facto contract now that the matrix
> > slice
> > >> is a single row.
> > >>
> > >> Sent from my iPhone
> > >>
> > >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
> > >>
> > >> > What does Matrix.iterateAll() contractually do? Practically it
> seems
> > >> to be
> > >> > row-wise iteration for some implementations but it doesn't seem to
> > >> > contractually state so in the javadoc. What is MatrixSlice if it is
> > >> neither
> > >> > a row nor a column? How can i tell what exactly it is i am iterating
> > >> over?
> > >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
> > wrote:
> > >> >
> > >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <
> jake.mannix@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >>>> Question #2: which in-core solvers are available for Mahout
> > >> matrices? I
> > >> >>>> know there's SSVD, probably Cholesky, is there something else? In
> > >> >>>> particular, i need to be solving linear systems, I guess Cholesky
> > >> should
> > >> >>> be
> > >> >>>> equipped enough to do just that?
> > >> >>>>
> > >> >>>> Question #3: why did we try to import Colt solvers rather than
> > >> actually
> > >> >>>> depend on Colt in the first place? Why did we not accept Colt's
> > >> sparse
> > >> >>>> matrices and created native ones instead?
> > >> >>>>
> > >> >>>> Colt seems to have a notion of sparse in-core matrices too and
> seems
> > >> >> like
> > >> >>> a
> > >> >>>> well-rounded solution. However, it doesn't seem like being
> actively
> > >> >>>> supported, whereas I know Mahout experienced continued
> enhancements
> > >> to
> > >> >>> the
> > >> >>>> in-core matrix support.
> > >> >>>>
> > >> >>>
> > >> >>> Colt was totally abandoned, and I talked to the original author
> and
> > he
> > >> >>> blessed its adoption.  When we pulled it in, we found it was
> > woefully
> > >> >>> undertested,
> > >> >>> and tried our best to hook it in with proper tests and use APIs
> that
> > >> fit
> > >> >>> with
> > >> >>> the use cases we had.  Plus, we already had the start of some
> linear
> > >> apis
> > >> >>> (i.e.
> > >> >>> the Vector interface) and dropping the API completely seemed not
> > >> terribly
> > >> >>> worth it at the time.
> > >> >>>
> > >> >>
> > >> >> There was even more to it than that.
> > >> >>
> > >> >> Colt was under-tested and there have been warts that had to be
> pulled
> > >> out
> > >> >> in much of the code.
> > >> >>
> > >> >> But, worse than that, Colt's matrix and vector structure was a real
> > >> bugger
> > >> >> to extend or change.  It also had all kinds of cruft where it
> > >> pretended to
> > >> >> support matrices of things, but in fact only supported matrices of
> > >> doubles
> > >> >> and floats.
> > >> >>
> > >> >> So using Colt as it was (and is since it is largely abandoned) was
> a
> > >> >> non-starter.
> > >> >>
> > >> >> As far as in-memory solvers, we have:
> > >> >>
> > >> >> 1) LR decomposition (tested and kinda fast)
> > >> >>
> > >> >> 2) Cholesky decomposition (tested)
> > >> >>
> > >> >> 3) SVD (tested)
> > >> >>
> > >>
> > >
> >
>



-- 

  -jake

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
Dmitriy,

This is very pretty.




On Mon, Jun 24, 2013 at 6:48 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Ok, so i was fairly easily able to build some DSL for our matrix
> manipulation (similar to breeze) in scala:
>
> inline matrix or vector:
>
> val  a = dense((1, 2, 3), (3, 4, 5))
>
> val b:Vector = (1,2,3)
>
> block views and assignments (element/row/vector/block/block of row or
> vector)
>
>
> a(::, 0)
> a(1, ::)
> a(0 to 1, 1 to 2)
>
> assignments
>
> a(0, ::) :=(3, 5, 7)
> a(0, 0 to 1) :=(3, 5)
> a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))
>
> operators
>
> // hadamard
> val c = a * b
>  a *= b
>
> // matrix mul
>  val m = a %*% b
>
> and bunch of other little things like sum, mean, colMeans etc. That much is
> easy.
>
> Also stuff like the ones found in breeze along the lines
>
> val (u,v,s) = svd(a)
>
> diag ((1,2,3))
>
> and Cholesky in similar ways.
>
> I don't have "inline" initialization for sparse things (yet) simply because
> i don't need them, but of course all regular java constructors and methods
> are retained, all that is just a syntactic sugar in the spirit of DSLs in
> hope to make things a bit more readable.
>
> my (very little, and very insignificantly opinionated, really) criticism of
> Breeze in this context is its inconsistency between dense and sparse
> representations, namely, lack of consistent overarching trait(s), so that
> building structure-agnostic solvers like Mahout's Cholesky solver is
> impossible, as well as cross-type matrix use (say, the way i understand it,
> it is pretty much impossible to multiply a sparse matrix by a dense matrix).
>
> I suspect these problems stem from the fact that the authors for whatever
> reason decided to hardwire dense things with JBlas solvers whereas i dont
> believe matrix storage structures must be. But these problems do appear to
> be serious enough  for me to ignore Breeze for now. If i decide to plug in
> jblas dense solvers, i guess i will just have them as yet another top-level
> routine interface taking any Matrix, e.g.
>
> val (u,v,s) = svd(m, jblas=true)
>
>
>
> On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Thank you.
> > On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com> wrote:
> >
> >> I think that this contract has migrated a bit from the first starting
> >> point.
> >>
> >> My feeling is that there is a de facto contract now that the matrix
> slice
> >> is a single row.
> >>
> >> Sent from my iPhone
> >>
> >> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> >>
> >> > What does Matrix.iterateAll() contractually do? Practically it seems
> >> to be
> >> > row-wise iteration for some implementations but it doesn't seem to
> >> > contractually state so in the javadoc. What is MatrixSlice if it is
> >> neither
> >> > a row nor a column? How can i tell what exactly it is i am iterating
> >> over?
> >> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com>
> wrote:
> >> >
> >> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
> >> >> wrote:
> >> >>
> >> >>>> Question #2: which in-core solvers are available for Mahout
> >> matrices? I
> >> >>>> know there's SSVD, probably Cholesky, is there something else? In
> >> >>>> particular, i need to be solving linear systems, I guess Cholesky
> >> should
> >> >>> be
> >> >>>> equipped enough to do just that?
> >> >>>>
> >> >>>> Question #3: why did we try to import Colt solvers rather than
> >> actually
> >> >>>> depend on Colt in the first place? Why did we not accept Colt's
> >> sparse
> >> >>>> matrices and created native ones instead?
> >> >>>>
> >> >>>> Colt seems to have a notion of sparse in-core matrices too and seems
> >> >> like
> >> >>> a
> >> >>>> well-rounded solution. However, it doesn't seem like being actively
> >> >>>> supported, whereas I know Mahout experienced continued enhancements
> >> to
> >> >>> the
> >> >>>> in-core matrix support.
> >> >>>>
> >> >>>
> >> >>> Colt was totally abandoned, and I talked to the original author and
> he
> >> >>> blessed its adoption.  When we pulled it in, we found it was
> woefully
> >> >>> undertested,
> >> >>> and tried our best to hook it in with proper tests and use APIs that
> >> fit
> >> >>> with
> >> >>> the use cases we had.  Plus, we already had the start of some linear
> >> apis
> >> >>> (i.e.
> >> >>> the Vector interface) and dropping the API completely seemed not
> >> terribly
> >> >>> worth it at the time.
> >> >>>
> >> >>
> >> >> There was even more to it than that.
> >> >>
> >> >> Colt was under-tested and there have been warts that had to be pulled
> >> out
> >> >> in much of the code.
> >> >>
> >> >> But, worse than that, Colt's matrix and vector structure was a real
> >> bugger
> >> >> to extend or change.  It also had all kinds of cruft where it
> >> pretended to
> >> >> support matrices of things, but in fact only supported matrices of
> >> doubles
> >> >> and floats.
> >> >>
> >> >> So using Colt as it was (and is since it is largely abandoned) was a
> >> >> non-starter.
> >> >>
> >> >> As far as in-memory solvers, we have:
> >> >>
> >> >> 1) LR decomposition (tested and kinda fast)
> >> >>
> >> >> 2) Cholesky decomposition (tested)
> >> >>
> >> >> 3) SVD (tested)
> >> >>
> >>
> >
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Ok, so I was fairly easily able to build some DSL for our matrix
manipulation (similar to Breeze) in Scala:

inline matrix or vector:

val  a = dense((1, 2, 3), (3, 4, 5))

val b:Vector = (1,2,3)

block views and assignments (element/row/vector/block/block of row or
vector)


a(::, 0)
a(1, ::)
a(0 to 1, 1 to 2)

assignments

a(0, ::) :=(3, 5, 7)
a(0, 0 to 1) :=(3, 5)
a(0 to 1, 0 to 1) := dense((1, 1), (2, 2.5))

operators

// hadamard
val c = a * b
 a *= b

// matrix mul
 val m = a %*% b

and a bunch of other little things like sum, mean, colMeans etc. That much is
easy.

Also stuff like the ones found in Breeze, along the lines of

val (u,v,s) = svd(a)

diag ((1,2,3))

and Cholesky in similar ways.

I don't have "inline" initialization for sparse things (yet) simply because
i don't need them, but of course all regular java constructors and methods
are retained, all that is just a syntactic sugar in the spirit of DSLs in
hope to make things a bit mroe readable.

My (very little, and very insignificantly opinionated, really) criticism of
Breeze in this context is its inconsistency between dense and sparse
representations, namely, lack of consistent overarching trait(s), so that
building structure-agnostic solvers like Mahout's Cholesky solver is
impossible, as well as cross-type matrix use (say, the way I understand it,
it is pretty much impossible to multiply a sparse matrix by a dense matrix).

I suspect these problems stem from the fact that the authors for whatever
reason decided to hardwire dense things with JBlas solvers, whereas I don't
believe matrix storage structures must be. But these problems do appear to
be serious enough for me to ignore Breeze for now. If I decide to plug in
jblas dense solvers, I guess I will just have them as yet another top-level
routine interface taking any Matrix, e.g.

val (u,v,s) = svd(m, jblas=true)

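E.g., answering my own question #2, solving a small in-core linear system
could then look like this (a sketch: dense(...) is the DSL above,
QRDecomposition is stock mahout-math, and the numbers are made up):

import org.apache.mahout.math.{DenseMatrix, QRDecomposition}

val a = dense((2, 1), (1, 3))                           // 2x2 system matrix
val b = new DenseMatrix(Array(Array(5.0), Array(10.0))) // 2x1 right-hand side
val x = new QRDecomposition(a).solve(b)                 // solves a %*% x = b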


On Sun, Jun 23, 2013 at 7:08 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Thank you.
> On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com> wrote:
>
>> I think that this contract has migrated a bit from the first starting
>> point.
>>
>> My feeling is that there is a de facto contract now that the matrix slice
>> is a single row.
>>
>> Sent from my iPhone
>>
>> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>> > What does Matrix.iterateAll() contractually do? Practically it seems
>> to be
>> > row-wise iteration for some implementations but it doesn't seem to
>> > contractually state so in the javadoc. What is MatrixSlice if it is
>> neither
>> > a row nor a column? How can i tell what exactly it is i am iterating
>> over?
>> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com> wrote:
>> >
>> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
>> >> wrote:
>> >>
>> >>>> Question #2: which in-core solvers are available for Mahout
>> matrices? I
>> >>>> know there's SSVD, probably Cholesky, is there something else? In
>> >>>> particular, i need to be solving linear systems, I guess Cholesky
>> should
>> >>> be
>> >>>> equipped enough to do just that?
>> >>>>
>> >>>> Question #3: why did we try to import Colt solvers rather than
>> actually
>> >>>> depend on Colt in the first place? Why did we not accept Colt's
>> sparse
>> >>>> matrices and created native ones instead?
>> >>>>
>> >>>> Colt seems to have a notion of sparse in-core matrices too and seems
>> >> like
>> >>> a
>> >>>> well-rounded solution. However, it doesn't seem like being actively
>> >>>> supported, whereas I know Mahout experienced continued enhancements
>> to
>> >>> the
>> >>>> in-core matrix support.
>> >>>>
>> >>>
>> >>> Colt was totally abandoned, and I talked to the original author and
>> >>> he blessed its adoption.  When we pulled it in, we found it was
>> >>> woefully undertested, and tried our best to hook it in with proper
>> >>> tests and use APIs that fit with the use cases we had.  Plus, we
>> >>> already had the start of some linear APIs (i.e. the Vector interface)
>> >>> and dropping the API completely seemed not terribly worth it at the
>> >>> time.
>> >>>
>> >>
>> >> There was even more to it than that.
>> >>
>> >> Colt was under-tested and there have been warts that had to be pulled
>> >> out in much of the code.
>> >>
>> >> But, worse than that, Colt's matrix and vector structure was a real
>> >> bugger to extend or change.  It also had all kinds of cruft where it
>> >> pretended to support matrices of things, but in fact only supported
>> >> matrices of doubles and floats.
>> >>
>> >> So using Colt as it was (and is since it is largely abandoned) was a
>> >> non-starter.
>> >>
>> >> As far as in-memory solvers, we have:
>> >>
>> >> 1) LR decomposition (tested and kinda fast)
>> >>
>> >> 2) Cholesky decomposition (tested)
>> >>
>> >> 3) SVD (tested)
>> >>
>>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you.
On Jun 23, 2013 6:16 PM, "Ted Dunning" <te...@gmail.com> wrote:

> I think that this contract has migrated a bit from the first starting
> point.
>
> My feeling is that there is a de facto contract now that the matrix slice
> is a single row.
>
> Sent from my iPhone
>
> On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
> > What does Matrix.iterateAll() contractually do? Practically it seems to
> > be row-wise iteration for some implementations, but it doesn't seem to
> > contractually state so in the javadoc. What is MatrixSlice if it is
> > neither a row nor a column? How can I tell what exactly it is I am
> > iterating over?
> > On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com> wrote:
> >
> >> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
> >> wrote:
> >>
> >>>> Question #2: which in-core solvers are available for Mahout
> >>>> matrices? I know there's SSVD, probably Cholesky, is there something
> >>>> else? In particular, I need to be solving linear systems; I guess
> >>>> Cholesky should be equipped enough to do just that?
> >>>>
> >>>> Question #3: why did we try to import Colt solvers rather than
> >>>> actually depend on Colt in the first place? Why did we not accept
> >>>> Colt's sparse matrices and create native ones instead?
> >>>>
> >>>> Colt seems to have a notion of sparse in-core matrices too and seems
> >>>> like a well-rounded solution. However, it doesn't seem to be actively
> >>>> supported, whereas I know Mahout experienced continued enhancements
> >>>> to the in-core matrix support.
> >>>>
> >>>
> >>> Colt was totally abandoned, and I talked to the original author and
> >>> he blessed its adoption.  When we pulled it in, we found it was
> >>> woefully undertested, and tried our best to hook it in with proper
> >>> tests and use APIs that fit with the use cases we had.  Plus, we
> >>> already had the start of some linear APIs (i.e. the Vector interface)
> >>> and dropping the API completely seemed not terribly worth it at the
> >>> time.
> >>>
> >>
> >> There was even more to it than that.
> >>
> >> Colt was under-tested and there have been warts that had to be pulled
> >> out in much of the code.
> >>
> >> But, worse than that, Colt's matrix and vector structure was a real
> >> bugger to extend or change.  It also had all kinds of cruft where it
> >> pretended to support matrices of things, but in fact only supported
> >> matrices of doubles and floats.
> >>
> >> So using Colt as it was (and is since it is largely abandoned) was a
> >> non-starter.
> >>
> >> As far as in-memory solvers, we have:
> >>
> >> 1) LR decomposition (tested and kinda fast)
> >>
> >> 2) Cholesky decomposition (tested)
> >>
> >> 3) SVD (tested)
> >>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
I think that this contract has migrated a bit from the first starting point.  

My feeling is that there is a de facto contract now that the matrix slice is a single row.  
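
In code terms, that de facto contract reads something like this (a sketch
against the mahout-math API; iterateAll() yields MatrixSlice objects):

import scala.collection.JavaConversions._
import org.apache.mahout.math.DenseMatrix

val m = new DenseMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
// each slice is, de facto, one row: index() is the row number,
// vector() the row contents
for (slice <- m.iterateAll())
  println(slice.index() + " -> " + slice.vector())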

Sent from my iPhone

On Jun 23, 2013, at 16:32, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> What does Matrix.iterateAll() contractually do? Practically it seems to be
> row-wise iteration for some implementations, but it doesn't seem to
> contractually state so in the javadoc. What is MatrixSlice if it is neither
> a row nor a column? How can I tell what exactly it is I am iterating over?
> On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com> wrote:
> 
>> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
>> wrote:
>> 
>>>> Question #2: which in-core solvers are available for Mahout matrices? I
>>>> know there's SSVD, probably Cholesky, is there something else? In
>>>> particular, I need to be solving linear systems; I guess Cholesky should
>>>> be equipped enough to do just that?
>>>> 
>>>> Question #3: why did we try to import Colt solvers rather than actually
>>>> depend on Colt in the first place? Why did we not accept Colt's sparse
>>>> matrices and create native ones instead?
>>>>
>>>> Colt seems to have a notion of sparse in-core matrices too and seems
>>>> like a well-rounded solution. However, it doesn't seem to be actively
>>>> supported, whereas I know Mahout experienced continued enhancements to
>>>> the in-core matrix support.
>>>> 
>>> 
>>> Colt was totally abandoned, and I talked to the original author and he
>>> blessed its adoption.  When we pulled it in, we found it was woefully
>>> undertested, and tried our best to hook it in with proper tests and use
>>> APIs that fit with the use cases we had.  Plus, we already had the start
>>> of some linear APIs (i.e. the Vector interface) and dropping the API
>>> completely seemed not terribly worth it at the time.
>>> 
>> 
>> There was even more to it than that.
>> 
>> Colt was under-tested and there have been warts that had to be pulled out
>> in much of the code.
>> 
>> But, worse than that, Colt's matrix and vector structure was a real bugger
>> to extend or change.  It also had all kinds of cruft where it pretended to
>> support matrices of things, but in fact only supported matrices of doubles
>> and floats.
>> 
>> So using Colt as it was (and is since it is largely abandoned) was a
>> non-starter.
>> 
>> As far as in-memory solvers, we have:
>> 
>> 1) LR decomposition (tested and kinda fast)
>> 
>> 2) Cholesky decomposition (tested)
>> 
>> 3) SVD (tested)
>> 

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
What does Matrix.iterateAll() contractually do? Practically it seems to be
row-wise iteration for some implementations, but it doesn't seem to
contractually state so in the javadoc. What is MatrixSlice if it is neither
a row nor a column? How can I tell what exactly it is I am iterating over?
On Jun 19, 2013 12:21 AM, "Ted Dunning" <te...@gmail.com> wrote:

> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > > Question #2: which in-core solvers are available for Mahout matrices?
> > > I know there's SSVD, probably Cholesky, is there something else? In
> > > particular, I need to be solving linear systems; I guess Cholesky
> > > should be equipped enough to do just that?
> > >
> > > Question #3: why did we try to import Colt solvers rather than
> > > actually depend on Colt in the first place? Why did we not accept
> > > Colt's sparse matrices and create native ones instead?
> > >
> > > Colt seems to have a notion of sparse in-core matrices too and seems
> > > like a well-rounded solution. However, it doesn't seem to be actively
> > > supported, whereas I know Mahout experienced continued enhancements to
> > > the in-core matrix support.
> > >
> >
> > Colt was totally abandoned, and I talked to the original author and he
> > blessed its adoption.  When we pulled it in, we found it was woefully
> > undertested, and tried our best to hook it in with proper tests and use
> > APIs that fit with the use cases we had.  Plus, we already had the start
> > of some linear APIs (i.e. the Vector interface) and dropping the API
> > completely seemed not terribly worth it at the time.
> >
>
> There was even more to it than that.
>
> Colt was under-tested and there have been warts that had to be pulled out
> in much of the code.
>
> But, worse than that, Colt's matrix and vector structure was a real bugger
> to extend or change.  It also had all kinds of cruft where it pretended to
> support matrices of things, but in fact only supported matrices of doubles
> and floats.
>
> So using Colt as it was (and is since it is largely abandoned) was a
> non-starter.
>
> As far as in-memory solvers, we have:
>
> 1) LR decomposition (tested and kinda fast)
>
> 2) Cholesky decomposition (tested)
>
> 3) SVD (tested)
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Ted.


On Wed, Jun 19, 2013 at 12:20 AM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > > Question #2: which in-core solvers are available for Mahout matrices?
> > > I know there's SSVD, probably Cholesky, is there something else? In
> > > particular, I need to be solving linear systems; I guess Cholesky
> > > should be equipped enough to do just that?
> > >
> > > Question #3: why did we try to import Colt solvers rather than
> > > actually depend on Colt in the first place? Why did we not accept
> > > Colt's sparse matrices and create native ones instead?
> > >
> > > Colt seems to have a notion of sparse in-core matrices too and seems
> > > like a well-rounded solution. However, it doesn't seem to be actively
> > > supported, whereas I know Mahout experienced continued enhancements to
> > > the in-core matrix support.
> > >
> >
> > Colt was totally abandoned, and I talked to the original author and he
> > blessed its adoption.  When we pulled it in, we found it was woefully
> > undertested, and tried our best to hook it in with proper tests and use
> > APIs that fit with the use cases we had.  Plus, we already had the start
> > of some linear APIs (i.e. the Vector interface) and dropping the API
> > completely seemed not terribly worth it at the time.
> >
>
> There was even more to it than that.
>
> Colt was under-tested and there have been warts that had to be pulled out
> in much of the code.
>
> But, worse than that, Colt's matrix and vector structure was a real bugger
> to extend or change.  It also had all kinds of cruft where it pretended to
> support matrices of things, but in fact only supported matrices of doubles
> and floats.
>
> So using Colt as it was (and is since it is largely abandoned) was a
> non-starter.
>
> As far as in-memory solvers, we have:
>
> 1) LR decomposition (tested and kinda fast)
>
> 2) Cholesky decomposition (tested)
>
> 3) SVD (tested)
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jun 19, 2013 at 5:29 AM, Jake Mannix <ja...@gmail.com> wrote:

> > Question #2: which in-core solvers are available for Mahout matrices? I
> > know there's SSVD, probably Cholesky, is there something else? In
> > particular, I need to be solving linear systems; I guess Cholesky should
> > be equipped enough to do just that?
> >
> > Question #3: why did we try to import Colt solvers rather than actually
> > depend on Colt in the first place? Why did we not accept Colt's sparse
> > matrices and create native ones instead?
> >
> > Colt seems to have a notion of sparse in-core matrices too and seems
> > like a well-rounded solution. However, it doesn't seem to be actively
> > supported, whereas I know Mahout experienced continued enhancements to
> > the in-core matrix support.
> >
>
> Colt was totally abandoned, and I talked to the original author and he
> blessed its adoption.  When we pulled it in, we found it was woefully
> undertested, and tried our best to hook it in with proper tests and use
> APIs that fit with the use cases we had.  Plus, we already had the start
> of some linear APIs (i.e. the Vector interface) and dropping the API
> completely seemed not terribly worth it at the time.
>

There was even more to it than that.

Colt was under-tested and there have been warts that had to be pulled out
in much of the code.

But, worse than that, Colt's matrix and vector structure was a real bugger
to extend or change.  It also had all kinds of cruft where it pretended to
support matrices of things, but in fact only supported matrices of doubles
and floats.

So using Colt as it was (and is since it is largely abandoned) was a
non-starter.

As far as in-memory solvers, we have:

1) LR decomposition (tested and kinda fast)

2) Cholesky decomposition (tested)

3) SVD (tested)
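
To make the linear-system case above concrete, a minimal sketch against
these decompositions (using QRDecomposition.solve as I recall the
mahout-math API; treat the exact signatures as an assumption):

import org.apache.mahout.math.{DenseMatrix, QRDecomposition}

// solve A x = b; works for any Matrix implementation
val a = new DenseMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))
val b = new DenseMatrix(Array(Array(2.0), Array(1.0)))
val x = new QRDecomposition(a).solve(b)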

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Jake. I suspected as much about Colt.
On Jun 18, 2013 8:30 PM, "Jake Mannix" <ja...@gmail.com> wrote:

> On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > Hello,
> >
> > so i finally got around to actually do it.
> >
> > I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
> > solvers using spark and Bagel /scala.
> >
> > I also want to use in-core solvers that run directly on Mahout.
> >
> > Question #1: which Mahout artifacts are best imported if I don't want
> > to pull in the hadoop dependencies? Is there even such a separation of
> > code? I know mahout-math seems to try to avoid being hadoop specific,
> > but I'm not sure if that is followed strictly.
> >
>
> mahout-math should not depend on hadoop apis at all, if you build it and
> hadoop gets pulled in via maven, then file a ticket, that's a bug.
>
>
> > Question #2: which in-core solvers are available for Mahout matrices? I
> > know there's SSVD, probably Cholesky, is there something else? In
> > particular, I need to be solving linear systems; I guess Cholesky should
> > be equipped enough to do just that?
> >
> > Question #3: why did we try to import Colt solvers rather than actually
> > depend on Colt in the first place? Why did we not accept Colt's sparse
> > matrices and create native ones instead?
> >
> > Colt seems to have a notion of sparse in-core matrices too and seems
> > like a well-rounded solution. However, it doesn't seem to be actively
> > supported, whereas I know Mahout experienced continued enhancements to
> > the in-core matrix support.
> >
>
> Colt was totally abandoned, and I talked to the original author and he
> blessed its adoption.  When we pulled it in, we found it was woefully
> undertested, and tried our best to hook it in with proper tests and use
> APIs that fit with the use cases we had.  Plus, we already had the start
> of some linear APIs (i.e. the Vector interface) and dropping the API
> completely seemed not terribly worth it at the time.
>
>
> >
> > Thanks in advance
> > -Dmitriy
> >
>
>
>
> --
>
>   -jake
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jun 18, 2013 at 6:14 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Hello,
>
> so i finally got around to actually do it.
>
> I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
> solvers using spark and Bagel /scala.
>
> I also want to use in-core solvers that run directly on Mahout.
>
> Question #1: which Mahout artifacts are best imported if I don't want to
> pull in the hadoop dependencies? Is there even such a separation of code?
> I know mahout-math seems to try to avoid being hadoop specific, but I'm
> not sure if that is followed strictly.
>

mahout-math should not depend on hadoop APIs at all; if you build it and
hadoop gets pulled in via maven, then file a ticket, that's a bug.
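
(For reference, pulling in just that artifact from sbt would look roughly
like the line below; the version is my assumption for the current release.)

// build.sbt: depend on mahout-math only, no hadoop
libraryDependencies += "org.apache.mahout" % "mahout-math" % "0.8"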


> Question #2: which in-core solvers are available for Mahout matrices? I
> know there's SSVD, probably Cholesky, is there something else? In
> particular, I need to be solving linear systems; I guess Cholesky should
> be equipped enough to do just that?
>
> Question #3: why did we try to import Colt solvers rather than actually
> depend on Colt in the first place? Why did we not accept Colt's sparse
> matrices and create native ones instead?
>
> Colt seems to have a notion of sparse in-core matrices too and seems like
> a well-rounded solution. However, it doesn't seem to be actively
> supported, whereas I know Mahout experienced continued enhancements to
> the in-core matrix support.
>

Colt was totally abandoned, and I talked to the original author and he
blessed its adoption.  When we pulled it in, we found it was woefully
undertested, and tried our best to hook it in with proper tests and use
APIs that fit with the use cases we had.  Plus, we already had the start of
some linear APIs (i.e. the Vector interface) and dropping the API
completely seemed not terribly worth it at the time.


>
> Thanks in advance
> -Dmitriy
>



-- 

  -jake

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Nick, thank you for the hints and pointers! I will check out Breeze.

Let me take a look.

As far as collaboration, unfortunately I think the only way to go for me
and my employer is to cut it, test it, and then (after long negotiations
with the CEO) donate it if accepted. They are OK with my small
participation in OSS, but I no longer have permission to outwardly start
fundamentally new OSS projects on my paid time. (They seemed to be
OSS-friendly when they were hiring me, but once I was in, the situation
seemed to have reversed itself.)


On Wed, Jun 19, 2013 at 3:50 AM, Nick Pentreath <ni...@gmail.com> wrote:

> Hi Dmitriy
>
> I'd be interested to look at helping with this potentially (time
> permitting).
>
> I've recently been working on a port of Mahout's ALS implementation to
> Spark. I spent a bit of time thinking about how much of mahout-math to use.
>
> For now I found that using the Breeze linear algebra library I could get
> what I needed, i.e. DenseVector, SparseVector, DenseMatrix, all with
> in-memory multiply and solve that is backed by JBLAS (so very quick if you
> have the native libraries active). It comes with very nice "Matlab-like"
> syntax in Scala. So it ended up being a bit of a rewrite rather than a port
> of the Mahout code.
>
> The sparse matrix support is however a bit... well, sparse :) There is a
> CSC matrix and some operations but Sparse SVD is not there, and the solvers
> I think are not there just yet (in-core).
>
> But of course the linear algebra objects are not easily usable from Java
> due to the syntax and the heavy use of implicits. So for a fully functional
> Java API version that can use the vectors/matrices directly, the options
> would be to create a Java bridge to the Breeze vectors/matrices, or to
> instead look at using mahout-math to drive the linear algebra. In that case
> the Scala syntax would not be as nice, but some sugar can be added again
> using implicits for common operations (I've tested this a bit and it can
> work and probably be made reasonably efficient if copies are avoided in the
> implicit conversion).
>
> Anyway, I'd be happy to offer assistance.
>
> Nick
>
>
> On Wed, Jun 19, 2013 at 8:09 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> > Let us know how it went, I'm pretty interested to see how well our stuff
> > integrates with Spark, especially since Spark is in the process of
> > joining Apache.
> >
> > -sebastian
> >
> > On 19.06.2013 03:14, Dmitriy Lyubimov wrote:
> > > Hello,
> > >
> > > so i finally got around to actually do it.
> > >
> > > I want to get Mahout sparse vectors and matrices (DRMs) and rebuild
> some
> > > solvers using spark and Bagel /scala.
> > >
> > > I also want to use in-core solvers that run directly on Mahout.
> > >
> > > Question #1: which Mahout artifacts are best imported if I don't want
> > > to pull in the hadoop dependencies? Is there even such a separation of
> > > code? I know mahout-math seems to try to avoid being hadoop specific,
> > > but I'm not sure if that is followed strictly.
> > >
> > > Question #2: which in-core solvers are available for Mahout matrices?
> > > I know there's SSVD, probably Cholesky, is there something else? In
> > > particular, I need to be solving linear systems; I guess Cholesky
> > > should be equipped enough to do just that?
> > >
> > > Question #3: why did we try to import Colt solvers rather than
> > > actually depend on Colt in the first place? Why did we not accept
> > > Colt's sparse matrices and create native ones instead?
> > >
> > > Colt seems to have a notion of sparse in-core matrices too and seems
> > > like a well-rounded solution. However, it doesn't seem to be actively
> > > supported, whereas I know Mahout experienced continued enhancements to
> > > the in-core matrix support.
> > >
> > > Thanks in advance
> > > -Dmitriy
> > >
> >
> >
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Thank you, Sebastian.

Actually, ALS flavours are indeed one of my first pragmatic goals -- I have
also done a few customizations for my employer -- so I will probably pursue
those customizations first. In particular, I do use Koren-Volinsky
confidence weighting, but assume we still work with sparse observations,
and therefore the sparse algebra of ALS-WR still applies. I provide a
fold-in routine for users with fewer than N observations, and for brand-new
users, thus adding an incremental approach to learning. I also spend a lot
of time on adaptive validation of weights and regularization (which is why
my R prototypes are no longer sufficient here; actually, my prototype
doesn't take the load of a midsize customer anymore).
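
(For the record, the per-user system such a fold-in solves under
Koren-Volinsky confidence weighting is, in the usual notation,
x_u = (Y' C_u Y + lambda * I)^-1 Y' C_u p_u, where Y is the item-factor
matrix, C_u the diagonal confidence matrix for user u, and p_u the
preference vector -- my reading of the fold-in step, not a quote of the
actual code.)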


On Wed, Jun 19, 2013 at 3:54 AM, Sebastian Schelter <ss...@apache.org> wrote:

> I have a JBlas version of our ALS solving code lying around [1], feel
> free to use it. Would also be interested to see the Spark port.
>
> -sebastian
>
>
> [1]
>
> https://github.com/sscdotopen/mahout-als/blob/jblas/math/src/main/java/org/apache/mahout/math/als/JBlasAlternatingLeastSquaresSolver.java
>
> On 19.06.2013 12:50, Nick Pentreath wrote:
> > Hi Dmitriy
> >
> > I'd be interested to look at helping with this potentially (time
> > permitting).
> >
> > I've recently been working on a port of Mahout's ALS implementation to
> > Spark. I spent a bit of time thinking about how much of mahout-math to
> use.
> >
> > For now I found that using the Breeze linear algebra library I could get
> > what I needed, i.e. DenseVector, SparseVector, DenseMatrix, all with
> > in-memory multiply and solve that is backed by JBLAS (so very quick if
> you
> > have the native libraries active). It comes with very nice "Matlab-like"
> > syntax in Scala. So it ended up being a bit of a rewrite rather than a
> port
> > of the Mahout code.
> >
> > The sparse matrix support is however a bit... well, sparse :) There is a
> > CSC matrix and some operations but Sparse SVD is not there, and the
> solvers
> > I think are not there just yet (in-core).
> >
> > But of course the linear algebra objects are not easily usable from Java
> > due to the syntax and the heavy use of implicits. So for a fully
> functional
> > Java API version that can use the vectors/matrices directly, the options
> > would be to create a Java bridge to the Breeze vectors/matrices, or to
> > instead look at using mahout-math to drive the linear algebra. In that
> case
> > the Scala syntax would not be as nice, but some sugar can be added again
> > using implicits for common operations (I've tested this a bit and it can
> > work and probably be made reasonably efficient if copies are avoided in
> the
> > implicit conversion).
> >
> > Anyway, I'd be happy to offer assistance.
> >
> > Nick
> >
> >
> > On Wed, Jun 19, 2013 at 8:09 AM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >
> >> Let us know how it went, I'm pretty interested to see how well our stuff
> >> integrates with Spark, especially since Spark is in the process of
> >> joining Apache.
> >>
> >> -sebastian
> >>
> >> On 19.06.2013 03:14, Dmitriy Lyubimov wrote:
> >>> Hello,
> >>>
> >>> so i finally got around to actually do it.
> >>>
> >>> I want to get Mahout sparse vectors and matrices (DRMs) and rebuild
> some
> >>> solvers using spark and Bagel /scala.
> >>>
> >>> I also want to use in-core solvers that run directly on Mahout.
> >>>
> >>> Question #1: which Mahout artifacts are best imported if I don't want
> >>> to pull in the hadoop dependencies? Is there even such a separation of
> >>> code? I know mahout-math seems to try to avoid being hadoop specific,
> >>> but I'm not sure if that is followed strictly.
> >>>
> >>> Question #2: which in-core solvers are available for Mahout matrices?
> >>> I know there's SSVD, probably Cholesky, is there something else? In
> >>> particular, I need to be solving linear systems; I guess Cholesky
> >>> should be equipped enough to do just that?
> >>>
> >>> Question #3: why did we try to import Colt solvers rather than
> >>> actually depend on Colt in the first place? Why did we not accept
> >>> Colt's sparse matrices and create native ones instead?
> >>>
> >>> Colt seems to have a notion of sparse in-core matrices too and seems
> >>> like a well-rounded solution. However, it doesn't seem to be actively
> >>> supported, whereas I know Mahout experienced continued enhancements
> >>> to the in-core matrix support.
> >>>
> >>> Thanks in advance
> >>> -Dmitriy
> >>>
> >>
> >>
> >
>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Sebastian Schelter <ss...@apache.org>.
I have a JBlas version of our ALS solving code lying around [1], feel
free to use it. Would also be interested to see the Spark port.

-sebastian


[1]
https://github.com/sscdotopen/mahout-als/blob/jblas/math/src/main/java/org/apache/mahout/math/als/JBlasAlternatingLeastSquaresSolver.java
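
(For flavor, the jblas entry point for the SPD solves involved is
org.jblas.Solve; a minimal sketch, independent of the linked code:)

import org.jblas.{DoubleMatrix, Solve}

// solve A x = b for symmetric positive definite A (LAPACK dposv underneath)
val a = new DoubleMatrix(Array(Array(4.0, 1.0), Array(1.0, 3.0)))
val b = new DoubleMatrix(Array(2.0, 1.0))
val x = Solve.solvePositive(a, b)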

On 19.06.2013 12:50, Nick Pentreath wrote:
> Hi Dmitriy
> 
> I'd be interested to look at helping with this potentially (time
> permitting).
> 
> I've recently been working on a port of Mahout's ALS implementation to
> Spark. I spent a bit of time thinking about how much of mahout-math to use.
> 
> For now I found that using the Breeze linear algebra library I could get
> what I needed, i.e. DenseVector, SparseVector, DenseMatrix, all with
> in-memory multiply and solve that is backed by JBLAS (so very quick if you
> have the native libraries active). It comes with very nice "Matlab-like"
> syntax in Scala. So it ended up being a bit of a rewrite rather than a port
> of the Mahout code.
> 
> The sparse matrix support is however a bit... well, sparse :) There is a
> CSC matrix and some operations but Sparse SVD is not there, and the solvers
> I think are not there just yet (in-core).
> 
> But of course the linear algebra objects are not easily usable from Java
> due to the syntax and the heavy use of implicits. So for a fully functional
> Java API version that can use the vectors/matrices directly, the options
> would be to create a Java bridge to the Breeze vectors/matrices, or to
> instead look at using mahout-math to drive the linear algebra. In that case
> the Scala syntax would not be as nice, but some sugar can be added again
> using implicits for common operations (I've tested this a bit and it can
> work and probably be made reasonably efficient if copies are avoided in the
> implicit conversion).
> 
> Anyway, I'd be happy to offer assistance.
> 
> Nick
> 
> 
> On Wed, Jun 19, 2013 at 8:09 AM, Sebastian Schelter <ss...@apache.org> wrote:
> 
>> Let us know how it went, I'm pretty interested to see how well our stuff
>> integrates with Spark, especially since Spark is in the process of
>> joining Apache.
>>
>> -sebastian
>>
>> On 19.06.2013 03:14, Dmitriy Lyubimov wrote:
>>> Hello,
>>>
>>> so i finally got around to actually do it.
>>>
>>> I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
>>> solvers using spark and Bagel /scala.
>>>
>>> I also want to use in-core solvers that run directly on Mahout.
>>>
>>> Question #1: which Mahout artifacts are best imported if I don't want
>>> to pull in the hadoop dependencies? Is there even such a separation of
>>> code? I know mahout-math seems to try to avoid being hadoop specific,
>>> but I'm not sure if that is followed strictly.
>>>
>>> Question #2: which in-core solvers are available for Mahout matrices? I
>>> know there's SSVD, probably Cholesky, is there something else? In
>>> particular, I need to be solving linear systems; I guess Cholesky
>>> should be equipped enough to do just that?
>>>
>>> Question #3: why did we try to import Colt solvers rather than actually
>>> depend on Colt in the first place? Why did we not accept Colt's sparse
>>> matrices and create native ones instead?
>>>
>>> Colt seems to have a notion of sparse in-core matrices too and seems
>>> like a well-rounded solution. However, it doesn't seem to be actively
>>> supported, whereas I know Mahout experienced continued enhancements to
>>> the in-core matrix support.
>>>
>>> Thanks in advance
>>> -Dmitriy
>>>
>>
>>
> 


Re: Mahout vectors/matrices/solvers on spark

Posted by Nick Pentreath <ni...@gmail.com>.
Hi Dmitriy

I'd be interested to look at helping with this potentially (time
permitting).

I've recently been working on a port of Mahout's ALS implementation to
Spark. I spent a bit of time thinking about how much of mahout-math to use.

For now I found that using the Breeze linear algebra library I could get
what I needed, i.e. DenseVector, SparseVector, DenseMatrix, all with
in-memory multiply and solve that is backed by JBLAS (so very quick if you
have the native libraries active). It comes with very nice "Matlab-like"
syntax in Scala. So it ended up being a bit of a rewrite rather than a port
of the Mahout code.
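
(As a taste of that syntax -- a minimal sketch, assuming Breeze's
DenseMatrix/DenseVector constructors and the \ solve operator:)

import breeze.linalg.{DenseMatrix, DenseVector}

// Matlab-like: build a matrix from rows, then solve A x = b
val a = DenseMatrix((4.0, 1.0), (1.0, 3.0))
val b = DenseVector(2.0, 1.0)
val x = a \ b // backed by JBLAS when the native libs are present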

The sparse matrix support is, however, a bit... well, sparse :) There is a
CSC matrix and some operations, but sparse SVD is not there, and the
in-core solvers, I think, are not there just yet.

But of course the linear algebra objects are not easily usable from Java
due to the syntax and the heavy use of implicits. So for a fully functional
Java API version that can use the vectors/matrices directly, the options
would be to create a Java bridge to the Breeze vectors/matrices, or to
instead look at using mahout-math to drive the linear algebra. In that case
the Scala syntax would not be as nice, but some sugar can be added again
using implicits for common operations (I've tested this a bit and it can
work and probably be made reasonably efficient if copies are avoided in the
implicit conversion).
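
(A minimal sketch of that sugar idea over mahout-math -- the wrapper and
implicit conversion are hypothetical; only Vector.plus/dot are the real
API:)

import org.apache.mahout.math.Vector

object MahoutSugar {
  // pimp-my-library wrapper over the Mahout Vector interface
  class RichVector(val v: Vector) {
    def +(other: Vector): Vector = v.plus(other)
    def dot(other: Vector): Double = v.dot(other)
  }
  implicit def richVector(v: Vector): RichVector = new RichVector(v)
}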

Anyway, I'd be happy to offer assistance.

Nick


On Wed, Jun 19, 2013 at 8:09 AM, Sebastian Schelter <ss...@apache.org> wrote:

> Let us know how it went, I'm pretty interested to see how well our stuff
> integrates with Spark, especially since Spark is in the process of
> joining Apache.
>
> -sebastian
>
> On 19.06.2013 03:14, Dmitriy Lyubimov wrote:
> > Hello,
> >
> > so i finally got around to actually do it.
> >
> > I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
> > solvers using spark and Bagel /scala.
> >
> > I also want to use in-core solvers that run directly on Mahout.
> >
> > Question #1: which Mahout artifacts are best imported if I don't want
> > to pull in the hadoop dependencies? Is there even such a separation of
> > code? I know mahout-math seems to try to avoid being hadoop specific,
> > but I'm not sure if that is followed strictly.
> >
> > Question #2: which in-core solvers are available for Mahout matrices? I
> > know there's SSVD, probably Cholesky, is there something else? In
> > particular, I need to be solving linear systems; I guess Cholesky
> > should be equipped enough to do just that?
> >
> > Question #3: why did we try to import Colt solvers rather than actually
> > depend on Colt in the first place? Why did we not accept Colt's sparse
> > matrices and create native ones instead?
> >
> > Colt seems to have a notion of sparse in-core matrices too and seems
> > like a well-rounded solution. However, it doesn't seem to be actively
> > supported, whereas I know Mahout experienced continued enhancements to
> > the in-core matrix support.
> >
> > Thanks in advance
> > -Dmitriy
> >
>
>

Re: Mahout vectors/matrices/solvers on spark

Posted by Sebastian Schelter <ss...@apache.org>.
Let us know how it went, I'm pretty interested to see how well our stuff
integrates with Spark, especially since Spark is in the process of
joining Apache.

-sebastian

On 19.06.2013 03:14, Dmitriy Lyubimov wrote:
> Hello,
> 
> so i finally got around to actually do it.
> 
> I want to get Mahout sparse vectors and matrices (DRMs) and rebuild some
> solvers using spark and Bagel /scala.
> 
> I also want to use in-core solvers that run directly on Mahout.
> 
> Question #1: which Mahout artifacts are best imported if I don't want to
> pull in the hadoop dependencies? Is there even such a separation of code?
> I know mahout-math seems to try to avoid being hadoop specific, but I'm
> not sure if that is followed strictly.
> 
> Question #2: which in-core solvers are available for Mahout matrices? I
> know there's SSVD, probably Cholesky, is there something else? In
> particular, I need to be solving linear systems; I guess Cholesky should
> be equipped enough to do just that?
> 
> Question #3: why did we try to import Colt solvers rather than actually
> depend on Colt in the first place? Why did we not accept Colt's sparse
> matrices and create native ones instead?
> 
> Colt seems to have a notion of sparse in-core matrices too and seems like
> a well-rounded solution. However, it doesn't seem to be actively
> supported, whereas I know Mahout experienced continued enhancements to
> the in-core matrix support.
> 
> Thanks in advance
> -Dmitriy
>