Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2013/01/13 08:58:47 UTC

Re: scalding and mahout vector

This might be more appropriate on the Mahout list.  I have copied that list
in order to gain the largest audience for the answers.

It is an absolute requirement in Mahout to have multiple vector
implementations.  It is also a requirement that the math library not depend
on Hadoop.

A third absolute requirement in Mahout is that very simple Java programming
suffice for working with Vectors of many types as well as Matrix values.

In order to meet these requirements and allow the simplest form of
map-reduce programming, we implemented a class VectorWritable which will
wrap any kind of vector as a writable object.  You can retrieve the
underlying vector from the VectorWritable and there is some discussion about
making VW implement the Vector interface as well.
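
In code, the wrap/unwrap round trip looks like this (a minimal sketch; `VectorWritable` exposes the wrapped vector via its get/set accessors):

```scala
import org.apache.mahout.math.{DenseVector, Vector, VectorWritable}

// Wrap a Mahout Vector so Hadoop can move it around as a Writable.
val v: Vector = new DenseVector(Array(1.0, 2.0, 3.0))
val vw = new VectorWritable(v)   // wrap before emitting
val back: Vector = vw.get()      // unwrap on the receiving side
```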

If your code returns a VectorWritable, then Hadoop should be able to
serialize it trivially.

If your code returns a Vector, however, it will not natively be
serializable.  It should be possible to inject a single registration into
Kryo, however, that will understand how to serialize Vectors using the
VectorWritable infrastructure.
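
A sketch of such a registration, assuming a Kryo 2.x-style `Serializer` API (the class name `MahoutVectorSerializer` is hypothetical; Kryo's `Output`/`Input` extend `OutputStream`/`InputStream`, so they can be handed to `VectorWritable`'s `write`/`readFields`):

```scala
import java.io.{DataInputStream, DataOutputStream}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.mahout.math.{Vector, VectorWritable}

// Hypothetical serializer that delegates Kryo (de)serialization of any
// Mahout Vector to the VectorWritable infrastructure.
class MahoutVectorSerializer extends Serializer[Vector] {
  override def write(kryo: Kryo, out: Output, v: Vector): Unit =
    new VectorWritable(v).write(new DataOutputStream(out))

  override def read(kryo: Kryo, in: Input, cls: Class[Vector]): Vector = {
    val vw = new VectorWritable()
    vw.readFields(new DataInputStream(in))
    vw.get()
  }
}

// Register once per concrete Vector class you expect to see, e.g.:
// kryo.register(classOf[org.apache.mahout.math.DenseVector], new MahoutVectorSerializer)
```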


On Sat, Jan 12, 2013 at 11:49 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i would like to have some mahout vectors flow through a scalding job. i
> thought at first that this should be easy since the mahout vector is a
> writable so if i put it in the tuple all will be fine. but then i realized
> mahout did this thing where they split up the vector in a whole bunch of
> classes and interfaces: they have the Vector interface, implementations
> such as DenseVector and SequentialAccessSparseVector, and then the class
> VectorWritable which takes a Vector and turns it into a Writable. argh. so
> now if i have for example a DenseVector then i think it will not get
> serialized as a Writable and then kryo will attempt to serialize it
> instead, which is not what i want. any ideas for an elegant solution (i
> wish a simple scala implicit conversion would do the trick!). should i add
> a custom hadoop Serializer to catch these (seems ugly)?
>
> --
> You received this message because you are subscribed to the Google Groups
> "cascading-user" group.
> To post to this group, send email to cascading-user@googlegroups.com.
> To unsubscribe from this group, send email to
> cascading-user+unsubscribe@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/cascading-user?hl=en.
>

Re: scalding and mahout vector

Posted by Jake Mannix <ja...@gmail.com>.
I think the key is, as Ted says, every place where you want to emit a
writable form of vector, to wrap it in a VectorWritable.

In Scala terms, there are certainly two implicit conversions (a, ahem,
bijection in fact) between Vector and VectorWritable, via the get/set
encapsulation of the latter around the former.
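
That pair of conversions is a two-liner (a sketch; the implicit names are made up, and in a Scalding job you would still want the Kryo registration Ted describes for vectors that escape into tuples):

```scala
import org.apache.mahout.math.{Vector, VectorWritable}

// The two implicit conversions: together they form a bijection between
// Vector and VectorWritable (up to the identity of the wrapper object).
implicit def vectorToWritable(v: Vector): VectorWritable = new VectorWritable(v)
implicit def writableToVector(vw: VectorWritable): Vector = vw.get()
```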



-- 

  -jake