Posted to dev@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2010/01/05 10:18:09 UTC

Writables and Inheritance

Hey gang,

  I'm working on getting MAHOUT-206 (splitting SparseVector into the two
primary specialized forms - map-based and array-based) and MAHOUT-205
(pulling Writable out of the math package) finished up, but in digging into
the unit tests and usages of Vectors as Writable thingees, I've come upon
either some annoyance at how Hadoop deals with serialization, or else a
misunderstanding of how one goes about using Writables properly when you
want to play nicely with inheritance and clean abstraction.

  If you've got a SequenceFile, you'd like to have it be a
SequenceFile<IntWritable, Vector>, not some fixed subclass, because you'd
like the serialization technique and storage to be decoupled from the
algorithms using such a set of data (for example, your algorithm shouldn't
care whether there's a SparseVector or a DenseVector - it may be optimal for
one case over the other, but that's another story).  What is the right way
to do this with Writables?

  From what I can tell, in SequenceFile.Writer#append(Object key, Object
value) (why on earth is it taking Objects?  shouldn't these be Writables?),
it does an explicit check of key.getClass() == this.keyClass and
value.getClass() == this.valueClass, which won't do any subclass matching
(and so will fail if value.getClass() is DenseVector.class, and valueClass
is SparseVector.class, or just Vector.class).
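
  (To make that concrete, the check is roughly the following - I'm
paraphrasing from memory here, not quoting the Hadoop source exactly:

    // inside SequenceFile.Writer#append(Object key, Object value)
    if (key.getClass() != keyClass) {
      throw new IOException("wrong key class: " + key.getClass()
          + " is not " + keyClass);
    }
    if (value.getClass() != valClass) {
      throw new IOException("wrong value class: " + value.getClass()
          + " is not " + valClass);
    }

so with valueClass == Vector.class or SparseVector.class, appending a
DenseVector blows up even though it's a perfectly good Vector.)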

  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
would be to have one overall VectorWritable class, which can
serialize/deserialize all Vector implementations.  Right?  This is how I've
in general looked at Writables - they tend very much to be very loosely
object oriented, in the sense that they are typically just wrappers around
some data object, and provide marshalling/unmarshalling capabilities for
said object (but the Writable itself rarely (ever?) actually also implements
any useful interface the held object implements - when you want said object,
you Writable.get() on it to fetch the inner guy).

  Of course, while writing out generically classed vectors is easy without
knowing internals (numNonDefaultElements() == size() tells you whether it's
sparse or not, and in fact you could optimize this further by saying that if
numNonDefaultElements is greater than about size()/2, then switch to a Dense
representation), reading in and choosing which vector class to instantiate
is a pain - you need to either move all the write(DataOutput) and
readFields(DataInput) methods from the vector implementations into the new
VectorWritable, and have a big switch statement deciding which one to call,
or else you need Writable subclasses of each and every concrete vector
implementation which has said methods (and go back and make all nontransient
fields protected instead of private, so the subclass can properly serialize
out said data) - and even this has the big switch effectively, somewhere.
 My default feeling is the latter technique is the way to go, but it still
looks a little ugly.
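
  Just so we're all looking at the same thing, here's roughly what I mean by
the "one VectorWritable with the big switch" option - an untested sketch, with
the type bytes invented for illustration, and using the size() / get() /
numNonDefaultElements() names I've been using above (the real method names in
trunk may differ):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    // Vector, DenseVector, SparseVector imported from wherever they
    // end up living after MAHOUT-205

    public class VectorWritable implements Writable {
      private static final byte DENSE = 0;
      private static final byte SPARSE = 1;

      private Vector vector;

      public Vector get() { return vector; }
      public void set(Vector vector) { this.vector = vector; }

      @Override
      public void write(DataOutput out) throws IOException {
        int size = vector.size();
        int nonDefault = vector.numNonDefaultElements();
        // the heuristic from above: if mostly full, write it densely
        boolean dense = nonDefault > size / 2;
        out.writeByte(dense ? DENSE : SPARSE);
        out.writeInt(size);
        if (dense) {
          for (int i = 0; i < size; i++) {
            out.writeDouble(vector.get(i));
          }
        } else {
          out.writeInt(nonDefault);
          // naive scan; a real impl would iterate the non-zero entries directly
          for (int i = 0; i < size; i++) {
            double v = vector.get(i);
            if (v != 0.0) {
              out.writeInt(i);
              out.writeDouble(v);
            }
          }
        }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        byte type = in.readByte();  // the "big switch", but it lives in one place
        int size = in.readInt();
        switch (type) {
          case DENSE: {
            Vector v = new DenseVector(size);
            for (int i = 0; i < size; i++) {
              v.set(i, in.readDouble());
            }
            vector = v;
            break;
          }
          case SPARSE: {
            Vector v = new SparseVector(size);
            int n = in.readInt();
            for (int k = 0; k < n; k++) {
              v.set(in.readInt(), in.readDouble());
            }
            vector = v;
            break;
          }
          default:
            throw new IOException("unknown vector type byte: " + type);
        }
      }
    }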

  Or is there a better way to do this?  What I really think is necessary, as
an end-goal, is for us to be able to spit out int + Vector key-value pairs
from mappers and reducers, and not need to know which kind they are in the
mapper or reducer (because you may get them from doing
someMatrix.times(someVector), in which case all you know is that you have a
Vector), as well as do the other direction (so you can read a
SequenceFile<IntWritable, VectorWritable> and just pop out some Vector
instances).
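
  (Where "pop out some Vector instances" means the consuming code could be as
dumb as this - the path and the VectorWritable sketch above are just
placeholders, and I've left out the obvious imports:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/some/output/vectors/part-00000");  // hypothetical
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    while (reader.next(key, value)) {
      Vector v = value.get();  // might be sparse, might be dense - we don't care
      // ... do the math ...
    }
    reader.close();

without ever naming a concrete Vector class.)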

  -jake

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 3:16 AM, Grant Ingersoll <gs...@apache.org> wrote:

> FWIW,
> http://hadoop.markmail.org/message/jr4cbem46erlhgzu?q=gsingers+from:%22Grant+Ingersoll%22 got no response.
>

Wow, yes you've been thinking of exactly this problem.  And not a response.
 Lame lame lame.  Oh well.


> Totally agree on everything, so if you can make it work, +1!  I think up
> until now, we basically took the "let's punt" approach.  I definitely would
> like to remove the need for a user to, in 99% of the cases, ever think about
> which vector implementation they are using.
>

Yeah, especially once you have two different SparseVector impls (which is what
I was actually working on when I saw this mess pop up).


> Perhaps it might be worth delving into Hadoop at a bit lower level and see
> if there is anything that can be done there.  Of course, that could be a
> rat's nest.
>

Yes, the other option is to write our own
o.a.hadoop.io.serializer.Serializer/Deserializer, which can also implement
Configurable (and thus make fancy choices that the user wants to do if this
is needed/wanted).  This might be another place to stick this code.  The
issue with this is that it requires people writing hadoop jobs to set one
more Configuration parameter - "io.serializations", as opposed to the logic
living in the one VectorWritable class.  Of course, that is also the
*benefit*, in that it allows alternative Serializer/Deserializers to be
written and wired in, and Mahout users could define their own without
monkeying with a VectorWritable subclass or whatnot.  That seems like a
really good reason to do it this way, actually.
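
For the record, the shape of that would be something like the following - the
class and package names are made up, the o.a.hadoop.io.serializer API is from
memory, and it leans on the VectorWritable sketch from my first mail for the
actual bytes, so treat it as a sketch:

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.io.serializer.Deserializer;
    import org.apache.hadoop.io.serializer.Serialization;
    import org.apache.hadoop.io.serializer.Serializer;

    public class VectorSerialization implements Serialization<Vector> {
      @Override
      public boolean accept(Class<?> c) {
        // unlike the exact-class check in SequenceFile.Writer, subclasses are fine
        return Vector.class.isAssignableFrom(c);
      }

      @Override
      public Serializer<Vector> getSerializer(Class<Vector> c) {
        return new Serializer<Vector>() {
          private DataOutputStream out;
          public void open(OutputStream os) { out = new DataOutputStream(os); }
          public void serialize(Vector v) throws IOException {
            VectorWritable w = new VectorWritable();
            w.set(v);
            w.write(out);  // reuse the VectorWritable wire format
          }
          public void close() throws IOException { out.close(); }
        };
      }

      @Override
      public Deserializer<Vector> getDeserializer(Class<Vector> c) {
        return new Deserializer<Vector>() {
          private DataInputStream in;
          public void open(InputStream is) { in = new DataInputStream(is); }
          public Vector deserialize(Vector reuse) throws IOException {
            VectorWritable w = new VectorWritable();
            w.readFields(in);
            return w.get();
          }
          public void close() throws IOException { in.close(); }
        };
      }
    }

    // ...and the one extra bit of job setup the user has to remember,
    // somewhere in the driver where conf is the job's Configuration:
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.mahout.math.VectorSerialization");  // hypothetical package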

  -jake


>
> -Grant
>
>

Re: Writables and Inheritance

Posted by Grant Ingersoll <gs...@apache.org>.
FWIW, http://hadoop.markmail.org/message/jr4cbem46erlhgzu?q=gsingers+from:%22Grant+Ingersoll%22 got no response.

Totally agree on everything, so if you can make it work, +1!  I think up until now, we basically took the "let's punt" approach.  I definitely would like to remove the need for a user to, in 99% of the cases, ever think about which vector implementation they are using.

Perhaps it might be worth delving into Hadoop at a bit lower level and see if there is anything that can be done there.  Of course, that could be a rat's nest.

-Grant




Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 12:31 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> Of course, now I'm having a real headache with the LDA code - it really
> doesn't
> want me to refactor it to split Writable off of Vector... ugh.
>
>
I guess I need to be more invasive, though.  I'll just dig in and swap out
Vector for VectorWritable everywhere I find it in there, because the LDA code
is very tied to Writable as it is.

  -jake


Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 12:26 PM, Ted Dunning <te...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 1:18 AM, Jake Mannix <ja...@gmail.com> wrote:
>
> >  Or is there a better way to do this?
>
>
> Yes.
>
> http://hadoop.apache.org/avro/
>
>
> The problems you are having are exactly why Hadoop will be (already is)
> switching to Avro.  It may be that we can switch now and that your writable
> wrapper could use Avro under the covers.


Ok, good.  Why there need to be so many new wheels invented is beyond me,
though (Avro, Thrift, Protobuf, etc.).

Of course, now I'm having a real headache with the LDA code - it really
doesn't
want me to refactor it to split Writable off of Vector... ugh.

  -jake

Re: Writables and Inheritance

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Jan 5, 2010 at 1:18 AM, Jake Mannix <ja...@gmail.com> wrote:

>  Or is there a better way to do this?


Yes.

http://hadoop.apache.org/avro/


The problems you are having are exactly why Hadoop will be (already is)
switching to Avro.  It may be that we can switch now and that your writable
wrapper could use Avro under the covers.


-- 
Ted Dunning, CTO
DeepDyve

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 12:32 PM, Ted Dunning <te...@gmail.com> wrote:

> "same representation" doesn't have to mean that the representation doesn't
> have magic internally.
>
> It just means that if you put the same content into three different kinds
> of
> vectors, you plausibly ought to see roughly the same thing go out the wire.
> This is subject to a few caveats like the fact that a dense vector doesn't
> really know if it has only a few non-zero elements.  I would be happy if
> the
> serialized form decided that it had lots of non-zeros and thus could do
> away
> with writing all of the indexes.


Yeah, ok, I guess that is what I was getting at when Drew mentioned fully
decoupling the serialized form from the in-memory representation.  That
would
be ideal, but might be a little more work.

  -jake

Re: Writables and Inheritance

Posted by Ted Dunning <te...@gmail.com>.
"same representation" doesn't have to mean that the representation doesn't
have magic internally.

It just means that if you put the same content into three different kinds of
vectors, you plausibly ought to see roughly the same thing go out the wire.
This is subject to a few caveats like the fact that a dense vector doesn't
really know if it has only a few non-zero elements.  I would be happy if the
serialized form decided that it had lots of non-zeros and thus could do away
with writing all of the indexes.

It might also be that we should write the indexes using a compressed bit
vector format such as a run-length encoding.  That gives low overhead for
very sparse and for very dense vectors.
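
Not quite a bit vector, but to give one concrete flavor of the same idea: if
the non-zero indexes are written in ascending order, you could write the gaps
between them as variable-length ints instead of the indexes themselves - a
sketch only, with indexes and out as assumed locals:

    // assumes indexes[] is sorted ascending and out is the DataOutput;
    // WritableUtils is Hadoop's var-int helper
    int prev = 0;
    for (int index : indexes) {
      WritableUtils.writeVInt(out, index - prev);  // write the gap, not the index
      prev = index;
    }

Very dense vectors get gaps of 1 (one byte each), very sparse vectors have
few gaps to write at all, and nothing in between is much worse than the naive
encoding.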

On Tue, Jan 5, 2010 at 8:38 AM, Jake Mannix <ja...@gmail.com> wrote:

> > I would imagine the serialized form of a vector is the same for
> > SparseVector, DenseVector, etc. There's no question of representation.
> > You write out all the non-default elements.
> >
>
> This will be twice as large in the dense case (there's no need to write out
> indices). Ok, not twice as large but size() * (4 + 8) instead of size() *
> 8.
> That's a pretty significant cost in terms of disk space and IO time.




-- 
Ted Dunning, CTO
DeepDyve

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 8:51 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > Similarly, while write out can be the same for different Sparse impls,
> one
> > will be writing the index/value pairs in order, the other will not, and
> > this
> > will affect what needs to be done on reading in...
>
> You can always serialize the type of vector being written first or
> something. This is what Java serialization does too.
>

Yeah, there's nothing wrong with that, that's what I was thinking too.


> The fact that we're led to java.io.Serializable reinforces the
> question I've always had about this aspect of Hadoop -- what was so
> unusable about the existing serialization mechanism? seems like a
> needless reinvention, that foregoes some of the nice aspects of the
> serialization mechanism.
>
> ... which further leads me to comment that the *best* way of
> approaching all this would be to implement Serializable correctly.
> Then generically create one Writable wrapper that leverages
> Serializable to do its work. Then we have everything implemented
> nicely. You can use Vectors in any context that leverages standard
> serialization -- which is a big deal, for example, in J2EE.
>
> I stand by that until someone points out why this won't work or is
> slow or something at runtime; don't see it yet.
>

I can certainly try and see how doing it that way helps.


> I think the world of vectors probably does break down into, at most,
> sparse and dense representations. So maybe there are at most two
> serialization routines to write. Not bad. I don't really see what's so
> wrong-ish about needing a serialization mechanism for every distinct
> representation -- that would make logical sense at least.
>
> What other representations are we anticipating anyhow?
>

In MAHOUT-206, there are two SparseVectors - one map-based,
one array-based.  They are efficient in different ways.

Other than that, yes, there is one other case I can think of off the
top of my head: RandomVector - you don't need to keep more than
a seed, and a couple of parameters, and it can reconstruct itself
on the fly.
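
(Which is also a nice illustration of how small a serialized form can be even
when the vector is logically huge - a hypothetical RandomVectorWritable would
only need to write something like:

    out.writeInt(size);
    out.writeLong(seed);    // enough to regenerate every element on demand
    out.writeDouble(mean);  // plus whatever distribution parameters it carries
    out.writeDouble(stdDev);

and readFields() would just rebuild the vector from those few values.)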

There may be others which use the same data structures as the ones
we have, but with methods overridden in funky ways.

  -jake

Re: Writables and Inheritance

Posted by Drew Farris <dr...@gmail.com>.
On Tue, Jan 5, 2010 at 11:51 AM, Sean Owen <sr...@gmail.com> wrote:

> ... which further leads me to comment that the *best* way of
> approaching all this would be to implement Serializable correctly.
> Then generically create one Writable wrapper that leverages
> Serializable to do its work. Then we have everything implemented
> nicely. You can use Vectors in any context that leverages standard
> serialization -- which is a big deal, for example, in J2EE.
>
> I stand by that until someone points out why this won't work or is
> slow or something at runtime; don't see it yet.
>

I've seen that Java's serialization mechanism through
ObjectInput/ObjectOutputStream tends to be slower and more
memory-intensive than hand-rolled serialization with writing to
ByteBuffers or byte[]. A part of that may be just the way the Java
default mechanism handles allocating memory and moving the data
around, but the use of reflection and the verbosity of built-in
serialization are certainly factors.

In one case, hand-rolled serialization resulted in a 10x performance
improvement over ObjectOutput/ObjectInput plumbed into
ByteArray(I/O)Streams.

That doesn't mean one shouldn't attempt to implement Serializable for
convenience. I just wanted to make a point about the runtime
performance.

Drew

Re: Writables and Inheritance

Posted by Sean Owen <sr...@gmail.com>.
On Tue, Jan 5, 2010 at 4:38 PM, Jake Mannix <ja...@gmail.com> wrote:
> How does this work?  The very class in use meaning?  If you make a
> SequenceFile<IntWritable, Vector>, with the valueClass == Vector.class,
> you can never pass in something whose runtime class is just Vector, because
> it's non-instantiatable.  You can pass in something which is instanceof
> Vector,
> but getClass() != Vector.class.  Or am I confused?

Er, I may be speaking about something different that I thought was the
same thing. In a Reducer, for example, you can't specify that the
value output type is "Vector" -- it has to be "SparseVector" or another
concrete implementation.

The restriction isn't due to the generic types or anything but a
result of checks in the code like this.

It's not possible that an object's type is "just Vector" since it's an
interface, so yes the check could never pass.


> Similarly, while write out can be the same for different Sparse impls, one
> will be writing the index/value pairs in order, the other will not, and
> this
> will affect what needs to be done on reading in...

You can always serialize the type of vector being written first or
something. This is what Java serialization does too.

The fact that we're led to java.io.Serializable reinforces the
question I've always had about this aspect of Hadoop -- what was so
unusable about the existing serialization mechanism? It seems like a
needless reinvention that forgoes some of the nice aspects of the
standard mechanism.

... which further leads me to comment that the *best* way of
approaching all this would be to implement Serializable correctly.
Then generically create one Writable wrapper that leverages
Serializable to do its work. Then we have everything implemented
nicely. You can use Vectors in any context that leverages standard
serialization -- which is a big deal, for example, in J2EE.

I stand by that until someone points out why this won't work or is
slow or something at runtime; don't see it yet.
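
The wrapper really could be that generic - something like this untested
sketch, with the class name invented:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import org.apache.hadoop.io.Writable;

    public class SerializableWritable<T extends Serializable> implements Writable {
      private T value;

      public T get() { return value; }
      public void set(T value) { this.value = value; }

      @Override
      public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bytes);
        oos.writeObject(value);  // standard Java serialization does the real work
        oos.close();
        byte[] buf = bytes.toByteArray();
        out.writeInt(buf.length);
        out.write(buf);
      }

      @Override
      @SuppressWarnings("unchecked")
      public void readFields(DataInput in) throws IOException {
        byte[] buf = new byte[in.readInt()];
        in.readFully(buf);
        try {
          value = (T) new ObjectInputStream(new ByteArrayInputStream(buf)).readObject();
        } catch (ClassNotFoundException e) {
          throw new IOException("can't deserialize: " + e);
        }
      }
    }

One wrapper, zero per-Vector serialization code, and the Vectors stay usable
anywhere else Serializable is expected.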


> Oh, it'll work - I'm just seeing that if we get more Vector
> implementations,
> to keep serialization separate from the math stuff, it means a proliferation
> of classes (FooVectorWritable...) and an ever expanding switch statement
> in the VectorWritable, which could get fragile, and wondered whether there
> was a "best practice" way this should be done in Hadoop, so you can have
> Writables which actually live in a useful hierarchy decorating some helpful
> information on top of classes which do other things.  Mahout trunk treats
> Writable like it was Serializable (or more precisely: Externalizable),
> which
> is great and object-oriented and nice.  Except that Hadoop totally breaks
> proper OOP and doesn't let you do that right.

I think the world of vectors probably does break down into, at most,
sparse and dense representations. So maybe there are at most two
serialization routines to write. Not bad. I don't really see what's so
wrong-ish about needing a serialization mechanism for every distinct
representation -- that would make logical sense at least.

What other representations are we anticipating anyhow?

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 1:36 AM, Sean Owen <sr...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 9:18 AM, Jake Mannix <ja...@gmail.com> wrote:
> >  From what I can tell, in SequenceFile.Writer#append(Object key, Object
> > value) (why on earth is it taking Objects?  shouldn't these be
> Writables?),
>
> There's also a version that takes Writables. Why, I don't know, but
> assume you're triggering the other one.
>

Yeah, I saw that after I wrote it.  I realized why: because a Serializer
can take non-Writable objects and write them, if one is implemented to do
so.


> > it does an explicit check of key.getClass == this.keyClass and
> > value.getClass() == this.valueClass, which won't do any subclass matching
> > (and so will fail if value.getClass() is DenseVector.class, and
> valueClass
> > is SparseVector.class, or just Vector.class).
>
> Yes, I've hit this too. You can say the value or key class is an
> interface for this reason. It has to be the very class in use. I can
> imagine reasons for this.
>

How does this work?  What does "the very class in use" mean?  If you make a
SequenceFile<IntWritable, Vector>, with the valueClass == Vector.class,
you can never pass in something whose runtime class is just Vector, because
it's non-instantiable.  You can pass in something which is instanceof Vector,
but getClass() != Vector.class.  Or am I confused?


> >  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> > would be to have one overall VectorWritable class, which can
> > serialize/deserialize all Vector implementations.  Right?  This is how
> I've
>
> Yes this is what I imagine.
>

Ok, good. That's how I've been working.


> > is a pain - you need to either move all the write(DataOutput) and
> > readFields(DataInput) methods from the vector implementations into the
> new
> > VectorWritable, and have a big switch statement deciding which one to
> call,
>
> I would imagine the serialized form of a vector is the same for
> SparseVector, DenseVector, etc. There's no question of representation.
> You write out all the non-default elements.
>

This will be much larger in the dense case (there's no need to write out
indices) - ok, not twice as large, but size() * (4 + 8) bytes instead of
size() * 8.  That's a pretty significant cost in terms of disk space and
I/O time.

Similarly, while writing out can be the same for different sparse impls, one
will be writing the index/value pairs in order and the other will not, and
this will affect what needs to be done when reading them back in...


> Reading in, yes there is some element of choice, and your heuristic is
> fine. VectorWritable creates a Vector which can be obtained by the
> caller, and could be sparse or dense.
>
> Is your point that this won't do for some reason?
>

Oh, it'll work - I'm just seeing that if we get more Vector implementations,
then keeping serialization separate from the math stuff means a proliferation
of classes (FooVectorWritable...) and an ever-expanding switch statement
in the VectorWritable, which could get fragile, and I wondered whether there
was a "best practice" way this should be done in Hadoop, so you can have
Writables which actually live in a useful hierarchy, decorating some helpful
information on top of classes which do other things.  Mahout trunk treats
Writable like it was Serializable (or more precisely, Externalizable), which
is great and object-oriented and nice.  Except that Hadoop totally breaks
proper OOP and doesn't let you do that right.

I was just hoping that I was misunderstanding how Hadoop works in some
way.  At this point I don't think I was, unfortunately.

This should work fine; it's just not the way I'd do it if I were designing it
myself.

  -jake

Re: Writables and Inheritance

Posted by Sean Owen <sr...@gmail.com>.
On Tue, Jan 5, 2010 at 9:18 AM, Jake Mannix <ja...@gmail.com> wrote:
>  From what I can tell, in SequenceFile.Writer#append(Object key, Object
> value) (why on earth is it taking Objects?  shouldn't these be Writables?),

There's also a version that takes Writables. Why, I don't know, but
assume you're triggering the other one.

> it does an explicit check of key.getClass == this.keyClass and
> value.getClass() == this.valueClass, which won't do any subclass matching
> (and so will fail if value.getClass() is DenseVector.class, and valueClass
> is SparseVector.class, or just Vector.class).

Yes, I've hit this too. You can say the value or key class is an
interface for this reason. It has to be the very class in use. I can
imagine reasons for this.


>  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> would be to have one overall VectorWritable class, which can
> serialize/deserialize all Vector implementations.  Right?  This is how I've

Yes this is what I imagine.


> is a pain - you need to either move all the write(DataOutput) and
> readFields(DataInput) methods from the vector implementations into the new
> VectorWritable, and have a big switch statement deciding which one to call,

I would imagine the serialized form of a vector is the same for
SparseVector, DenseVector, etc. There's no question of representation.
You write out all the non-default elements.

Reading in, yes there is some element of choice, and your heuristic is
fine. VectorWritable creates a Vector which can be obtained by the
caller, and could be sparse or dense.

Is your point that this won't do for some reason?

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 10:27 AM, Drew Farris <dr...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 1:08 PM, Jake Mannix <ja...@gmail.com> wrote:
>
> I assumed it could be done similarly to the way in which it is
> currently done for Vector implementations, but I'm surprised this even
> works in light of Hadoop's exact class matching. I'll have to crack
> open the Mahout code later to take a closer look.
>
>
It's hard to find where the ugly little pieces bite you - because you can
certainly have Mapper<IntWritable, Vector, Foo, Bar> (and we do - CanopyMapper,
for instance), and if you unit test with DummyOutputCollector, you'll never
see the ugliness - it only rears its head once you whip out the SequenceFile.

See: HADOOP-5452 <http://issues.apache.org/jira/browse/HADOOP-5452> for more
details.

  -jake

Re: Writables and Inheritance

Posted by Drew Farris <dr...@gmail.com>.
On Tue, Jan 5, 2010 at 1:08 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> How would you specify which Writable implementation at runtime?  You
> have Mapper and Reducers which are keyed on Writable types... you need
> to pick which one to use.

I assumed it could be done similarly to the way in which it is
currently done for Vector implementations, but I'm surprised this even
works in light of Hadoop's exact class matching. I'll have to crack
open the Mahout code later to take a closer look.

Drew

Re: Writables and Inheritance

Posted by Jake Mannix <ja...@gmail.com>.
On Tue, Jan 5, 2010 at 10:02 AM, Drew Farris <dr...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 11:46 AM, Drew Farris <dr...@gmail.com>
> wrote:
>
> >
> > Have you seen any cases where a class hierarchy of Writables is
> > established to do something like that? E.g the mapreduce jobs are
> > written to use VectorWritable, but subclasses (e.g
> > SparseVectorWritable) are available for specific needs?
> >
> >
> Bah, nevermind -- this is precisely what Mahout does today without
> separating the Vector and Writable portions into two separate classes.
> Serious brain lapse that one.
>

Yeah, that's what isn't working well - Hadoop likes to check for an exact
match on classes, which kills proper OOD.  There may be a reason for it,
but I can't see it.


> Of course this would probably be a very straightforward approach to
> implement: simply separate out the Writable portions of each Vector
> implementation into its own class. The Writable implementation to use would
> be specified at runtime, and this would also determine which underlying
> Vector implementation is used. The price we pay for separating the Writable
> stuff from the Vectors is an extra class that implements Writable for each
> implementation. Since the Writable (and thus implementation) to use is
> specified at runtime via options, there's no need for an ugly switch
> statement anywhere.
>

How would you specify which Writable implementation to use at runtime?  You
have Mappers and Reducers which are keyed on Writable types... you need
to pick which one to use.


> Theoretically one could even decouple the writable (serialization style)
> from the (in-memory) implementation, but I don't know if there is any need
> for that whatsoever.
>

Yeah, I'd like this, because the two different SparseVector impls have
different in-memory structures, but basically the same serialization
(key-value pairs of int and double).  I think I can find a way to get this
to work.  Just not sure how ugly it would get.

  -jake

Re: Writables and Inheritance

Posted by Drew Farris <dr...@gmail.com>.
On Tue, Jan 5, 2010 at 11:46 AM, Drew Farris <dr...@gmail.com> wrote:

>
> Have you seen any cases where a class hierarchy of Writables is
> established to do something like that? E.g the mapreduce jobs are
> written to use VectorWritable, but subclasses (e.g
> SparseVectorWritable) are available for specific needs?
>
>
Bah, nevermind -- this is precisely what Mahout does today without
separating the Vector and Writable portions into two separate classes.
Serious brain lapse that one.

Of course this would probably be a very straightforward approach to
implement: simply separate out the Writable portions of each Vector
implementation into its own class. The Writable implementation to use would
be specified at runtime, and this would also determine which underlying
Vector implementation is used. The price we pay for separating the Writable
stuff from the Vectors is an extra class that implements Writable for each
implementation. Since the Writable (and thus implementation) to use is
specified at runtime via options, there's no need for an ugly switch
statement anywhere.
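
Something along these lines is what I have in mind - the option name and the
Writable classes are hypothetical here, just to show the mechanics:

    // at job setup time, the user (or a driver class) picks the implementation:
    conf.setClass("mahout.vector.writable",  // hypothetical option name
        SparseVectorWritable.class, Writable.class);

    // inside a Mapper/Reducer's configure()/setup():
    Class<? extends Writable> writableClass =
        conf.getClass("mahout.vector.writable",
            DenseVectorWritable.class, Writable.class);
    Writable vectorWritable = ReflectionUtils.newInstance(writableClass, conf);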

Theoretically one could even decouple the writable (serialization style)
from the (in-memory) implementation, but I don't know if there is any need
for that whatsoever.

Drew

Re: Writables and Inheritance

Posted by Drew Farris <dr...@gmail.com>.
On Tue, Jan 5, 2010 at 4:18 AM, Jake Mannix <ja...@gmail.com> wrote:

> This is how I've
> in general looked at Writables - they tend very much to be very loosely
> object oriented, in the sense that they are typically just wrappers around
> some data object, and provide marshalling/unmarshalling capabilities for
> said object (but the Writable itself rarely (ever?) actually also implements
> any useful interface the held object implements - when you want said object,
> you Writable.get() on it to fetch the inner guy).

Yes, this is exactly the approach taken (from an api perspective at
least) with the Text Writable in hadoop-core for instance.

>  Or is there a better way to do this?  What I really think is necessary, as
> an end-goal, is for us to be able to spit out int + Vector key-value pairs
> from mappers and reducers, and not need to know which kind they are in the
> mapper or reducer

Perhaps the space advantages of sparse and dense serialized forms
suggest the need for SparseVectorWritable and DenseVectorWritable?
Implementors could either choose which to use, or perhaps allow
a specific implementation to be plugged in at runtime, in a way similar
to how similarity measures are injected. I suspect there must be some
way to hint to a single VectorWritable class what sort of vector the
sparse data should be read into.

Have you seen any cases where a class hierarchy of Writables is
established to do something like that? E.g. the mapreduce jobs are
written to use VectorWritable, but subclasses (e.g.
SparseVectorWritable) are available for specific needs?

Drew