You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/09/13 14:22:20 UTC

VectorWritable subclassing?

Another nugget of design to consider over tea time.

I have an potential concern about VectorWritable. It is now subclassed
by MultiLabelVectorWritable and WeightedVectorWritable. That's OK, if
the output of these Writables is never intended for consumption by a
job that more generically expects Vectors.

The way Writables work, the way to implement that behavior would not
be subclassing. VectorWritable is a factory class for Vector
subclasses, in essence, and so is not subclassed itself.

Even if there is no such interoperability intended, I think the design
here should be adjusted. This is a case for composition rather than
inheritance, since subclassing VectorWritable suggests it can be
substituted for VectorWritable, and it cannot. The structure has also
caused, for example, a private field to become protected and things
like that.

Re: VectorWritable subclassing?

Posted by Drew Farris <dr...@apache.org>.
I'm think I'm missing something. Could I use a
SequenceFile<Integer,MultiLabelVectorWritable> as input to anything
that takes SequenceFile<Integer,VectorWritable>? The get() method
would continue to return a Vector of some sort. I have this vague
recollection that hadoop has problems with this sort of subclassing,
but I don't remember the details.

> The way Writables work, the way to implement that behavior would not
> be subclassing. VectorWritable is a factory class for Vector
> subclasses, in essence, and so is not subclassed itself.

To this end, should VectorWritable be marked final?

Re: VectorWritable subclassing?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/13/10 5:29 AM, Robin Anil wrote:
> On top of that I question whether we need MultiLabelVector and
> WeightedVector at all, cant multilabel and instance weight be nullable
> fields inside Vector. I dont see a big win (space efficiency) in keeping a
> verbose name and separate subclasses just to add 2 fields. A boolean bit can
> be kept to see if weight or labels are serialized.
>
Didn't we start out with names as vector instance variables and refactor 
them into NamedVector? I agree with Sean's points about subclassing 
VectorWritable and would be +1 to convert WeightedVectorWritable into a 
WeightedVector wrapper instead. The same for MultiLabelVector.

Re: VectorWritable subclassing?

Posted by Sean Owen <sr...@gmail.com>.
No I wouldn't back it out. I'd like to hear more comments if any and
then just evolve it a little.

On Mon, Sep 13, 2010 at 4:59 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> That's just what WeightedVectorWritable did prior to r996363 yesterday. We
> should back out that commit?
>

Re: VectorWritable subclassing?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  On 9/13/10 8:36 AM, Sean Owen wrote:
> Well that goes down another interesting road. I think we have all
> enjoyed the idea of keeping "decoration" out of the core Vector
> implementation. Vector and its subclasses represent only different
> ways of representing elements and values.
>
> Notions like name (and ideally, labels) are farmed out to a decorator
> like NamedVector.
>
> This is all wonderful in the object-oriented world. The language and
> object layout in memory are most happy for you to treat a NamedVector
> as just a Vector; the extra data in memory is irrelevant.
>
> It gets tricky when trying to write Writables for all this, since when
> reading a sequence of objects from a stream you can't somehow know a
> priori that there's more data in there than you expect and what to
> ignore, and how. You don't have a parallel hierarchy of Writables --
> it doesn't work that way. Instead you need one factory
> (VectorWritable) that has knowledge of the serialized form of all
> these things.
>
> (Well, we did initially just serialize with each Vector the name of
> its corresponding Writable. This is a tidy solution indeed, but is a
> lot of overhead. So that went away.)
>
>
> Back to Robin's point:
> If there is a need for such a thing as a "weighted vector", then I
> suggest that instead of injecting a field in Vector, it become another
> decorator class. Likewise, labels should really be handled this way.
> Yes, then VectorWritable needs another header bit for "weighted" and
> needs to reconstruct the vector appropriately. It starts to get messy,
> but works.
>
> My original question was, do we need a "weighted vector" entity? or is
> this only used in a context where one needs to serialize "a vector,
> and a weight too". In the latter case, fine, easy: it should simply
> compose rather than extend VectorWritable IMHO.

That's just what WeightedVectorWritable did prior to r996363 yesterday. 
We should back out that commit?

Re: VectorWritable subclassing?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
  +1 That change would touch everything and for what? To avoid a couple 
extra vector wrappers, maybe an explicit cast and a bit more headache in 
VectorWritable? I'm happy to change WeightedVectorWritable back to not 
subclass VW or into a WeightedVector wrapper liked NamedVector. It's 
only used by the clustering steps and the cluster dumper anyway.

On 9/13/10 2:17 PM, Ted Dunning wrote:
> No need to fix a problem until it exists.
>
> On Mon, Sep 13, 2010 at 2:10 PM, Sean Owen<sr...@gmail.com>  wrote:
>
>> Adding a generic type means all "VectorWritable" become
>> "VectorWritable<Vector>" by default and that could be more mess than
>> it's worth. I'm leaning towards not adding that generic type.
>>


Re: VectorWritable subclassing?

Posted by Ted Dunning <te...@gmail.com>.
No need to fix a problem until it exists.

On Mon, Sep 13, 2010 at 2:10 PM, Sean Owen <sr...@gmail.com> wrote:

> Adding a generic type means all "VectorWritable" become
> "VectorWritable<Vector>" by default and that could be more mess than
> it's worth. I'm leaning towards not adding that generic type.
>

Re: VectorWritable subclassing?

Posted by Sean Owen <sr...@gmail.com>.
I glanced, and I couldn't find any instance where the deserialized
Vector is actually cast to a concrete subclass. That's good. It
suggests just about nothing is written to depend on a particular
implementation.

Adding a generic type means all "VectorWritable" become
"VectorWritable<Vector>" by default and that could be more mess than
it's worth. I'm leaning towards not adding that generic type.

Sean

On Mon, Sep 13, 2010 at 9:48 PM, Ted Dunning <te...@gmail.com> wrote:
> Much better answer.  I don't think it is abuse at all ... generics are there
> to save explicit casts.
>
> Keeping the final in place is also good.

Re: VectorWritable subclassing?

Posted by Ted Dunning <te...@gmail.com>.
Much better answer.  I don't think it is abuse at all ... generics are there
to save explicit casts.

Keeping the final in place is also good.

On Mon, Sep 13, 2010 at 1:43 PM, Sean Owen <sr...@gmail.com> wrote:

> You could save this cast with a generic type on
> VectorWritable too. While it might be abusing the meaning of generics
> a little, it would serve a practical purpose. No subclasses needed.
>

Re: VectorWritable subclassing?

Posted by Sean Owen <sr...@gmail.com>.
On Mon, Sep 13, 2010 at 9:35 PM, Ted Dunning <te...@gmail.com> wrote:
> BUT... there is one possible role for sub-classes of VectorWritable.  That
> would be to avoid the otherwise necessary cast
> of the object that is produced by the VectorWritable.  Thus a
> MumbleVectorWritable would delegate all reading to VectorWritable
> but would cast the result to a MumbleVector before returning it.  That cast
> would fail if the objects being read don't sub-class
> MumbleVector and the user code would not need a cast.

It's possible. The implementation-specific Writable saves you cast --
but yes internally it is probably just doing a cast of the result from
VectorWritable. You could save this cast with a generic type on
VectorWritable too. While it might be abusing the meaning of generics
a little, it would serve a practical purpose. No subclasses needed.

Re: VectorWritable subclassing?

Posted by Ted Dunning <te...@gmail.com>.
I agree that VectorWritable should handle construction of all vector types
and it should understand how to do that.

BUT... there is one possible role for sub-classes of VectorWritable.  That
would be to avoid the otherwise necessary cast
of the object that is produced by the VectorWritable.  Thus a
MumbleVectorWritable would delegate all reading to VectorWritable
but would cast the result to a MumbleVector before returning it.  That cast
would fail if the objects being read don't sub-class
MumbleVector and the user code would not need a cast.

That isn't a big deal, though, and I would be +epsilon for the final marking
since it might be more maintainable in the long run since
anybody who hasn't heard this discussion would almost have to look at the
comment if they tried to sub-class VectorWritable.

On Mon, Sep 13, 2010 at 8:36 AM, Sean Owen <sr...@gmail.com> wrote:

> No, and that's the issue, really. A file of MultiLableVectorWritable
> cannot be read by VectorWritable since the latter does not expect that
> extra data. It's not quite a Hadoop issue, but simply that the OO
> world's object representation in memory doesn't exactly translate to
> serializing to a stream neatly.
>
> Yes I would mark VectorWritable final.
>

Re: VectorWritable subclassing?

Posted by Sean Owen <sr...@gmail.com>.
Well that goes down another interesting road. I think we have all
enjoyed the idea of keeping "decoration" out of the core Vector
implementation. Vector and its subclasses represent only different
ways of representing elements and values.

Notions like name (and ideally, labels) are farmed out to a decorator
like NamedVector.

This is all wonderful in the object-oriented world. The language and
object layout in memory are most happy for you to treat a NamedVector
as just a Vector; the extra data in memory is irrelevant.

It gets tricky when trying to write Writables for all this, since when
reading a sequence of objects from a stream you can't somehow know a
priori that there's more data in there than you expect and what to
ignore, and how. You don't have a parallel hierarchy of Writables --
it doesn't work that way. Instead you need one factory
(VectorWritable) that has knowledge of the serialized form of all
these things.

(Well, we did initially just serialize with each Vector the name of
its corresponding Writable. This is a tidy solution indeed, but is a
lot of overhead. So that went away.)


Back to Robin's point:
If there is a need for such a thing as a "weighted vector", then I
suggest that instead of injecting a field in Vector, it become another
decorator class. Likewise, labels should really be handled this way.
Yes, then VectorWritable needs another header bit for "weighted" and
needs to reconstruct the vector appropriately. It starts to get messy,
but works.

My original question was, do we need a "weighted vector" entity? or is
this only used in a context where one needs to serialize "a vector,
and a weight too". In the latter case, fine, easy: it should simply
compose rather than extend VectorWritable IMHO.


To Drew's question:

No, and that's the issue, really. A file of MultiLableVectorWritable
cannot be read by VectorWritable since the latter does not expect that
extra data. It's not quite a Hadoop issue, but simply that the OO
world's object representation in memory doesn't exactly translate to
serializing to a stream neatly.

Yes I would mark VectorWritable final.


To Jeff:

Yeah I guess we're agreed there. In the interest of not rocking the
boat too much I'd be pleased to tease apart these Writables to start,
have a good discussion here, and then make these wrappers if needed
later. It's not yet clear to me we need "WeightedVector", for
instance.



On Mon, Sep 13, 2010 at 1:29 PM, Robin Anil <ro...@gmail.com> wrote:
> On top of that I question whether we need MultiLabelVector and
> WeightedVector at all, cant multilabel and instance weight be nullable
> fields inside Vector. I dont see a big win (space efficiency) in keeping a
> verbose name and separate subclasses just to add 2 fields. A boolean bit can
> be kept to see if weight or labels are serialized.
>

Re: VectorWritable subclassing?

Posted by Robin Anil <ro...@gmail.com>.
On top of that I question whether we need MultiLabelVector and
WeightedVector at all, cant multilabel and instance weight be nullable
fields inside Vector. I dont see a big win (space efficiency) in keeping a
verbose name and separate subclasses just to add 2 fields. A boolean bit can
be kept to see if weight or labels are serialized.