You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/06/14 21:55:09 UTC

Questions about Vector

While leafing through the Vector API today, I had a few questions...

toArray() -- should we really have this? the values, without the
dimensions/indices, don't seem necessarily meaningful. It is used in
one place, DenseMatrix.assignRow(), but seems like there is a
potential bug there. The way its used, assumes the vector is
non-sparse.

get()/getQuick() -- can these be merged now? seems like we would
rather not have this distinction.

copy() -- the semantics here seem identical to the standard clone()
method on Object. Should we just switch to clone()?

Also what is the intended difference between size and cardinality --
size is number of elements actually set to a non-zero value,
cardinality is the actual dimension of the vector?

Re: Questions about Vector

Posted by Benson Margulies <bi...@gmail.com>.

I don't claim that an 'end-user' has much business. Just a
class-that-extends. I'll all for your proposed names. Watch out for
'zero' if the 'default' value is in fact configurable.

On Mon, Jun 15, 2009 at 7:54 PM, Sean Owen<sr...@gmail.com> wrote:
> Yeah exactly, not entirely clear why the *caller* should care about
> this; the class should take care of stuff like that.
>
> At least I propose a rename of size() to getNumNonzeroElements() or
> somesuch, and rename of cardinality() to size().
>
> On Tue, Jun 16, 2009 at 12:46 AM, Benson Margulies<bi...@gmail.com> wrote:
>> Lets say you have an algorithm in which a vector starts out sparse,
>> but gradually fills in. At some point, you want to give up and switch
>> to an ordinary vector. This came up in Brown and DiPietro's
>> entropy-based bigram information word clustering algorithm, and I
>> think it could come up elsewhere.
>>
>> A more complex class based on what you're doing here would use this
>> value to decide when to give up and use a dense vector instead.
>

Re: Questions about Vector

Posted by Sean Owen <sr...@gmail.com>.

Yeah exactly, not entirely clear why the *caller* should care about
this; the class should take care of stuff like that.

At least I propose a rename of size() to getNumNonzeroElements() or
somesuch, and rename of cardinality() to size().

On Tue, Jun 16, 2009 at 12:46 AM, Benson Margulies<bi...@gmail.com> wrote:
> Lets say you have an algorithm in which a vector starts out sparse,
> but gradually fills in. At some point, you want to give up and switch
> to an ordinary vector. This came up in Brown and DiPietro's
> entropy-based bigram information word clustering algorithm, and I
> think it could come up elsewhere.
>
> A more complex class based on what you're doing here would use this
> value to decide when to give up and use a dense vector instead.

Re: Questions about Vector

Posted by Benson Margulies <bi...@gmail.com>.

Lets say you have an algorithm in which a vector starts out sparse,
but gradually fills in. At some point, you want to give up and switch
to an ordinary vector. This came up in Brown and DiPietro's
entropy-based bigram information word clustering algorithm, and I
think it could come up elsewhere.

A more complex class based on what you're doing here would use this
value to decide when to give up and use a dense vector instead.

On Mon, Jun 15, 2009 at 7:42 PM, Sean Owen<sr...@gmail.com> wrote:
> I don't dispute it but what is the use case? I am mostly curious at this point.
>
> On Tue, Jun 16, 2009 at 12:40 AM, Benson Margulies<bi...@gmail.com> wrote:
>> The 'numberOfNonDefaultValueElements' is useful. I'd give it an
>> accessor, with, well, that very name.
>>
>> On Mon, Jun 15, 2009 at 5:20 PM, Sean Owen<sr...@gmail.com> wrote:
>>> OK don't want to push on this last bit too much, but I still see a
>>> small concern here.
>

Re: Questions about Vector

Posted by Sean Owen <sr...@gmail.com>.

I don't dispute it but what is the use case? I am mostly curious at this point.

On Tue, Jun 16, 2009 at 12:40 AM, Benson Margulies<bi...@gmail.com> wrote:
> The 'numberOfNonDefaultValueElements' is useful. I'd give it an
> accessor, with, well, that very name.
>
> On Mon, Jun 15, 2009 at 5:20 PM, Sean Owen<sr...@gmail.com> wrote:
>> OK don't want to push on this last bit too much, but I still see a
>> small concern here.

Re: Questions about Vector

Posted by Benson Margulies <bi...@gmail.com>.

The 'numberOfNonDefaultValueElements' is useful. I'd give it an
accessor, with, well, that very name.

On Mon, Jun 15, 2009 at 5:20 PM, Sean Owen<sr...@gmail.com> wrote:
> OK don't want to push on this last bit too much, but I still see a
> small concern here.
>
> I think it will be error-prone to call this method size(). On Vector,
> I would strongly expect that this tells me the logical number of
> elements of the vector -- which is its cardinality. This is what
> seeing methods like List.size() trains one to expect. It's not that
> though, and, right off the bat we already see a bug caused by this.
>
> One option is renaming. I might further ask, while we're at it,
> whether we need to expose both values. When do I need to know the
> number of non-zero elements, separately from any other operation?
>
> To give another example, Vector 'cardinality' maps to List 'size'
> right now, and you could say Vector 'size' maps to List 'capacity'.
> But List exposes no method regarding its capacity.
>
> Don't want to push too hard on this; it's a minor thing. But I figure
> it's early, and such a key abstraction, that it is worth a bit of
> discussion.
>
> Sean
>
> On Mon, Jun 15, 2009 at 6:16 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>> Whoops, yes, cardinality is intended. Fortunately, the alpha vector in
>> Dirichlet is dense so it does not impact correctness, such as it is.
>>
>>
>

Re: Questions about Vector

Posted by Sean Owen <sr...@gmail.com>.

OK don't want to push on this last bit too much, but I still see a
small concern here.

I think it will be error-prone to call this method size(). On Vector,
I would strongly expect that this tells me the logical number of
elements of the vector -- which is its cardinality. This is what
seeing methods like List.size() trains one to expect. It's not that
though, and, right off the bat we already see a bug caused by this.

One option is renaming. I might further ask, while we're at it,
whether we need to expose both values. When do I need to know the
number of non-zero elements, separately from any other operation?

To give another example, Vector 'cardinality' maps to List 'size'
right now, and you could say Vector 'size' maps to List 'capacity'.
But List exposes no method regarding its capacity.

Don't want to push too hard on this; it's a minor thing. But I figure
it's early, and such a key abstraction, that it is worth a bit of
discussion.

Sean

On Mon, Jun 15, 2009 at 6:16 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> Whoops, yes, cardinality is intended. Fortunately, the alpha vector in
> Dirichlet is dense so it does not impact correctness, such as it is.
>
>

Re: Questions about Vector

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Sean Owen wrote:
> On Mon, Jun 15, 2009 at 4:17 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>   
>> The difference is getQuick does not check bounds.
>>     
>
> This does make sense for practical reasons. From an API perspective,
> seems like it's slightly undesirable to have two of these methods, one
> of which exists just to bypass the checks, as it might lead to silent
> bugs. I see the purpose: performance. but in that case, as I am
> getting to in MAHOUT-121, perhaps it is better to attack that problem
> in a different way? for instance, if most of the setting is done in
> vector construction, and I imagine it is, then we might want instead
> some kind of bulk-set method.
>
>
>   
I have no strong attachment to the quick methods. Ted proposed them in 
the early discussion of the interface and I implemented them. IIRC, 
there was not much discussion of this sort at the time.
>> IIRC, clone of a SparseVector would share the map. Not desirable IMHO.
>>     
>
> The default implementation of clone() produces a shallow copy, indeed,
> but that is why it may be overridden. It seems slightly better to just
> call this method clone() and implement Cloneable to fit into the
> standard APIs.
>
>
>   
ditto
>> Cardinality is the actual dimension and size is the current number of
>> elements. It only differs from cardinality in the case of SparseVector, in
>> which case size returns the number of non-zero elements.
>>     
>
> Got it. Is size() a useful value to expose? seems like an
> implementation detail. What would a caller, in general, do with that?
> You could say it can't hurt to expose, but for instance, the first use
> of it that came up in my IDE appears to be a bug (could be wrong) in
> UncommonsDistributions:
>
>   public static Vector rDirichlet(Vector alpha) {
>     Vector r = alpha.like();
>     double total = alpha.zSum();
>     double remainder = 1;
>     for (int i = 0; i < r.size(); i++) {
>       double a = alpha.get(i);
>       total -= a;
>       double beta = rBeta(a, Math.max(0, total));
>       double p = beta * remainder;
>       r.set(i, p);
>       remainder -= p;
>     }
>     return r;
>   }
>
> Is cardinality intended?
>
>
>   
Whoops, yes, cardinality is intended. Fortunately, the alpha vector in 
Dirichlet is dense so it does not impact correctness, such as it is.

Re: Questions about Vector

Posted by Sean Owen <sr...@gmail.com>.

OK I have ready the changes described in this thread, but I didn't go
ahead with removing "quick" methods. It proved to be a very large
change, and, there were some situations where it looks like we need to
think a little harder about what to do with the resulting code; it's
not really just a matter of removing one method, which would end up
incurring a lot of overhead in normal operations. That is to say --
the problem is not merely solved with a bulk update method either.

On Tue, Jun 16, 2009 at 1:43 AM, Ted Dunning<te...@gmail.com> wrote:
> IN fact, setQuick could be viewed as a performance degrading option since it
> leads people to imagine that they have "optimized" their code.  set() is
> generally inlined and the bounds checks are often lifted out of the loop so
> there would be no difference.  As Sean suggest, however, there is a huge
> benefit to be gained by block updates.
>
> Thus, eliminating setQuick might encourage people to avoid complacency and
> seek out a bulk update.
>

Re: Questions about Vector

Posted by Ted Dunning <te...@gmail.com>.

IN fact, setQuick could be viewed as a performance degrading option since it
leads people to imagine that they have "optimized" their code.  set() is
generally inlined and the bounds checks are often lifted out of the loop so
there would be no difference.  As Sean suggest, however, there is a huge
benefit to be gained by block updates.

Thus, eliminating setQuick might encourage people to avoid complacency and
seek out a bulk update.

On Mon, Jun 15, 2009 at 9:59 AM, Sean Owen <sr...@gmail.com> wrote:

> > The difference is getQuick does not check bounds.
>
> This does make sense for practical reasons. From an API perspective,
> seems like it's slightly undesirable to have two of these methods, one
> of which exists just to bypass the checks, as it might lead to silent
> bugs. I see the purpose: performance. but in that case, as I am
> getting to in MAHOUT-121, perhaps it is better to attack that problem
> in a different way? for instance, if most of the setting is done in
> vector construction, and I imagine it is, then we might want instead
> some kind of bulk-set method.

Re: Questions about Vector

Posted by Sean Owen <sr...@gmail.com>.

On Mon, Jun 15, 2009 at 4:17 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> The difference is getQuick does not check bounds.

This does make sense for practical reasons. From an API perspective,
seems like it's slightly undesirable to have two of these methods, one
of which exists just to bypass the checks, as it might lead to silent
bugs. I see the purpose: performance. but in that case, as I am
getting to in MAHOUT-121, perhaps it is better to attack that problem
in a different way? for instance, if most of the setting is done in
vector construction, and I imagine it is, then we might want instead
some kind of bulk-set method.

> IIRC, clone of a SparseVector would share the map. Not desirable IMHO.

The default implementation of clone() produces a shallow copy, indeed,
but that is why it may be overridden. It seems slightly better to just
call this method clone() and implement Cloneable to fit into the
standard APIs.

> Cardinality is the actual dimension and size is the current number of
> elements. It only differs from cardinality in the case of SparseVector, in
> which case size returns the number of non-zero elements.

Got it. Is size() a useful value to expose? seems like an
implementation detail. What would a caller, in general, do with that?
You could say it can't hurt to expose, but for instance, the first use
of it that came up in my IDE appears to be a bug (could be wrong) in
UncommonsDistributions:

  public static Vector rDirichlet(Vector alpha) {
    Vector r = alpha.like();
    double total = alpha.zSum();
    double remainder = 1;
    for (int i = 0; i < r.size(); i++) {
      double a = alpha.get(i);
      total -= a;
      double beta = rBeta(a, Math.max(0, total));
      double p = beta * remainder;
      r.set(i, p);
      remainder -= p;
    }
    return r;
  }

Is cardinality intended?

Re: Questions about Vector

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Sean Owen wrote:
> While leafing through the Vector API today, I had a few questions...
>
> toArray() -- should we really have this? the values, without the
> dimensions/indices, don't seem necessarily meaningful. It is used in
> one place, DenseMatrix.assignRow(), but seems like there is a
> potential bug there. The way its used, assumes the vector is
> non-sparse.
>   
SparseVector will materialize the entire array, including zero-valued 
elements. I'm not sure this would ever be a desired outcome and am ok 
removing the operation.
> get()/getQuick() -- can these be merged now? seems like we would
> rather not have this distinction.
>   
The difference is getQuick does not check bounds.
> copy() -- the semantics here seem identical to the standard clone()
> method on Object. Should we just switch to clone()?
>   
IIRC, clone of a SparseVector would share the map. Not desirable IMHO.
> Also what is the intended difference between size and cardinality --
> size is number of elements actually set to a non-zero value,
> cardinality is the actual dimension of the vector?
>
>   
Cardinality is the actual dimension and size is the current number of 
elements. It only differs from cardinality in the case of SparseVector, 
in which case size returns the number of non-zero elements.