Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/01/13 21:18:29 UTC

Re: CardinalityException in DirichletDriver

Can I ask a dumb question -- why is this? Conceptually vectors don't
have a maximum size. They just have values at some dimensions and 0
elsewhere. Dotting two vectors is always well-defined.

Certainly particular implementations have a notion of maximum size:
DenseVector. But I'd think it's an implementation-specific possible
error case.

On Wed, Jan 13, 2010 at 8:13 PM, Ted Dunning <te...@gmail.com> wrote:
> dot product is a vector operation that is the sum of products of
> corresponding elements of the two vectors being operated on.  If these
> vectors don't have the same length, then it is an error.

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.

The idea of dot products and most vector implementations come from the
linear algebra world.  There, the key concept is a vector space with a fixed
number of dimensions (and dot products are only simple sums of products for
certain choices of coordinate system).  Essentially all implementations have
inherited this notion of dimension and consider it a serious problem when
dot'ing vectors of different dimensions.

This is the same consideration that causes the multiplication of
non-conformable matrices to be considered an error.  You could think of
matrices as being unbounded in dimension, in which case all matrices are
conformable, but this is definitely not the traditional notion nor
implementation.

It is a nice side effect of our implementation that you can define a vector
size as MAX_INT, but that doesn't change the notion of conformability.
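
A minimal sketch of the conformability check described above, assuming a
hypothetical SketchSparseVector class (this is not Mahout's actual vector
class, and a plain IllegalArgumentException stands in for
CardinalityException).  Declaring every vector with cardinality
Integer.MAX_VALUE makes them all conformable, but the check itself is
unchanged:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector for illustration only; not Mahout's class.
final class SketchSparseVector {
    private final int cardinality;  // declared dimension of the vector space
    private final Map<Integer, Double> values = new HashMap<>();

    SketchSparseVector(int cardinality) {
        this.cardinality = cardinality;
    }

    SketchSparseVector set(int index, double value) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException(
                "index " + index + " outside cardinality " + cardinality);
        }
        values.put(index, value);
        return this;
    }

    // Runtime conformability check, then a sum of products over the
    // entries stored in this vector (absent entries count as 0).
    double dot(SketchSparseVector other) {
        if (cardinality != other.cardinality) {
            throw new IllegalArgumentException(
                "cardinality mismatch: " + cardinality + " vs " + other.cardinality);
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sum += e.getValue() * other.values.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }
}
```

Two vectors built with new SketchSparseVector(Integer.MAX_VALUE) can always
be dotted; one built with cardinality 100 cannot be dotted with either of
them.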

In the Dirichlet clustering, the NormalModel is working very much in the
linear algebra mind-set.  Data vectors are linear algebra vectors, and they
form the domain of the normal distributions.  The internal parameters
of the normal distribution are vectors or matrices, and the notion of
conformability is pretty important as a type check.

On Wed, Jan 13, 2010 at 12:18 PM, Sean Owen <sr...@gmail.com> wrote:

> Can I ask a dumb question -- why is this? Conceptually vectors don't
> have a maximum size. ...
>
> Certainly particular implementations have a notion of maximum size:
> DenseVector. But I'd think it's an implementation-specific possible
> error case.
>
>

Re: CardinalityException in DirichletDriver

Posted by Jake Mannix <ja...@gmail.com>.

On Wed, Jan 13, 2010 at 1:59 PM, Ted Dunning <te...@gmail.com> wrote:

> Unless we go all the way down that road and make SparseMatrix live with the
> same trick, I would be against doing this by default.
>

Certainly - we need to be consistent whichever we do - if we decide to have
our "default vector space" be R^{\inf} instead of R^{0}, we do it for
everything which knows about vector spaces.

  -jake

Re: CardinalityException in DirichletDriver

Posted by Ted Dunning <te...@gmail.com>.

Unless we go all the way down that road and make SparseMatrix live with the
same trick, I would be against doing this by default.

On Wed, Jan 13, 2010 at 1:27 PM, Jake Mannix <ja...@gmail.com> wrote:

> You can certainly "turn it off" by making all of your (Sparse!) Vectors
> be "infinite" dimensional from the start.  I imagine we could do the
> reverse, and have it default to infinite dimensional and only when you
> construct them with explicit dimensions would you instead start doing
> this checking.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: CardinalityException in DirichletDriver

Posted by Jake Mannix <ja...@gmail.com>.

On Wed, Jan 13, 2010 at 12:18 PM, Sean Owen <sr...@gmail.com> wrote:

> Can I ask a dumb question -- why is this? Conceptually vectors don't
> have a maximum size. They just have values at some dimensions and 0
> elsewhere. Dotting two vectors is always well-defined.
>

Conceptually vectors belong to some *fixed* vector space, which by definition
has some fixed dimension (or is infinite dimensional).  Practically speaking,
every finite dimensional vector space of the same dimension is isomorphic,
and they all live as subspaces of an infinite dimensional one, so you can
map them there (conceptually) and dot them there.  But mathematically
speaking it doesn't make sense to dot product vectors from different spaces.

Programmatically, to avoid programmer error, I have in the past sometimes
parametrized the Vector classes I've written with a marker interface
(interface Vector<T extends VectorSpace>), forcing you to only do vector
operations between vectors of the same space.  That gives you compile time
checking that you're not doing something silly (taking a vector which was
projected down to 100 dimensions and dotting it with a vector which lives in
your original 50,000 dimensional "term space", or adding a word-bag vector
to a document-bag vector, etc...).
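
As a rough illustration of the marker-interface idea, here is a sketch under
assumed names.  Only VectorSpace and the shape Vector<T extends VectorSpace>
come from the message; TermSpace, ReducedSpace, and TypedVector are
hypothetical, and a concrete class is used instead of an interface to keep
the example short:

```java
// Marker interface for a vector space; implementations carry no data.
interface VectorSpace {}

// Hypothetical spaces: the original "term space" and a projected one.
final class TermSpace implements VectorSpace {}
final class ReducedSpace implements VectorSpace {}

// A dense vector tagged with the space it lives in.
final class TypedVector<S extends VectorSpace> {
    private final double[] values;

    TypedVector(double... values) {
        this.values = values.clone();
    }

    // Only accepts a vector tagged with the same space S, so mixing
    // spaces is rejected by the compiler rather than at runtime.
    double dot(TypedVector<S> other) {
        if (values.length != other.values.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i] * other.values[i];
        }
        return sum;
    }
}
```

With TypedVector<TermSpace> doc and TypedVector<ReducedSpace> projected, the
call doc.dot(projected) does not compile, because TypedVector<ReducedSpace>
is not a TypedVector<TermSpace>.  The length check remains because the type
tag alone says nothing about the array sizes.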

I eventually found that such checks were great, and did help, but managing
the delicacies of writing APIs which were properly covariant w.r.t. the
VectorSpace typing (for collections of Vectors, and APIs and methods which
took and returned collections of subclasses of vectors, etc etc...) was
more pain than it was worth.

A nice intermediate ground is getting at least runtime checking that you are
not messing up (which is what we have here in Mahout, and what the commons
math folk have, and most everyone).  You can certainly "turn it off" by
making all of your (Sparse!) Vectors be "infinite" dimensional from the
start.  I imagine we could do the reverse, and have it default to infinite
dimensional and only when you construct them with explicit dimensions would
you instead start doing this checking.
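
One way the "reverse" default could look, as a sketch only.  The class
DefaultUnboundedVector, its factory methods, and the UNBOUNDED sentinel are
all hypothetical, and the rule that the check fires only when both operands
have an explicit dimension is an assumption, since the message does not say
how a bounded and an unbounded vector should interact:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sparse vector that is unbounded unless given a dimension.
final class DefaultUnboundedVector {
    private static final int UNBOUNDED = -1;  // sentinel: no declared dimension
    private final int cardinality;
    private final Map<Integer, Double> values = new HashMap<>();

    private DefaultUnboundedVector(int cardinality) {
        this.cardinality = cardinality;
    }

    // Default: "infinite" dimensional, no conformability checking.
    static DefaultUnboundedVector unbounded() {
        return new DefaultUnboundedVector(UNBOUNDED);
    }

    // Explicit dimension: opts in to conformability checking.
    static DefaultUnboundedVector ofDimension(int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("dimension must be positive");
        }
        return new DefaultUnboundedVector(n);
    }

    DefaultUnboundedVector set(int index, double value) {
        if (index < 0 || (cardinality != UNBOUNDED && index >= cardinality)) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        values.put(index, value);
        return this;
    }

    double dot(DefaultUnboundedVector other) {
        // Assumption: check only when both sides declared a dimension.
        boolean bothExplicit =
            cardinality != UNBOUNDED && other.cardinality != UNBOUNDED;
        if (bothExplicit && cardinality != other.cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : values.entrySet()) {
            sum += e.getValue() * other.values.getOrDefault(e.getKey(), 0.0);
        }
        return sum;
    }
}
```

Under this sketch, vectors from unbounded() can always be dotted with each
other, while ofDimension(100) dotted with ofDimension(50000) throws.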

  -jake

> Certainly particular implementations have a notion of maximum size:
> DenseVector. But I'd think it's an implementation-specific possible
> error case.