Posted to dev@mahout.apache.org by AnilKumar B <ak...@gmail.com> on 2014/07/13 16:19:44 UTC

Vectorization for clustering

Hi All,

I am new to Mahout and would like to understand the following.

Why do Mahout's clustering algorithms require the actual records (JSON,
for example) to be converted into numerical vectors?

When a record has mixed data types and we convert it into a numerical
vector, we may not be able to apply field-wise comparisons, and
maintaining the mapping between the actual record and its vector is also
a problem.

Is numerical vectorization only a performance optimization, or is there
another reason?

Does it make sense to apply clustering directly on actual records?


Thanks & Regards,
B Anil Kumar.

Re: Vectorization for clustering

Posted by AnilKumar B <ak...@gmail.com>.

Thanks for the clarification Ted.

Thanks & Regards,
B Anil Kumar.


On Mon, Jul 14, 2014 at 3:47 AM, Ted Dunning <te...@gmail.com> wrote:

> On Sun, Jul 13, 2014 at 7:19 AM, AnilKumar B <ak...@gmail.com>
> wrote:
>
> > Is it numerical vectorization only for performance optimization? or is
> > there any other reason.
> >
> > Does it make sense to apply clustering directly on actual records?
> >
>
> You can define distance measures on the original data, but you can also,
> in most cases, define numerical vectorizations which allow those same
> distance measures to be calculated on the vectorized form.  Distance
> measures which have complex forms which are not computable in this way
> will, in many cases, defeat clustering algorithms since assumptions about
> the topological space implied by the distance function are often baked into
> these algorithms.
>
> A good example of this is the triangle inequality.  Using Elkan's
> optimization can improve clustering speed by as much as 10x in some cases,
> but if your distance doesn't satisfy this, then the optimization becomes
> incorrect.
>
> On the other hand, it is easy to guarantee that any distance that is
> computed by first vectorizing and then using a standard distance works
> correctly.
>

Re: Vectorization for clustering

Posted by Ted Dunning <te...@gmail.com>.

On Sun, Jul 13, 2014 at 7:19 AM, AnilKumar B <ak...@gmail.com> wrote:

> Is it numerical vectorization only for performance optimization? or is
> there any other reason.
>
> Does it make sense to apply clustering directly on actual records?
>

You can define distance measures on the original data, but you can also,
in most cases, define numerical vectorizations that allow those same
distance measures to be calculated on the vectorized form.  Distance
measures whose forms are too complex to be computed in this way will, in
many cases, defeat clustering algorithms, since assumptions about the
topological space implied by the distance function are often baked into
these algorithms.

A good example of such an assumption is the triangle inequality.  Elkan's
optimization for k-means relies on it and can improve clustering speed by
as much as 10x in some cases, but if your distance doesn't satisfy the
triangle inequality, then the optimization becomes incorrect.
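
To make the triangle-inequality point concrete, here is a minimal Python
sketch. It is not Mahout code and not the full Elkan algorithm, which keeps
upper and lower bounds across iterations. It only applies one of the pruning
tests that algorithm builds on: a center can be skipped when
d(current, other) >= 2 * d(x, current). The function names and the sample
point and centers are made up for illustration. With Euclidean distance the
pruned search agrees with a brute-force search; with squared Euclidean
distance, which does not satisfy the triangle inequality, the same test
skips the center that is actually closest.

```python
import math

def euclidean(a, b):
    # A true metric: satisfies the triangle inequality.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Not a metric: violates the triangle inequality.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def can_skip(dist, x, current, other):
    # Pruning test: if d(current, other) >= 2 * d(x, current), the triangle
    # inequality implies d(x, other) >= d(x, current), so `other` cannot win.
    return dist(current, other) >= 2 * dist(x, current)

def nearest_with_pruning(dist, x, centers):
    best = 0
    for j in range(1, len(centers)):
        if can_skip(dist, x, centers[best], centers[j]):
            continue  # skipped without computing dist(x, centers[j])
        if dist(x, centers[j]) < dist(x, centers[best]):
            best = j
    return best

def nearest_brute_force(dist, x, centers):
    return min(range(len(centers)), key=lambda j: dist(x, centers[j]))

# Hypothetical one-dimensional data chosen to expose the problem.
x = (0.0,)
centers = [(2.0,), (-1.0,)]

for name, dist in [("euclidean", euclidean),
                   ("squared_euclidean", squared_euclidean)]:
    pruned = nearest_with_pruning(dist, x, centers)
    exact = nearest_brute_force(dist, x, centers)
    print(name, "pruned:", pruned, "exact:", exact)
```

The pruning itself is unchanged between the two runs; only the distance
differs. That is the sense in which the optimization "becomes incorrect".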

On the other hand, it is easy to guarantee that any distance that is
computed by first vectorizing and then using a standard distance works
correctly.
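
As an illustration of vectorizing first and then using a standard distance,
the Python sketch below takes a mixed-type record with a numeric field and a
categorical field, defines a field-wise distance on the original records,
and then builds a vectorization for which plain Euclidean distance gives the
same value. The field names, the category list, and the scale factor are
assumptions made up for the example, and this is not Mahout's API. Keeping
the vectors in a dictionary keyed by record id is one simple way to preserve
the mapping between records and vectors.

```python
import math
from itertools import combinations

COUNTRIES = ["IN", "US", "DE"]  # hypothetical set of categories
AGE_SCALE = 10.0                # hypothetical scale for the numeric field

def record_distance(r1, r2):
    # Field-wise distance on the original records:
    # scaled age difference plus a 0/1 mismatch on country.
    age_term = (r1["age"] - r2["age"]) / AGE_SCALE
    country_term = 0.0 if r1["country"] == r2["country"] else 1.0
    return math.sqrt(age_term ** 2 + country_term ** 2)

def vectorize(record):
    # One-hot encode the country. The 1/sqrt(2) weight makes two different
    # categories sit at Euclidean distance exactly 1 from each other.
    w = 1.0 / math.sqrt(2.0)
    one_hot = [w if record["country"] == c else 0.0 for c in COUNTRIES]
    return [record["age"] / AGE_SCALE] + one_hot

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records, keyed by id so each vector maps back to its record.
records = {
    "r1": {"age": 30, "country": "IN"},
    "r2": {"age": 40, "country": "US"},
    "r3": {"age": 35, "country": "IN"},
}
vectors = {rid: vectorize(rec) for rid, rec in records.items()}

for a, b in combinations(records, 2):
    on_records = record_distance(records[a], records[b])
    on_vectors = euclidean(vectors[a], vectors[b])
    print(a, b, round(on_records, 6), round(on_vectors, 6))
```

Because the distance on the vectors is ordinary Euclidean distance, it
satisfies the triangle inequality, so the field-wise distance it reproduces
does too.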