You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Peter K <pe...@yahoo.de> on 2013/12/02 10:31:01 UTC

Clustering Spatial Data

Hi there,

I've have no experience with mahout but I know that it will solve my
problem :) !

I've the following requirements:

 * No hadoop setup should be necessary. I want a simple approach and I
know this is possible with mahout!
 * I have lots of points (~100 million) but also some RAM (32GB)
 * I know the clusters upfront via its center positions.
 * I need to assign every point to exactly one cluster.
 * Every cluster can have a different shape, area size and point count

I've found:
http://en.wikipedia.org/wiki/OPTICS_algorithm
http://en.wikipedia.org/wiki/DBSCAN

Both algorithms do not really pay attention to the fixed cluster center
but I think I will start there. Is one of them implemented in mahout?

Or do you have another idea or hint/link?

Regards,
Peter.

Re: Clustering Spatial Data

Posted by Ted Dunning <te...@gmail.com>.
Peter,

What you say is a bit confusing to me.

You say you have centers already.  But then you talk about algorithms which
find the centers.

Also, you say you want to assign points based on centers, but you also say
that clusters have different shapes, area, size and point count.  Do you
mean that assignment should be purely based on proximity to the center and
that the shape will be whatever it happens to be as a result?

Or do you mean that there is an a priori known shape that has to be taken
into account during point assignment?

If proximity is the only question, and if you can use great circle distance
as your proximity measure, then this problem is fairly easy and can be
handled in just a few lines of code.  One easy way to handle this is to
convert your centers to normalized x, y, z locations using

   x = cos \lambda cos \phi
   y = cos \lambda sin \phi
   z = sin \lambda

where \lambda is the latitude and \phi is the longitude.  Great circle
distance is monotonically related to Euclidean distance in 3-space and thus
is inversely monotonically related to dot production.

This means you can sort the centers by distance to a point by simply
computing x,y,z for the point and then doing the dot product and sorting in
descending order. The nice thing with this is that there are no trig
functions inside your inner loop.

You can also use the haversine formula, but that requires 3-4 trig
functions in the inner loop and is likely to be slower.

You don't really need Mahout for this at all (unless I completely
misunderstand your problem, which is quite possible).



On Mon, Dec 2, 2013 at 1:31 AM, Peter K <pe...@yahoo.de> wrote:

> Hi there,
>
> I've have no experience with mahout but I know that it will solve my
> problem :) !
>
> I've the following requirements:
>
>  * No hadoop setup should be necessary. I want a simple approach and I
> know this is possible with mahout!
>  * I have lots of points (~100 million) but also some RAM (32GB)
>  * I know the clusters upfront via its center positions.
>  * I need to assign every point to exactly one cluster.
>  * Every cluster can have a different shape, area size and point count
>
> I've found:
> http://en.wikipedia.org/wiki/OPTICS_algorithm
> http://en.wikipedia.org/wiki/DBSCAN
>
> Both algorithms do not really pay attention to the fixed cluster center
> but I think I will start there. Is one of them implemented in mahout?
>
> Or do you have another idea or hint/link?
>
> Regards,
> Peter.
>