Posted to user@mahout.apache.org by john abbott <ab...@gmail.com> on 2011/02/17 17:48:56 UTC

Clustering assistance, mean shift

Hi,

I was wondering whether someone might be able to help me out.  I'd like to
use Mahout via Elastic MapReduce to cluster some datasets but I'm not sure
I've got the right use case.  I'm hoping someone might be able to comment
and perhaps point me in the direction of some further advice.

I have a dataset which is stored in a database and structured as follows:

Item  Value X  Value Y  Value Z
A     2        4        3
A     3        5        6
A     6        7        9
B     5        8        2
B     2        4        7
...

I would like to create a series of clusters for each item based on the
values of X, Y and Z.  X and Y are geographic coordinates, i.e. real-world
places, and Z is a value observed in those places.  What I'd like to
end up with is (for each Item) a series of clusters saying these values of Z
are coincident at this place (represented by Value X and Y).  I've looked
through and played with the quickstarts and that's all fine, but I'm
wondering:

1.  Is this sort of analysis possible?
2.  How do I convert my numeric data into the correct format to be processed
by a Job?
3.  Any pointers to how I might configure my job in a way that can be
distributed and create a cluster for each item?

Thank you to anyone who might be able to help, I'm really excited to get
started with Mahout but I'm struggling to understand whether it's suitable
and how to get started.

Thanks very much,

John

Re: Clustering assistance, mean shift

Posted by john abbott <ab...@gmail.com>.

Ted and Jeff,

Thanks very much for the advice. I'll make a start on NamedVectors.

Thanks very much again,

John

On Thu, Feb 17, 2011 at 5:49 PM, Ted Dunning <te...@gmail.com> wrote:

> This should be fine.
>
> I recommend not doing too much special with the coordinates except for
> translating to unit vector positions.  This allows standard Euclidean
> metrics to give you the result you want.
>
> You will also need to scale or translate your third variable so that it
> is in the same scale of reference as the first two.  The question you
> should ask yourself is whether a particular amount of change in each
> coordinate represents about the same level of difference in the final
> result.



-- 
John Abbott
Co-Founder
www.oobafit.com
m. 44 (0)7919392754
@scmjea

Re: Clustering assistance, mean shift

Posted by Ted Dunning <te...@gmail.com>.

This should be fine.

I recommend not doing too much special with the coordinates except for
translating to unit vector positions.  This allows standard Euclidean
metrics to give you the result you want.
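
As a rough illustration of the unit-vector suggestion, here is a minimal
Python sketch. It assumes X is a longitude and Y a latitude, both in
degrees, which the original post does not state. The function names are
made up for this example and are not Mahout APIs.

```python
import math

def to_unit_vector(lon_deg, lat_deg):
    """Map a longitude/latitude pair (degrees, assumed) to a point on the unit sphere."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Longitudes 1 and 359 are far apart as raw numbers but close on the sphere,
# so the distance between their unit vectors is small.
near = euclidean(to_unit_vector(1.0, 0.0), to_unit_vector(359.0, 0.0))
# Longitudes 1 and 181 are on opposite sides of the globe.
far = euclidean(to_unit_vector(1.0, 0.0), to_unit_vector(181.0, 0.0))
```

With this representation the straight-line distance grows with the
great-circle separation, so an ordinary Euclidean measure behaves sensibly
across the 0/360 boundary.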

You will also need to scale or translate your third variable so that it
is in the same scale of reference as the first two.  The question you should
ask yourself is whether a particular amount of change in each coordinate
represents about the same level of difference in the final result.
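
One possible way to put the columns on a comparable scale is to
standardize each one to zero mean and unit variance. This is only a
sketch: z-score scaling is an assumption chosen for illustration, and the
right scaling depends on how much a change in Z should count relative to
a change in position. The rows are the item A values from the original
post.

```python
import statistics

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    stdev = statistics.pstdev(column)
    if stdev == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in column]
    return [(v - mean) / stdev for v in column]

# Item A rows from the example data: (X, Y, Z).
rows = [(2, 4, 3), (3, 5, 6), (6, 7, 9)]

# Transpose to columns, scale each column independently, transpose back.
columns = list(zip(*rows))
scaled_columns = [standardize(list(c)) for c in columns]
scaled_rows = list(zip(*scaled_columns))
```

After this step a difference of 1.0 in any coordinate means "one standard
deviation" in that coordinate, which is one concrete answer to the
question of whether equal changes represent equal differences.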

On Thu, Feb 17, 2011 at 9:15 AM, Jeff Eastman <je...@narus.com> wrote:

> Let me first translate your problem a little to make it more tangible.
> Suppose you have data of the following format [Item Longitude Latitude
> Altitude]:
>
> You can convert this into Mahout NamedVectors, where the name=Item and the
> vector values are [Longitude, Latitude, Altitude]. Look in
> utils/src/main/java/m/a/o/clustering/conversion for some example jobs for
> starting points. You will likely need to write your own conversion job to
> create the NamedVectors the way you want from your input data in its
> encoding format.
>
> Now, you can cluster this data using any of the Mahout algorithms, but the
> clustering will treat all your vector elements equally. I get that you
> really want to cluster mostly based on Altitude (cluster all the Lon/Lat
> items which have similar Altitudes). If this is the case then you can use
> one of our WeightedDistanceMeasures to minimize (or eliminate) the effects
> of Lon/Lat and focus mostly (or entirely) on Altitudes. Or, better, you can
> write your own SphericalDistanceMeasure (to deal with the fact that Lon=001
> is quite close to Lon=359, for example).
>
> Hope this helps,
> Jeff

RE: Clustering assistance, mean shift

Posted by Jeff Eastman <je...@Narus.com>.

Let me first translate your problem a little to make it more tangible. Suppose you have data of the following format: [Item Longitude Latitude Altitude].

You can convert this into Mahout NamedVectors, where the name=Item and the vector values are [Longitude, Latitude, Altitude]. Look in utils/src/main/java/m/a/o/clustering/conversion for some example jobs for starting points. You will likely need to write your own conversion job to create the NamedVectors the way you want from your input data in its encoding format. 
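
To make the shape of that conversion concrete, here is a small Python
sketch of the idea only. Mahout's NamedVector is a Java class; the
dataclass below is a hypothetical stand-in with the same role (a name
plus numeric values), and the helper names are invented for this example.
Grouping by name afterwards is one assumed way to get a separate input
set per item.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class NamedVector:
    """Hypothetical stand-in for a named vector: a label plus numeric values."""
    name: str
    values: Tuple[float, ...]

def rows_to_named_vectors(rows):
    """Turn (item, x, y, z) rows into named vectors with name=item."""
    return [NamedVector(item, (float(x), float(y), float(z)))
            for item, x, y, z in rows]

def group_by_name(vectors):
    """Collect vector values per item so each item can be clustered on its own."""
    groups = defaultdict(list)
    for v in vectors:
        groups[v.name].append(v.values)
    return dict(groups)

# Rows from the example data in the original post.
rows = [("A", 2, 4, 3), ("A", 3, 5, 6), ("A", 6, 7, 9),
        ("B", 5, 8, 2), ("B", 2, 4, 7)]
vectors = rows_to_named_vectors(rows)
by_item = group_by_name(vectors)
```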

Now, you can cluster this data using any of the Mahout algorithms, but the clustering will treat all your vector elements equally. I get that you really want to cluster mostly based on Altitude (cluster all the Lon/Lat items which have similar Altitudes). If this is the case then you can use one of our WeightedDistanceMeasures to minimize (or eliminate) the effects of Lon/Lat and focus mostly (or entirely) on Altitudes. Or, better, you can write your own SphericalDistanceMeasure (to deal with the fact that Lon=001 is quite close to Lon=359, for example).
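
The two ideas in that paragraph, per-dimension weights and longitude
wrap-around, can be sketched together in Python. This is an illustration
under assumptions, not Mahout's implementation: the function names and
the weights tuple are hypothetical, longitudes are assumed to be in
degrees, and the wrap-around difference is a simplification rather than a
true great-circle distance.

```python
import math

def lon_delta(lon_a, lon_b):
    """Smallest absolute difference between two longitudes in degrees."""
    d = abs(lon_a - lon_b) % 360.0
    return min(d, 360.0 - d)

def weighted_distance(a, b, weights=(1.0, 1.0, 1.0)):
    """Weighted Euclidean-style distance over (lon, lat, alt) tuples.

    A weight of 0 removes that dimension from the comparison entirely.
    """
    d_lon = lon_delta(a[0], b[0])
    d_lat = a[1] - b[1]
    d_alt = a[2] - b[2]
    w_lon, w_lat, w_alt = weights
    return math.sqrt(w_lon * d_lon ** 2 + w_lat * d_lat ** 2 + w_alt * d_alt ** 2)

# Lon=1 and Lon=359 are treated as 2 degrees apart, not 358.
wrap = lon_delta(1.0, 359.0)
# With zero weight on Lon/Lat only the altitude difference matters.
alt_only = weighted_distance((1.0, 0.0, 10.0), (359.0, 0.0, 13.0),
                             weights=(0.0, 0.0, 1.0))
```

Setting the Lon/Lat weights to small non-zero values instead of zero
keeps some geographic locality while still letting altitude dominate.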

Hope this helps,
Jeff
