Posted to user@mahout.apache.org by Marko Dinic <ma...@nissatech.com> on 2015/01/08 16:00:27 UTC

DTW distance measure and K-medioids, Hierarchical clustering

Hello everyone.

I have a couple of questions.

1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout 
that could be used as a distance measure for clustering?

2) Why isn't there an implementation of K-medoids in Mahout? I'm 
guessing that it could not be implemented efficiently on Hadoop, but I 
wanted to check if something like that is possible.

3) Same question, just considering Agglomerative Hierarchical clustering.

Thanks

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Jan 10, 2015 at 3:02 AM, Marko Dinic <ma...@nissatech.com>
wrote:

> For example, mean of two sinusoids while one of them is shifted by Pi is
> 0. And that's definitely not a good centroid in my case.


Well, if you think that phase shifts represent small distance proportional
to phase difference then the mean of two shifted vectors is the one that
has intermediate phase.  If you normalize away magnitude then you are just
doing distances on a circle.

The definition of the mean is the thing that minimizes squared distances to
every point being averaged.  You have to have a compatible mean definition
to go with your distance definition.
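A quick numeric illustration of both points (a hypothetical NumPy sketch, nothing Mahout-specific): the pointwise mean of a sinusoid and its pi-shifted copy collapses to zero, and the arithmetic mean is exactly the minimizer of summed squared Euclidean distances.

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 100)
a, b = np.sin(t), np.sin(t + np.pi)   # two sinusoids, shifted by pi

# Their pointwise mean is (numerically) zero everywhere -- a useless
# centroid, exactly as described in the thread.
centroid = (a + b) / 2.0
print(np.max(np.abs(centroid)))       # tiny, on the order of 1e-16

def cost(c, points):
    # Sum of squared Euclidean distances from candidate centroid c.
    return sum(np.sum((p - c) ** 2) for p in points)

# The arithmetic mean minimizes this cost: perturbing it can only
# increase the total squared distance.
rng = np.random.default_rng(0)
for _ in range(100):
    perturbed = centroid + 0.01 * rng.standard_normal(t.shape)
    assert cost(perturbed, [a, b]) >= cost(centroid, [a, b])
```

So the mean is only "wrong" relative to a distance it was never paired with; under plain Euclidean distance it is the correct minimizer.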

Depending on how much you are warping your signals, I would tend to look to
an auto-encoder representation rather than clustering to get your
similarity.  The auto-encoder's internal-layer values are typically of
fixed width and often have good local properties for averaging and squared
differences.

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
Anand,

That is a fine idea.  It is called a medoid instead of a mean (
https://en.wikipedia.org/wiki/Medoid )

The basic idea is that for any metric m, you can define the medoid as the
element from the set that minimizes the sum of the distances to the other
elements for that metric.  In one dimension, the L_1-medoid is the median.
The L_2-medoid is the point in the dataset closest to the mean.

As you say, computing the medoid is expensive.  The cost is clearly bounded
by O(n^2) but you might be able to do somewhat better in practice with
clever branching and bounding.  I don't think that there is anything like
an O(n) algorithm for medoid computation.
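That definition is easy to check with a brute-force sketch (hypothetical code; the `medoid` helper below is mine, not a library function): compute each element's summed distance to all the others and take the argmin.

```python
import numpy as np

def medoid(points, metric):
    # Brute-force O(n^2) medoid: the element of `points` minimizing
    # the sum of distances (under `metric`) to every other element.
    totals = [sum(metric(p, q) for q in points) for p in points]
    return points[int(np.argmin(totals))]

data = np.array([1.0, 2.0, 3.0, 10.0])

l1 = lambda x, y: abs(x - y)
sq_l2 = lambda x, y: (x - y) ** 2

# In one dimension the L1 medoid is a median of the data (here 2.0,
# the first of the two middle values).
print(medoid(data, l1))      # -> 2.0

# Under squared Euclidean distance, the medoid is the data point
# closest to the mean (the mean is 4.0, so 3.0 wins).
print(medoid(data, sq_l2))   # -> 3.0
```

The nested loop over all pairs is the O(n^2) cost mentioned above; branch-and-bound tricks can only prune it, not change the worst case.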

On Wed, Jan 14, 2015 at 8:25 PM, Anand Avati <av...@gluster.org> wrote:

> Perhaps you could think of the centroid as one of the signals itself, from
> which the average distance to rest of the signals in the cluster is the
> lowest? This way instead of finding that mythical "mean" of DTWs, you hop
> from one signal to another over iterations as you refine memberships.
> However "mean" calculation would become O(N^2) (within a cluster).
>
> Be warned I have never tried this method.
>
> Thanks
>
> On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic <ma...@nissatech.com>
> wrote:
>
> > Ted,
> >
> > It's because I'm trying to cluster time series that may differ in
> > length, some parts may be shifted, some parts may last longer in one
> > signal than in the other (somehow skewed), but they represent more-less
> > the same thing. DTW seems good because it's optimized for such things
> > (used a lot if speech recognition where problems like this may occur).
> > The problem is that K-means incorporates calculating centroids (and the
> > thing is, I need some kind of centroids that will represent the cluster
> > in the end). Maybe I'm wrong, please correct me, I'm a youngster in
> > this kind of things,  but there is a problem finding centroids for such
> > signals. Distance measures that recalculate centroids (such as
> > Euclidean), calculate pairwise distance between components of two
> > signals (signal and a centroid). How can I use a distance measure for
> > signals with different length? And also, there is shifting and skewing
> > in signals that represent the same thing, and there mean could be
> > totally wrong. For example, mean of two sinusoids while one of them is
> > shifted by Pi is 0. And that's definitely not a good centroid in my
> > case.
> >
> > So, I'm trying to find a scalable solution for my problem, I tried to
> > fit it in what's already implemented in Mahout (for clustering), but
> > it's not so obvious to me.
> >
> > I'm open to suggestions, I'm still new to all of this.
> >
> > Thanks,
> > Marko
> >
> > On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> > > Why is it you can't compute a mean?
> > >
> > >
> > >
> > > On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <marko.dinic@nissatech.com
> >
> > > wrote:
> > >
> > >> Thank you for your answer Ted.
> > >>
> > >> What about some kind of Bisecting k-means? I'm trying to cluster time
> > >> series of different length and I came up to an idea to use DTW as a
> > >> similarity measure, which seems to be adequate, but the thing is, I
> > cannot
> > >> use it with K-means, since it's hard to define centroids based on time
> > >> series which can have different length/phase. So I was thinking about
> > >> Hierarchical clustering, since it seems appropriate to combine with
> DTW,
> > >> but is not scalable, as you said. So my next thought is to try with
> > >> bisecting k-means that seems scalable, since it is based on K-means
> step
> > >> repetitions. My idea is next, by steps:
> > >>
> > >> - Take two signals as initial centroids (maybe two signals that have
> > >> smallest similarity, calculated using DTW)
> > >> - Assign all signals to two initial centroids
> > >> - Repeat the procedure on the biggest cluster
> > >>
> > >> In this way I could use DTW as distance measure, that could be useful
> > >> since my data may be shifted, skewed, and avoid calculating centroids.
> > At
> > >> the end I could take one signal from each cluster that is the most
> > similar
> > >> with others in cluster (some kind of centroid/medioid).
> > >>
> > >> What do you think about this approach and about the scalability?
> > >>
> > >> I would highly appreciate your answer, thanks.
> > >>
> > >> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> > >>
> > >>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <
> marko.dinic@nissatech.com
> > >
> > >>> wrote:
> > >>>
> > >>>   1) Is there an implementation of DTW (Dynamic Time Warping) in
> Mahout
> > >>>> that
> > >>>> could be used as a distance measure for clustering?
> > >>>>
> > >>>>
> > >>> No.
> > >>>
> > >>>
> > >>>
> > >>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm
> > guessing
> > >>>> that it could not be implemented efficiently on Hadoop, but I wanted
> > to
> > >>>> check if something like that is possible.
> > >>>>
> > >>>>
> > >>> Scalability as you suspected.
> > >>>
> > >>>
> > >>>
> > >>>> 3) Same question, just considering Agglomerative Hierarchical
> > clustering.
> > >>>>
> > >>>>
> > >>> Again.  Agglomerative algorithms tend to be n^2 which contradicts
> > scaling.
> > >>>
> > >>>
> > >
> >
>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jan 15, 2015 at 1:47 PM, <ma...@nissatech.com> wrote:

> So, to summarize, my idea of K-medoids with DTW as a distance measure
> (implemented as a two phase mapreduce) doesn't sound like an unrealistic
> idea?
>

Sounds like a fine idea.


> I'm mostly afraid of ending up in a situation where the algorithm needs a
> couple of days to generate clusters, because it is more computationally
> expensive than K-means. Is there a way to predict the size of data that
> could be processed in a reasonable amount of time (let's say a day at most)?
> I understand that this again depends on the resources, but I'm hoping for a
> rough approximation, if possible...
>

I would guess that this can be made to work up to a few million elements.
In particular, if you can figure out how to do fast search in order to get
the most similar items quickly.  You can then filter and re-order using
DTW.  For doing the fast search, I would suggest motif extraction.  See the
SAX work [1] or the clustering stuff I did in the Anomaly Detection booklet
[2].

You could also use ideas related to the DP-means algorithm or some of the
scalable k-means work.  See Shindler, Wong, and Meyerson's papers [3] and [4].


>
> Thank you again, I'm actually excited to start with the implementation, if
> it doesn't sound that crazy.
>

Sounds very plausible.  It will be work to get it really right, but could
be really good.

[1] http://www.cs.ucr.edu/~eamonn/SAX.htm
[2] http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html
[3]
http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
[4] http://dl.acm.org/citation.cfm?id=2133039

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by ma...@nissatech.com.
Dear Ted,

Thank you very much for your answer. It is very inspiring for a
beginner like me to see the effort you put into answering questions
like mine; I'm sure they look trivial. And the whole community
involved is great.

So, to summarize, my idea of K-medoids with DTW as a distance measure
(implemented as a two-phase MapReduce) doesn't sound like an
unrealistic idea?

I'm mostly afraid of ending up in a situation where the algorithm needs
a couple of days to generate clusters, because it is more
computationally expensive than K-means. Is there a way to predict the
size of data that could be processed in a reasonable amount of time
(let's say a day at most)? I understand that this again depends on the
resources, but I'm hoping for a rough approximation, if possible...

Thank you again, I'm actually excited to start with the implementation,
if it doesn't sound that crazy.

I wish you all the best,
Marko



Quoting Ted Dunning <te...@gmail.com>:

> On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic <ma...@nissatech.com>
> wrote:
>
>>
>>
>> Thank you for your answer. Maybe I made a wrong picture about my data when
>> giving sinusoid as an example, my time series are not periodic.
>
>
> I merely continued with that example because it illustrates how an
> alternative metric makes all sinusoids look very similar.
>
>
>> Let's say that I have a signal that represents value of power when some
>> device is turned on. That power signal depends of the time when person
>> turns on that device. The device can be turned on after different interval
>> and can stay turned on for different durations. But all that represents a
>> similar behaviour that can be grouped together. I'm simplifying things a
>> bit, but that's the concept.
>>
>
> Sure.  But mean and medoid are still something you can define here.
>
>
>> Either way, I made a simple test in RapidMiner and used K-mediods with DTW
>> as a similarity measure on a small subset and it gave very satisfying
>> results (please don't laugh at my approach, I'm a beginner :) ).
>>
>
> Not laughing.  Sounds like a fine result.
>
>
>>
>> Now, in one of the previous mails you have answered that K-mediods isn't
>> scalable, so it isn't implemented. Does scalable means not possible, or not
>> such a good option? Why isn't it scalable? Does it take too much time for
>> execution?
>>
>
> Scalable means that as the data gets bigger, the amount of resources gets
> at most bigger in linear proportion.
>
>
>>
>> I've got an idea to implement it as two phase mapreduce, like this
>>
>
> Yeah... this is pretty standard,
>
> BUT
>
> That takes a number of iterations that is unknown and which depends on the
> size of your data.  Each pass costs O(n).  Having many passes means that
> you are super-linear in cost.
>
> For k-means, you can create a sketch in one pass over the data and then
> munch on the sketch to get a clustering.  This gives an O(n + k log n +
> small goo) answer.



Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jan 15, 2015 at 3:50 AM, Marko Dinic <ma...@nissatech.com>
wrote:

>
>
> Thank you for your answer. Maybe I made a wrong picture about my data when
> giving sinusoid as an example, my time series are not periodic.


I merely continued with that example because it illustrates how an
alternative metric makes all sinusoids look very similar.


> Let's say that I have a signal that represents value of power when some
> device is turned on. That power signal depends of the time when person
> turns on that device. The device can be turned on after different interval
> and can stay turned on for different durations. But all that represents a
> similar behaviour that can be grouped together. I'm simplifying things a
> bit, but that's the concept.
>

Sure.  But mean and medoid are still something you can define here.


> Either way, I made a simple test in RapidMiner and used K-mediods with DTW
> as a similarity measure on a small subset and it gave very satisfying
> results (please don't laugh at my approach, I'm a beginner :) ).
>

Not laughing.  Sounds like a fine result.


>
> Now, in one of the previous mails you have answered that K-mediods isn't
> scalable, so it isn't implemented. Does scalable means not possible, or not
> such a good option? Why isn't it scalable? Does it take too much time for
> execution?
>

Scalable means that as the data gets bigger, the amount of resources
needed grows at most in linear proportion.


>
> I've got an idea to implement it as two phase mapreduce, like this
>

Yeah... this is pretty standard,

BUT

That takes a number of iterations that is unknown and which depends on the
size of your data.  Each pass costs O(n).  Having many passes means that
you are super-linear in cost.

For k-means, you can create a sketch in one pass over the data and then
munch on the sketch to get a clustering.  This gives an O(n + k log n +
small goo) answer.
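The "sketch in one pass, then munch on the sketch" idea can be illustrated with a toy, much-simplified sketch (hypothetical code, not the algorithm from the cited papers): absorb each streamed point into the nearest weighted representative, and when the sketch overflows, double the merge threshold and recompress the representatives themselves.

```python
import random

def d2(a, b):
    # Squared Euclidean distance between two points (tuples).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def absorb(sketch, p, w, thresh):
    # Merge point p (weight w) into the nearest existing representative
    # if one lies within the threshold; otherwise open a new one.
    for entry in sketch:
        if d2(entry[0], p) < thresh:
            entry[1] += w
            return
    sketch.append([p, w])

def one_pass_sketch(stream, max_size, init_thresh=1.0):
    # One pass over the data: maintain a short list of weighted
    # representatives [point, weight]. On overflow, double the
    # threshold and recompress the sketch from its own representatives
    # -- still without a second pass over the data.
    sketch, thresh = [], init_thresh
    for p in stream:
        absorb(sketch, p, 1, thresh)
        while len(sketch) > max_size:
            thresh *= 2.0
            old, sketch = sketch, []
            for q, w in old:
                absorb(sketch, q, w, thresh)
    return sketch

random.seed(0)
centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 5.0)]
stream = []
for _ in range(2000):
    cx, cy = random.choice(centers)
    stream.append((cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5)))

sketch = one_pass_sketch(stream, max_size=30)
# A handful of weighted representatives now stand in for 2000 points;
# even a quadratic clusterer is cheap to run on them afterwards.
print(len(sketch), sum(w for _, w in sketch))
```

The point of the sketch is only to shrink n: the expensive clustering then runs on at most `max_size` weighted points instead of the whole stream.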

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Marko Dinic <ma...@nissatech.com>.
Ted,

Thank you for your answer. Maybe I gave the wrong picture of my data 
by using a sinusoid as an example; my time series are not periodic. 
Let's say that I have a signal that represents the power drawn when some 
device is turned on. That power signal depends on the time when a person 
turns on the device. The device can be turned on after different 
intervals and can stay turned on for different durations. But all of that 
represents similar behaviour that can be grouped together. I'm 
simplifying things a bit, but that's the concept.

Either way, I made a simple test in RapidMiner and used K-medoids with 
DTW as a similarity measure on a small subset, and it gave very 
satisfying results (please don't laugh at my approach, I'm a beginner 
:) ).

Now, in one of the previous mails you answered that K-medoids 
isn't scalable, so it isn't implemented. Does "not scalable" mean not 
possible, or just not such a good option? Why isn't it scalable? Does it 
take too much time to execute?

I've got an idea to implement it as a two-phase MapReduce, like this:

Pick initial centroids.

1st phase:
  map: find the distance from each node to its nearest centroid using 
  DTW, and emit the pair (clusterId, pointId)
  reduce: for each cluster, output (oneOfTheNodesInClusterId, 
  listOfAllNodesInThatCluster)

2nd phase:
  map: calculate the distance from each point to all the other points in 
  its cluster, as if it were the centroid, and emit 
  (pointAndItsClusterId, costIfThisPointIsCentroid)
  reduce: find the point in each cluster that results in the smallest 
  error for that cluster

Repeat until convergence or a fixed number of iterations.

 This is the first thing that came to my mind; there's probably a 
better solution. If it's not scalable because of too many calculations, 
how much time could I expect such an algorithm to take for 10,000 
signals with 300 points each, for example? How can I even estimate that?
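For what it's worth, those two phases can be sketched on a single machine in plain Python (hypothetical code with a textbook DTW; the function names are mine, not a Mahout or Hadoop API):

```python
import random

def dtw(a, b):
    # Textbook O(len(a) * len(b)) dynamic time warping distance
    # between two numeric sequences of possibly different lengths.
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def k_medoids_dtw(signals, k, iters=10, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(signals)), k)
    clusters = {}
    for _ in range(iters):
        # Phase 1: assign each signal to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, s in enumerate(signals):
            nearest = min(medoids, key=lambda m: dtw(s, signals[m]))
            clusters[nearest].append(i)
        # Phase 2: within each cluster, the member with the smallest
        # summed DTW distance to the others becomes the new medoid.
        new_medoids = []
        for m, members in clusters.items():
            if not members:          # keep the medoid of an empty cluster
                new_medoids.append(m)
                continue
            best = min(members, key=lambda i: sum(
                dtw(signals[i], signals[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break                    # converged
        medoids = new_medoids
    return medoids, clusters

# Toy data: ramps and constant signals of different lengths.
ramps = [[0.1 * x for x in range(n)] for n in (8, 10, 12)]
flats = [[5.0] * n for n in (8, 10, 12)]
meds, clus = k_medoids_dtw(ramps + flats, k=2)
```

As a back-of-the-envelope estimate: each DTW call on two 300-point signals costs roughly 300^2 = 9*10^4 operations, and a naive medoid update over a cluster of size c costs c^2 DTW calls, so 10,000 signals in a few large clusters can mean on the order of 10^12 operations per iteration on one core; that is exactly why the O(n^2) concern is real.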

Thanks for your effort, if you have time to answer.

Regards,
Marko

On Thu 15 Jan 2015 05:25:55 AM CET, Anand Avati wrote:
> Perhaps you could think of the centroid as one of the signals itself, from
> which the average distance to rest of the signals in the cluster is the
> lowest? This way instead of finding that mythical "mean" of DTWs, you hop
> from one signal to another over iterations as you refine memberships.
> However "mean" calculation would become O(N^2) (within a cluster).
>
> Be warned I have never tried this method.
>
> Thanks
>
> On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic <ma...@nissatech.com>
> wrote:
>
>> Ted,
>>
>> It's because I'm trying to cluster time series that may differ in
>> length, some parts may be shifted, some parts may last longer in one
>> signal than in the other (somehow skewed), but they represent more-less
>> the same thing. DTW seems good because it's optimized for such things
>> (used a lot if speech recognition where problems like this may occur).
>> The problem is that K-means incorporates calculating centroids (and the
>> thing is, I need some kind of centroids that will represent the cluster
>> in the end). Maybe I'm wrong, please correct me, I'm a youngster in
>> this kind of things,  but there is a problem finding centroids for such
>> signals. Distance measures that recalculate centroids (such as
>> Euclidean), calculate pairwise distance between components of two
>> signals (signal and a centroid). How can I use a distance measure for
>> signals with different length? And also, there is shifting and skewing
>> in signals that represent the same thing, and there mean could be
>> totally wrong. For example, mean of two sinusoids while one of them is
>> shifted by Pi is 0. And that's definitely not a good centroid in my
>> case.
>>
>> So, I'm trying to find a scalable solution for my problem, I tried to
>> fit it in what's already implemented in Mahout (for clustering), but
>> it's not so obvious to me.
>>
>> I'm open to suggestions, I'm still new to all of this.
>>
>> Thanks,
>> Marko
>>
>> On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
>>> Why is it you can't compute a mean?
>>>
>>>
>>>
>>> On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <ma...@nissatech.com>
>>> wrote:
>>>
>>>> Thank you for your answer Ted.
>>>>
>>>> What about some kind of Bisecting k-means? I'm trying to cluster time
>>>> series of different length and I came up to an idea to use DTW as a
>>>> similarity measure, which seems to be adequate, but the thing is, I
>> cannot
>>>> use it with K-means, since it's hard to define centroids based on time
>>>> series which can have different length/phase. So I was thinking about
>>>> Hierarchical clustering, since it seems appropriate to combine with DTW,
>>>> but is not scalable, as you said. So my next thought is to try with
>>>> bisecting k-means that seems scalable, since it is based on K-means step
>>>> repetitions. My idea is next, by steps:
>>>>
>>>> - Take two signals as initial centroids (maybe two signals that have
>>>> smallest similarity, calculated using DTW)
>>>> - Assign all signals to two initial centroids
>>>> - Repeat the procedure on the biggest cluster
>>>>
>>>> In this way I could use DTW as distance measure, that could be useful
>>>> since my data may be shifted, skewed, and avoid calculating centroids.
>> At
>>>> the end I could take one signal from each cluster that is the most
>> similar
>>>> with others in cluster (some kind of centroid/medioid).
>>>>
>>>> What do you think about this approach and about the scalability?
>>>>
>>>> I would highly appreciate your answer, thanks.
>>>>
>>>> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
>>>>
>>>>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <marko.dinic@nissatech.com
>>>
>>>>> wrote:
>>>>>
>>>>>    1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
>>>>>> that
>>>>>> could be used as a distance measure for clustering?
>>>>>>
>>>>>>
>>>>> No.
>>>>>
>>>>>
>>>>>
>>>>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm
>> guessing
>>>>>> that it could not be implemented efficiently on Hadoop, but I wanted
>> to
>>>>>> check if something like that is possible.
>>>>>>
>>>>>>
>>>>> Scalability as you suspected.
>>>>>
>>>>>
>>>>>
>>>>>> 3) Same question, just considering Agglomerative Hierarchical
>> clustering.
>>>>>>
>>>>>>
>>>>> Again.  Agglomerative algorithms tend to be n^2 which contradicts
>> scaling.
>>>>>
>>>>>
>>>
>>
>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Anand Avati <av...@gluster.org>.
Perhaps you could think of the centroid as one of the signals itself,
the one from which the average distance to the rest of the signals in the
cluster is the lowest? This way, instead of finding that mythical "mean"
of DTWs, you hop from one signal to another over iterations as you refine
memberships. However, the "mean" calculation would become O(N^2) (within
a cluster).

Be warned I have never tried this method.

Thanks

On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic <ma...@nissatech.com>
wrote:

> Ted,
>
> It's because I'm trying to cluster time series that may differ in
> length, some parts may be shifted, some parts may last longer in one
> signal than in the other (somehow skewed), but they represent more-less
> the same thing. DTW seems good because it's optimized for such things
> (used a lot if speech recognition where problems like this may occur).
> The problem is that K-means incorporates calculating centroids (and the
> thing is, I need some kind of centroids that will represent the cluster
> in the end). Maybe I'm wrong, please correct me, I'm a youngster in
> this kind of things,  but there is a problem finding centroids for such
> signals. Distance measures that recalculate centroids (such as
> Euclidean), calculate pairwise distance between components of two
> signals (signal and a centroid). How can I use a distance measure for
> signals with different length? And also, there is shifting and skewing
> in signals that represent the same thing, and there mean could be
> totally wrong. For example, mean of two sinusoids while one of them is
> shifted by Pi is 0. And that's definitely not a good centroid in my
> case.
>
> So, I'm trying to find a scalable solution for my problem, I tried to
> fit it in what's already implemented in Mahout (for clustering), but
> it's not so obvious to me.
>
> I'm open to suggestions, I'm still new to all of this.
>
> Thanks,
> Marko
>
> On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> > Why is it you can't compute a mean?
> >
> >
> >
> > On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <ma...@nissatech.com>
> > wrote:
> >
> >> Thank you for your answer Ted.
> >>
> >> What about some kind of Bisecting k-means? I'm trying to cluster time
> >> series of different length and I came up to an idea to use DTW as a
> >> similarity measure, which seems to be adequate, but the thing is, I
> cannot
> >> use it with K-means, since it's hard to define centroids based on time
> >> series which can have different length/phase. So I was thinking about
> >> Hierarchical clustering, since it seems appropriate to combine with DTW,
> >> but is not scalable, as you said. So my next thought is to try with
> >> bisecting k-means that seems scalable, since it is based on K-means step
> >> repetitions. My idea is next, by steps:
> >>
> >> - Take two signals as initial centroids (maybe two signals that have
> >> smallest similarity, calculated using DTW)
> >> - Assign all signals to two initial centroids
> >> - Repeat the procedure on the biggest cluster
> >>
> >> In this way I could use DTW as distance measure, that could be useful
> >> since my data may be shifted, skewed, and avoid calculating centroids.
> At
> >> the end I could take one signal from each cluster that is the most
> similar
> >> with others in cluster (some kind of centroid/medioid).
> >>
> >> What do you think about this approach and about the scalability?
> >>
> >> I would highly appreciate your answer, thanks.
> >>
> >> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> >>
> >>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <marko.dinic@nissatech.com
> >
> >>> wrote:
> >>>
> >>>   1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
> >>>> that
> >>>> could be used as a distance measure for clustering?
> >>>>
> >>>>
> >>> No.
> >>>
> >>>
> >>>
> >>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm
> guessing
> >>>> that it could not be implemented efficiently on Hadoop, but I wanted
> to
> >>>> check if something like that is possible.
> >>>>
> >>>>
> >>> Scalability as you suspected.
> >>>
> >>>
> >>>
> >>>> 3) Same question, just considering Agglomerative Hierarchical
> clustering.
> >>>>
> >>>>
> >>> Again.  Agglomerative algorithms tend to be n^2 which contradicts
> scaling.
> >>>
> >>>
> >
>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Marko Dinic <ma...@nissatech.com>.
Ted,

It's because I'm trying to cluster time series that may differ in 
length; some parts may be shifted, some parts may last longer in one 
signal than in the other (somehow skewed), but they represent more or 
less the same thing. DTW seems good because it's optimized for such 
things (it's used a lot in speech recognition, where problems like this 
occur). The problem is that K-means involves calculating centroids (and 
the thing is, I need some kind of centroids that will represent the 
clusters in the end). Maybe I'm wrong, please correct me, I'm a 
newcomer to this kind of thing, but there is a problem finding centroids 
for such signals. Distance measures used when recalculating centroids 
(such as Euclidean) calculate pairwise distances between the components 
of two signals (a signal and a centroid). How can I use such a distance 
measure for signals of different lengths? Also, there is shifting and 
skewing in signals that represent the same thing, so their mean could be 
totally wrong. For example, the mean of two sinusoids where one of them 
is shifted by pi is 0. And that's definitely not a good centroid in my 
case.

So, I'm trying to find a scalable solution for my problem. I tried to 
fit it into what's already implemented in Mahout (for clustering), but 
it's not so obvious to me.

I'm open to suggestions, I'm still new to all of this.

Thanks,
Marko

On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> Why is it you can't compute a mean?
>
>
>
> On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <ma...@nissatech.com>
> wrote:
>
>> Thank you for your answer Ted.
>>
>> What about some kind of Bisecting k-means? I'm trying to cluster time
>> series of different length and I came up to an idea to use DTW as a
>> similarity measure, which seems to be adequate, but the thing is, I cannot
>> use it with K-means, since it's hard to define centroids based on time
>> series which can have different length/phase. So I was thinking about
>> Hierarchical clustering, since it seems appropriate to combine with DTW,
>> but is not scalable, as you said. So my next thought is to try with
>> bisecting k-means that seems scalable, since it is based on K-means step
>> repetitions. My idea is next, by steps:
>>
>> - Take two signals as initial centroids (maybe two signals that have
>> smallest similarity, calculated using DTW)
>> - Assign all signals to two initial centroids
>> - Repeat the procedure on the biggest cluster
>>
>> In this way I could use DTW as distance measure, that could be useful
>> since my data may be shifted, skewed, and avoid calculating centroids. At
>> the end I could take one signal from each cluster that is the most similar
>> with others in cluster (some kind of centroid/medioid).
>>
>> What do you think about this approach and about the scalability?
>>
>> I would highly appreciate your answer, thanks.
>>
>> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
>>
>>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <ma...@nissatech.com>
>>> wrote:
>>>
>>>   1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
>>>> that
>>>> could be used as a distance measure for clustering?
>>>>
>>>>
>>> No.
>>>
>>>
>>>
>>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm guessing
>>>> that it could not be implemented efficiently on Hadoop, but I wanted to
>>>> check if something like that is possible.
>>>>
>>>>
>>> Scalability as you suspected.
>>>
>>>
>>>
>>>> 3) Same question, just considering Agglomerative Hierarchical clustering.
>>>>
>>>>
>>> Again.  Agglomerative algorithms tend to be n^2 which contradicts scaling.
>>>
>>>
>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
Why is it you can't compute a mean?



On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <ma...@nissatech.com>
wrote:

> Thank you for your answer Ted.
>
> What about some kind of Bisecting k-means? I'm trying to cluster time
> series of different length and I came up to an idea to use DTW as a
> similarity measure, which seems to be adequate, but the thing is, I cannot
> use it with K-means, since it's hard to define centroids based on time
> series which can have different length/phase. So I was thinking about
> Hierarchical clustering, since it seems appropriate to combine with DTW,
> but is not scalable, as you said. So my next thought is to try with
> bisecting k-means that seems scalable, since it is based on K-means step
> repetitions. My idea is next, by steps:
>
> - Take two signals as initial centroids (maybe two signals that have
> smallest similarity, calculated using DTW)
> - Assign all signals to two initial centroids
> - Repeat the procedure on the biggest cluster
>
> In this way I could use DTW as distance measure, that could be useful
> since my data may be shifted, skewed, and avoid calculating centroids. At
> the end I could take one signal from each cluster that is the most similar
> with others in cluster (some kind of centroid/medioid).
>
> What do you think about this approach and about the scalability?
>
> I would highly appreciate your answer, thanks.
>
> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
>
>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <ma...@nissatech.com>
>> wrote:
>>
>>  1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout
>>> that
>>> could be used as a distance measure for clustering?
>>>
>>>
>> No.
>>
>>
>>
>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm guessing
>>> that it could not be implemented efficiently on Hadoop, but I wanted to
>>> check if something like that is possible.
>>>
>>>
>> Scalability as you suspected.
>>
>>
>>
>>> 3) Same question, just considering Agglomerative Hierarchical clustering.
>>>
>>>
>> Again.  Agglomerative algorithms tend to be n^2 which contradicts scaling.
>>
>>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Marko Dinic <ma...@nissatech.com>.
Thank you for your answer Ted.

What about some kind of bisecting K-means? I'm trying to cluster time 
series of different lengths and I came up with the idea to use DTW as a 
similarity measure, which seems adequate, but the thing is, I 
cannot use it with K-means, since it's hard to define centroids based 
on time series which can have different lengths/phases. So I was thinking 
about hierarchical clustering, since it seems appropriate to combine 
with DTW, but it is not scalable, as you said. So my next thought is to 
try bisecting K-means, which seems scalable, since it is based on 
repetitions of the K-means step. My idea is as follows, step by step:

- Take two signals as initial centroids (maybe two signals that have 
smallest similarity, calculated using DTW)
- Assign all signals to two initial centroids
- Repeat the procedure on the biggest cluster

In this way I could use DTW as the distance measure, which could be 
useful since my data may be shifted or skewed, and I'd avoid calculating 
centroids. At the end I could take from each cluster the signal that is 
most similar to the others in that cluster (some kind of centroid/medoid).

What do you think about this approach and about the scalability?

I would highly appreciate your answer, thanks.

On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <ma...@nissatech.com>
> wrote:
>
>> 1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
>> could be used as a distance measure for clustering?
>>
>
> No.
>
>
>>
>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm guessing
>> that it could not be implemented efficiently on Hadoop, but I wanted to
>> check if something like that is possible.
>>
>
> Scalability as you suspected.
>
>
>>
>> 3) Same question, just considering Agglomerative Hierarchical clustering.
>>
>
> Again.  Agglomerative algorithms tend to be n^2 which contradicts scaling.
>

Re: DTW distance measure and K-medioids, Hierarchical clustering

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <ma...@nissatech.com>
wrote:

> 1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that
> could be used as a distance measure for clustering?
>

No.


>
> 2) Why isn't there an implementation of K-mediods in Mahout? I'm guessing
> that it could not be implemented efficiently on Hadoop, but I wanted to
> check if something like that is possible.
>

Scalability as you suspected.


>
> 3) Same question, just considering Agglomerative Hierarchical clustering.
>

Again.  Agglomerative algorithms tend to be n^2 which contradicts scaling.