Posted to user@mahout.apache.org by Vasil Vasilev <va...@gmail.com> on 2011/01/18 17:10:38 UTC

Meanshift clustering

Hello Mahout developers,

Currently I am trying to get more in depth with the clustering algorithms -
how they should be used and tuned.
For this purpose I decided to learn from the source code of the different
implementations.
In this respect I have the following questions about the Meanshift algorithm
(sorry if they sound naive, but I am a novice in the area):

1. I noted that the way it is implemented is different from the
straightforward approach described in the paper (
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf).
Later I learned from Jira MAHOUT-15 that this was done to enable
parallelism. There I also noticed that T2 should be fixed to 1.0.
It seems to me that T2 should be correlated with the convergence
delta parameter (which by default is 0.5) and should be slightly higher than
it. Is my assumption correct?

2. With the current implementation the user has the option to select the
desired distance measure, but does not have the flexibility to select a
kernel. The current approach results in a hard-coded conical kernel with
radius T1, and no points outside T1 are considered in the path calculation of
the canopy. Would it be possible to slightly modify the algorithm (similar to
the modification from kmeans to fuzzy kmeans) so that weights are associated
with any point that touches the canopy, with these weights drawn from the
kernel function? For example, they could be drawn from a normal distribution.
Do you think the possibility of kernel selection could positively impact
clustering with meanshift in some cases?
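To make the idea concrete, here is a rough sketch of the kind of
kernel-weighted shift I have in mind (plain Java, not Mahout code; the
Gaussian kernel and the helper names are only for illustration):

// Sketch only: each point contributes to the new centre with a kernel weight
// instead of the hard 0/1 "within T1" test of the conical window.
static double[] weightedShift(double[][] points, double[] center, double h) {
  double[] shifted = new double[center.length];
  double totalWeight = 0.0;
  for (double[] p : points) {
    double d = euclideanDistance(p, center);        // any distance measure
    double w = Math.exp(-0.5 * (d / h) * (d / h));  // e.g. Gaussian kernel, bandwidth h
    totalWeight += w;
    for (int j = 0; j < p.length; j++) {
      shifted[j] += w * p[j];
    }
  }
  for (int j = 0; j < shifted.length; j++) {
    shifted[j] /= totalWeight;                      // weighted mean becomes the new centre
  }
  return shifted;
}

static double euclideanDistance(double[] a, double[] b) {
  double sum = 0.0;
  for (int i = 0; i < a.length; i++) {
    double diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}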

Regards, Vasil

Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Mahout developers,

As suggested above, I wrote up my experience clustering the synthetic
control data with the different algorithms.
Could you give me your opinion and suggestions? Also, where is the best place
to post it on the wiki?

https://docs.google.com/leaf?id=0B8qKJ44gVr7QMDEyMTlkY2YtNDQ1OC00MmQ3LWFmZTUtOGIxNGZiODFhOGYx&hl=en

Regards, Vasil

On Thu, Jan 27, 2011 at 1:06 PM, Vasil Vasilev <va...@gmail.com> wrote:

> I created: https://issues.apache.org/jira/browse/MAHOUT-597
>
> Regards, Vasil
>
>
> On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <go...@gmail.com> wrote:
>
>> No, please make a new JIRA.
>>
>> Lance
>>
>> On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <va...@gmail.com>
>> wrote:
>> > It is added to https://issues.apache.org/jira/browse/MAHOUT-15
>> >
>> > I hope this was the correct place to add it!
>> >
>> > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <go...@gmail.com>
>> wrote:
>> >
>> >> Sure, please. Please include hints about the various cases where it
>> >> works (or does not).
>> >>
>> >> Thanks!
>> >>
>> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <va...@gmail.com>
>> >> wrote:
>> >> > Hello,
>> >> >
>> >> > In fact my estimation technique works only for the Meanshift
>> algorithm,
>> >> > because in it the centers of the canopies are moving around. With
>> pure
>> >> > Canopy clustering I don't think it will work.
>> >> >
>> >> > I spent some time trying to realize the idea of different kernel
>> profiles
>> >> > with the meanshift clustering algorithm and here are the results:
>> >> > 1. I changed a little bit the original algorithm. Previously when one
>> >> > cluster "touched" other clusters its centroid was computed only based
>> on
>> >> the
>> >> > centroids of the clusters it touched, but not based on his centroid
>> >> itself.
>> >> > Now befor calculating the shift I added one more line which makes the
>> >> > cluster observe his personal centroid:
>> >> >
>> >> > public boolean shiftToMean(MeanShiftCanopy canopy) {
>> >> >    canopy.observe(canopy.getCenter(),
>> canopy.getBoundPoints().size());
>> >> >    canopy.computeConvergence(measure, convergenceDelta);
>> >> >    canopy.computeParameters();
>> >> >    return canopy.isConverged();
>> >> >  }
>> >> >
>> >> > With this modification also the problem with convergence of the
>> algorithm
>> >> > (that I described above) disappeared, although the number of
>> iterations
>> >> > until convergence increased slightly.
>> >> > This change was necessary in order to introduce the other types of
>> "soft"
>> >> > kernels.
>> >> >
>> >> > 2. I introduced IKernelProfile interface which has the methods:
>> >> > public double calculateValue(double distance, double h);
>> >> >    public double calculateDerivativeValue(double distance, double h);
>> >> >
>> >> > 3. I created 2 implementations:
>> >> > TriangularKernelProfile with calculated value:
>> >> > @Override
>> >> >    public double calculateDerivativeValue(double distance, double h)
>> {
>> >> >        return (distance < h) ? 1.0 : 0.0;
>> >> >    }
>> >> >
>> >> > and NormalKernelProfile with calculated value:
>> >> > @Override
>> >> >    public double calculateDerivativeValue(double distance, double h)
>> {
>> >> >        return UncommonDistributions.dNorm(distance, 0.0, h);
>> >> >    }
>> >> >
>> >> > 4. I modified the core for merging canopies:
>> >> >
>> >> > public void mergeCanopy(MeanShiftCanopy aCanopy,
>> >> Collection<MeanShiftCanopy>
>> >> > canopies) {
>> >> >    MeanShiftCanopy closestCoveringCanopy = null;
>> >> >    double closestNorm = Double.MAX_VALUE;
>> >> >    for (MeanShiftCanopy canopy : canopies) {
>> >> >      double norm = measure.distance(canopy.getCenter(),
>> >> > aCanopy.getCenter());
>> >> >      double weight = kernelProfile.calculateDerivativeValue(norm,
>> t1);
>> >> >      if(weight > 0.0)
>> >> >      {
>> >> >          aCanopy.touch(canopy, weight);
>> >> >      }
>> >> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
>> >> > closestNorm))) {
>> >> >        closestNorm = norm;
>> >> >        closestCoveringCanopy = canopy;
>> >> >      }
>> >> >    }
>> >> >    if (closestCoveringCanopy == null) {
>> >> >      canopies.add(aCanopy);
>> >> >    } else {
>> >> >      closestCoveringCanopy.merge(aCanopy);
>> >> >    }
>> >> >  }
>> >> >
>> >> > 5. I modified MeanShiftCanopy
>> >> > void touch(MeanShiftCanopy canopy, double weight) {
>> >> >    canopy.observe(getCenter(), weight*((double)boundPoints.size()));
>> >> >    observe(canopy.getCenter(),
>> >> weight*((double)canopy.boundPoints.size()));
>> >> >  }
>> >> >
>> >> > 6. I modified some other files which were necessary to compile the
>> code
>> >> and
>> >> > for the tests to pass
>> >> >
>> >> > With the so changed algorithm I had the following observations:
>> >> >
>> >> > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the
>> cluster
>> >> > content is more intermixed
>> >> > 2. As there is no convergence problem any more I found the following
>> >> > procedure for estimating T2 and convergence delta:
>> >> >  - bind convergence delta = T2 / 2
>> >> >  - When you decrease T2 the number of iterations increases and the
>> number
>> >> of
>> >> > resulting clusters after convergence decreases
>> >> >  - You come to a moment when you decrease T2 the number of iterations
>> >> > increases, but the number of resulting clusters remains unchanged.
>> This
>> >> is
>> >> > the point with the best value for T2
>> >> > 3. In case of Normal kernel what you give as T1 is in fact the
>> standard
>> >> > deviation of a normal distribution, so the radius of the window will
>> be
>> >> T1^2
>> >> >
>> >> > If you are interested I can send the code.
>> >> >
>> >> > Regards, Vasil
>> >> >
>> >> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com>
>> >> wrote:
>> >> >
>> >> >> Vasil-
>> >> >>
>> >> >> Would you consider adding your estimation algorithm to this patch?
>> >> >> https://issues.apache.org/jira/browse/MAHOUT-563
>> >> >>
>> >> >> The estimator in there now is stupid- a real one would make the
>> Canopy
>> >> >> algorithms orders of magnitude more useful.
>> >> >>
>> >> >> Lance
>> >> >>
>> >> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <ted.dunning@gmail.com
>> >
>> >> >> wrote:
>> >> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <
>> vavasilev@gmail.com>
>> >> >> wrote:
>> >> >> >
>> >> >> >>
>> >> >> >> dimension 1: Using linear regression with gradient descent
>> algorithm
>> >> I
>> >> >> find
>> >> >> >> what is the trend of the line, i.e. is it increasing, decreasing
>> or
>> >> >> >> straight
>> >> >> >> line
>> >> >> >> dimension 2: Knowing the approximating line (from the linear
>> >> regression)
>> >> >> I
>> >> >> >> count how many times this line gets crossed by the original
>> signal.
>> >> This
>> >> >> >> helps in separating the cyclic data from all the rest
>> >> >> >> dimension 3: What is the biggest increase/decrease of a single
>> signal
>> >> >> line.
>> >> >> >> This helps find shifts
>> >> >> >>
>> >> >> >> So to say - I put a semantics for the data that are to be
>> clustered
>> >> (I
>> >> >> >> don't
>> >> >> >> know if it is correct to do that, but I couldn't think of how an
>> >> >> algorithm
>> >> >> >> could cope with the task without such additional semantics)
>> >> >> >>
>> >> >> >
>> >> >> > It is very common for feature extraction like this to be the key
>> for
>> >> >> > data-mining projects.   Such features are absolutely critical for
>> most
>> >> >> time
>> >> >> > series mining and are highly application dependent.
>> >> >> >
>> >> >> > One key aspect of your features is that they are shift invariant.
>> >> >> >
>> >> >> >
>> >> >> >> Also I developed a small swing application which visualizes the
>> >> >> clustered
>> >> >> >> signals and which helped me in playing with the algorithms.
>> >> >> >>
>> >> >> >
>> >> >> > Great idea.
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Lance Norskog
>> >> >> goksron@gmail.com
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>

Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
I created: https://issues.apache.org/jira/browse/MAHOUT-597

Regards, Vasil

On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <go...@gmail.com> wrote:

> No, please make a new JIRA.
>
> Lance
>
> On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <va...@gmail.com>
> wrote:
> > It is added to https://issues.apache.org/jira/browse/MAHOUT-15
> >
> > I hope this was the correct place to add it!
> >
> > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> Sure, please. Please include hints about the various cases where it
> >> works (or does not).
> >>
> >> Thanks!
> >>
> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <va...@gmail.com>
> >> wrote:
> >> > Hello,
> >> >
> >> > In fact my estimation technique works only for the Meanshift
> algorithm,
> >> > because in it the centers of the canopies are moving around. With pure
> >> > Canopy clustering I don't think it will work.
> >> >
> >> > I spent some time trying to realize the idea of different kernel
> profiles
> >> > with the meanshift clustering algorithm and here are the results:
> >> > 1. I changed a little bit the original algorithm. Previously when one
> >> > cluster "touched" other clusters its centroid was computed only based
> on
> >> the
> >> > centroids of the clusters it touched, but not based on his centroid
> >> itself.
> >> > Now befor calculating the shift I added one more line which makes the
> >> > cluster observe his personal centroid:
> >> >
> >> > public boolean shiftToMean(MeanShiftCanopy canopy) {
> >> >    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >> >    canopy.computeConvergence(measure, convergenceDelta);
> >> >    canopy.computeParameters();
> >> >    return canopy.isConverged();
> >> >  }
> >> >
> >> > With this modification also the problem with convergence of the
> algorithm
> >> > (that I described above) disappeared, although the number of
> iterations
> >> > until convergence increased slightly.
> >> > This change was necessary in order to introduce the other types of
> "soft"
> >> > kernels.
> >> >
> >> > 2. I introduced IKernelProfile interface which has the methods:
> >> > public double calculateValue(double distance, double h);
> >> >    public double calculateDerivativeValue(double distance, double h);
> >> >
> >> > 3. I created 2 implementations:
> >> > TriangularKernelProfile with calculated value:
> >> > @Override
> >> >    public double calculateDerivativeValue(double distance, double h) {
> >> >        return (distance < h) ? 1.0 : 0.0;
> >> >    }
> >> >
> >> > and NormalKernelProfile with calculated value:
> >> > @Override
> >> >    public double calculateDerivativeValue(double distance, double h) {
> >> >        return UncommonDistributions.dNorm(distance, 0.0, h);
> >> >    }
> >> >
> >> > 4. I modified the core for merging canopies:
> >> >
> >> > public void mergeCanopy(MeanShiftCanopy aCanopy,
> >> Collection<MeanShiftCanopy>
> >> > canopies) {
> >> >    MeanShiftCanopy closestCoveringCanopy = null;
> >> >    double closestNorm = Double.MAX_VALUE;
> >> >    for (MeanShiftCanopy canopy : canopies) {
> >> >      double norm = measure.distance(canopy.getCenter(),
> >> > aCanopy.getCenter());
> >> >      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >> >      if(weight > 0.0)
> >> >      {
> >> >          aCanopy.touch(canopy, weight);
> >> >      }
> >> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
> >> > closestNorm))) {
> >> >        closestNorm = norm;
> >> >        closestCoveringCanopy = canopy;
> >> >      }
> >> >    }
> >> >    if (closestCoveringCanopy == null) {
> >> >      canopies.add(aCanopy);
> >> >    } else {
> >> >      closestCoveringCanopy.merge(aCanopy);
> >> >    }
> >> >  }
> >> >
> >> > 5. I modified MeanShiftCanopy
> >> > void touch(MeanShiftCanopy canopy, double weight) {
> >> >    canopy.observe(getCenter(), weight*((double)boundPoints.size()));
> >> >    observe(canopy.getCenter(),
> >> weight*((double)canopy.boundPoints.size()));
> >> >  }
> >> >
> >> > 6. I modified some other files which were necessary to compile the
> code
> >> and
> >> > for the tests to pass
> >> >
> >> > With the so changed algorithm I had the following observations:
> >> >
> >> > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
> >> > content is more intermixed
> >> > 2. As there is no convergence problem any more I found the following
> >> > procedure for estimating T2 and convergence delta:
> >> >  - bind convergence delta = T2 / 2
> >> >  - When you decrease T2 the number of iterations increases and the
> number
> >> of
> >> > resulting clusters after convergence decreases
> >> >  - You come to a moment when you decrease T2 the number of iterations
> >> > increases, but the number of resulting clusters remains unchanged.
> This
> >> is
> >> > the point with the best value for T2
> >> > 3. In case of Normal kernel what you give as T1 is in fact the
> standard
> >> > deviation of a normal distribution, so the radius of the window will
> be
> >> T1^2
> >> >
> >> > If you are interested I can send the code.
> >> >
> >> > Regards, Vasil
> >> >
> >> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com>
> >> wrote:
> >> >
> >> >> Vasil-
> >> >>
> >> >> Would you consider adding your estimation algorithm to this patch?
> >> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >> >>
> >> >> The estimator in there now is stupid- a real one would make the
> Canopy
> >> >> algorithms orders of magnitude more useful.
> >> >>
> >> >> Lance
> >> >>
> >> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com>
> >> >> wrote:
> >> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <
> vavasilev@gmail.com>
> >> >> wrote:
> >> >> >
> >> >> >>
> >> >> >> dimension 1: Using linear regression with gradient descent
> algorithm
> >> I
> >> >> find
> >> >> >> what is the trend of the line, i.e. is it increasing, decreasing
> or
> >> >> >> straight
> >> >> >> line
> >> >> >> dimension 2: Knowing the approximating line (from the linear
> >> regression)
> >> >> I
> >> >> >> count how many times this line gets crossed by the original
> signal.
> >> This
> >> >> >> helps in separating the cyclic data from all the rest
> >> >> >> dimension 3: What is the biggest increase/decrease of a single
> signal
> >> >> line.
> >> >> >> This helps find shifts
> >> >> >>
> >> >> >> So to say - I put a semantics for the data that are to be
> clustered
> >> (I
> >> >> >> don't
> >> >> >> know if it is correct to do that, but I couldn't think of how an
> >> >> algorithm
> >> >> >> could cope with the task without such additional semantics)
> >> >> >>
> >> >> >
> >> >> > It is very common for feature extraction like this to be the key
> for
> >> >> > data-mining projects.   Such features are absolutely critical for
> most
> >> >> time
> >> >> > series mining and are highly application dependent.
> >> >> >
> >> >> > One key aspect of your features is that they are shift invariant.
> >> >> >
> >> >> >
> >> >> >> Also I developed a small swing application which visualizes the
> >> >> clustered
> >> >> >> signals and which helped me in playing with the algorithms.
> >> >> >>
> >> >> >
> >> >> > Great idea.
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Lance Norskog
> >> >> goksron@gmail.com
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Meanshift clustering

Posted by Lance Norskog <go...@gmail.com>.
No, please make a new JIRA.

Lance

On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <va...@gmail.com> wrote:
> It is added to https://issues.apache.org/jira/browse/MAHOUT-15
>
> I hope this was the correct place to add it!
>
> On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <go...@gmail.com> wrote:
>
>> Sure, please. Please include hints about the various cases where it
>> works (or does not).
>>
>> Thanks!
>>
>> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <va...@gmail.com>
>> wrote:
>> > Hello,
>> >
>> > In fact my estimation technique works only for the Meanshift algorithm,
>> > because in it the centers of the canopies are moving around. With pure
>> > Canopy clustering I don't think it will work.
>> >
>> > I spent some time trying to realize the idea of different kernel profiles
>> > with the meanshift clustering algorithm and here are the results:
>> > 1. I changed a little bit the original algorithm. Previously when one
>> > cluster "touched" other clusters its centroid was computed only based on
>> the
>> > centroids of the clusters it touched, but not based on his centroid
>> itself.
>> > Now befor calculating the shift I added one more line which makes the
>> > cluster observe his personal centroid:
>> >
>> > public boolean shiftToMean(MeanShiftCanopy canopy) {
>> >    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>> >    canopy.computeConvergence(measure, convergenceDelta);
>> >    canopy.computeParameters();
>> >    return canopy.isConverged();
>> >  }
>> >
>> > With this modification also the problem with convergence of the algorithm
>> > (that I described above) disappeared, although the number of iterations
>> > until convergence increased slightly.
>> > This change was necessary in order to introduce the other types of "soft"
>> > kernels.
>> >
>> > 2. I introduced IKernelProfile interface which has the methods:
>> > public double calculateValue(double distance, double h);
>> >    public double calculateDerivativeValue(double distance, double h);
>> >
>> > 3. I created 2 implementations:
>> > TriangularKernelProfile with calculated value:
>> > @Override
>> >    public double calculateDerivativeValue(double distance, double h) {
>> >        return (distance < h) ? 1.0 : 0.0;
>> >    }
>> >
>> > and NormalKernelProfile with calculated value:
>> > @Override
>> >    public double calculateDerivativeValue(double distance, double h) {
>> >        return UncommonDistributions.dNorm(distance, 0.0, h);
>> >    }
>> >
>> > 4. I modified the core for merging canopies:
>> >
>> > public void mergeCanopy(MeanShiftCanopy aCanopy,
>> Collection<MeanShiftCanopy>
>> > canopies) {
>> >    MeanShiftCanopy closestCoveringCanopy = null;
>> >    double closestNorm = Double.MAX_VALUE;
>> >    for (MeanShiftCanopy canopy : canopies) {
>> >      double norm = measure.distance(canopy.getCenter(),
>> > aCanopy.getCenter());
>> >      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>> >      if(weight > 0.0)
>> >      {
>> >          aCanopy.touch(canopy, weight);
>> >      }
>> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
>> > closestNorm))) {
>> >        closestNorm = norm;
>> >        closestCoveringCanopy = canopy;
>> >      }
>> >    }
>> >    if (closestCoveringCanopy == null) {
>> >      canopies.add(aCanopy);
>> >    } else {
>> >      closestCoveringCanopy.merge(aCanopy);
>> >    }
>> >  }
>> >
>> > 5. I modified MeanShiftCanopy
>> > void touch(MeanShiftCanopy canopy, double weight) {
>> >    canopy.observe(getCenter(), weight*((double)boundPoints.size()));
>> >    observe(canopy.getCenter(),
>> weight*((double)canopy.boundPoints.size()));
>> >  }
>> >
>> > 6. I modified some other files which were necessary to compile the code
>> and
>> > for the tests to pass
>> >
>> > With the so changed algorithm I had the following observations:
>> >
>> > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
>> > content is more intermixed
>> > 2. As there is no convergence problem any more I found the following
>> > procedure for estimating T2 and convergence delta:
>> >  - bind convergence delta = T2 / 2
>> >  - When you decrease T2 the number of iterations increases and the number
>> of
>> > resulting clusters after convergence decreases
>> >  - You come to a moment when you decrease T2 the number of iterations
>> > increases, but the number of resulting clusters remains unchanged. This
>> is
>> > the point with the best value for T2
>> > 3. In case of Normal kernel what you give as T1 is in fact the standard
>> > deviation of a normal distribution, so the radius of the window will be
>> T1^2
>> >
>> > If you are interested I can send the code.
>> >
>> > Regards, Vasil
>> >
>> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com>
>> wrote:
>> >
>> >> Vasil-
>> >>
>> >> Would you consider adding your estimation algorithm to this patch?
>> >> https://issues.apache.org/jira/browse/MAHOUT-563
>> >>
>> >> The estimator in there now is stupid- a real one would make the Canopy
>> >> algorithms orders of magnitude more useful.
>> >>
>> >> Lance
>> >>
>> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com>
>> >> wrote:
>> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com>
>> >> wrote:
>> >> >
>> >> >>
>> >> >> dimension 1: Using linear regression with gradient descent algorithm
>> I
>> >> find
>> >> >> what is the trend of the line, i.e. is it increasing, decreasing or
>> >> >> straight
>> >> >> line
>> >> >> dimension 2: Knowing the approximating line (from the linear
>> regression)
>> >> I
>> >> >> count how many times this line gets crossed by the original signal.
>> This
>> >> >> helps in separating the cyclic data from all the rest
>> >> >> dimension 3: What is the biggest increase/decrease of a single signal
>> >> line.
>> >> >> This helps find shifts
>> >> >>
>> >> >> So to say - I put a semantics for the data that are to be clustered
>> (I
>> >> >> don't
>> >> >> know if it is correct to do that, but I couldn't think of how an
>> >> algorithm
>> >> >> could cope with the task without such additional semantics)
>> >> >>
>> >> >
>> >> > It is very common for feature extraction like this to be the key for
>> >> > data-mining projects.   Such features are absolutely critical for most
>> >> time
>> >> > series mining and are highly application dependent.
>> >> >
>> >> > One key aspect of your features is that they are shift invariant.
>> >> >
>> >> >
>> >> >> Also I developed a small swing application which visualizes the
>> >> clustered
>> >> >> signals and which helped me in playing with the algorithms.
>> >> >>
>> >> >
>> >> > Great idea.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lance Norskog
>> >> goksron@gmail.com
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
It is added to https://issues.apache.org/jira/browse/MAHOUT-15

I hope this was the correct place to add it!

On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <go...@gmail.com> wrote:

> Sure, please. Please include hints about the various cases where it
> works (or does not).
>
> Thanks!
>
> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <va...@gmail.com>
> wrote:
> > Hello,
> >
> > In fact my estimation technique works only for the Meanshift algorithm,
> > because in it the centers of the canopies are moving around. With pure
> > Canopy clustering I don't think it will work.
> >
> > I spent some time trying to realize the idea of different kernel profiles
> > with the meanshift clustering algorithm and here are the results:
> > 1. I changed a little bit the original algorithm. Previously when one
> > cluster "touched" other clusters its centroid was computed only based on
> the
> > centroids of the clusters it touched, but not based on his centroid
> itself.
> > Now befor calculating the shift I added one more line which makes the
> > cluster observe his personal centroid:
> >
> > public boolean shiftToMean(MeanShiftCanopy canopy) {
> >    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
> >    canopy.computeConvergence(measure, convergenceDelta);
> >    canopy.computeParameters();
> >    return canopy.isConverged();
> >  }
> >
> > With this modification also the problem with convergence of the algorithm
> > (that I described above) disappeared, although the number of iterations
> > until convergence increased slightly.
> > This change was necessary in order to introduce the other types of "soft"
> > kernels.
> >
> > 2. I introduced IKernelProfile interface which has the methods:
> > public double calculateValue(double distance, double h);
> >    public double calculateDerivativeValue(double distance, double h);
> >
> > 3. I created 2 implementations:
> > TriangularKernelProfile with calculated value:
> > @Override
> >    public double calculateDerivativeValue(double distance, double h) {
> >        return (distance < h) ? 1.0 : 0.0;
> >    }
> >
> > and NormalKernelProfile with calculated value:
> > @Override
> >    public double calculateDerivativeValue(double distance, double h) {
> >        return UncommonDistributions.dNorm(distance, 0.0, h);
> >    }
> >
> > 4. I modified the core for merging canopies:
> >
> > public void mergeCanopy(MeanShiftCanopy aCanopy,
> Collection<MeanShiftCanopy>
> > canopies) {
> >    MeanShiftCanopy closestCoveringCanopy = null;
> >    double closestNorm = Double.MAX_VALUE;
> >    for (MeanShiftCanopy canopy : canopies) {
> >      double norm = measure.distance(canopy.getCenter(),
> > aCanopy.getCenter());
> >      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
> >      if(weight > 0.0)
> >      {
> >          aCanopy.touch(canopy, weight);
> >      }
> >      if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
> > closestNorm))) {
> >        closestNorm = norm;
> >        closestCoveringCanopy = canopy;
> >      }
> >    }
> >    if (closestCoveringCanopy == null) {
> >      canopies.add(aCanopy);
> >    } else {
> >      closestCoveringCanopy.merge(aCanopy);
> >    }
> >  }
> >
> > 5. I modified MeanShiftCanopy
> > void touch(MeanShiftCanopy canopy, double weight) {
> >    canopy.observe(getCenter(), weight*((double)boundPoints.size()));
> >    observe(canopy.getCenter(),
> weight*((double)canopy.boundPoints.size()));
> >  }
> >
> > 6. I modified some other files which were necessary to compile the code
> and
> > for the tests to pass
> >
> > With the so changed algorithm I had the following observations:
> >
> > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
> > content is more intermixed
> > 2. As there is no convergence problem any more I found the following
> > procedure for estimating T2 and convergence delta:
> >  - bind convergence delta = T2 / 2
> >  - When you decrease T2 the number of iterations increases and the number
> of
> > resulting clusters after convergence decreases
> >  - You come to a moment when you decrease T2 the number of iterations
> > increases, but the number of resulting clusters remains unchanged. This
> is
> > the point with the best value for T2
> > 3. In case of Normal kernel what you give as T1 is in fact the standard
> > deviation of a normal distribution, so the radius of the window will be
> T1^2
> >
> > If you are interested I can send the code.
> >
> > Regards, Vasil
> >
> > On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com>
> wrote:
> >
> >> Vasil-
> >>
> >> Would you consider adding your estimation algorithm to this patch?
> >> https://issues.apache.org/jira/browse/MAHOUT-563
> >>
> >> The estimator in there now is stupid- a real one would make the Canopy
> >> algorithms orders of magnitude more useful.
> >>
> >> Lance
> >>
> >> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com>
> >> wrote:
> >> >
> >> >>
> >> >> dimension 1: Using linear regression with gradient descent algorithm
> I
> >> find
> >> >> what is the trend of the line, i.e. is it increasing, decreasing or
> >> >> straight
> >> >> line
> >> >> dimension 2: Knowing the approximating line (from the linear
> regression)
> >> I
> >> >> count how many times this line gets crossed by the original signal.
> This
> >> >> helps in separating the cyclic data from all the rest
> >> >> dimension 3: What is the biggest increase/decrease of a single signal
> >> line.
> >> >> This helps find shifts
> >> >>
> >> >> So to say - I put a semantics for the data that are to be clustered
> (I
> >> >> don't
> >> >> know if it is correct to do that, but I couldn't think of how an
> >> algorithm
> >> >> could cope with the task without such additional semantics)
> >> >>
> >> >
> >> > It is very common for feature extraction like this to be the key for
> >> > data-mining projects.   Such features are absolutely critical for most
> >> time
> >> > series mining and are highly application dependent.
> >> >
> >> > One key aspect of your features is that they are shift invariant.
> >> >
> >> >
> >> >> Also I developed a small swing application which visualizes the
> >> clustered
> >> >> signals and which helped me in playing with the algorithms.
> >> >>
> >> >
> >> > Great idea.
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Meanshift clustering

Posted by Lance Norskog <go...@gmail.com>.
Sure, please. Please include hints about the various cases where it
works (or does not).

Thanks!

On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <va...@gmail.com> wrote:
> Hello,
>
> In fact my estimation technique works only for the Meanshift algorithm,
> because in it the centers of the canopies are moving around. With pure
> Canopy clustering I don't think it will work.
>
> I spent some time trying to realize the idea of different kernel profiles
> with the meanshift clustering algorithm and here are the results:
> 1. I changed a little bit the original algorithm. Previously when one
> cluster "touched" other clusters its centroid was computed only based on the
> centroids of the clusters it touched, but not based on his centroid itself.
> Now befor calculating the shift I added one more line which makes the
> cluster observe his personal centroid:
>
> public boolean shiftToMean(MeanShiftCanopy canopy) {
>    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
>    canopy.computeConvergence(measure, convergenceDelta);
>    canopy.computeParameters();
>    return canopy.isConverged();
>  }
>
> With this modification also the problem with convergence of the algorithm
> (that I described above) disappeared, although the number of iterations
> until convergence increased slightly.
> This change was necessary in order to introduce the other types of "soft"
> kernels.
>
> 2. I introduced IKernelProfile interface which has the methods:
> public double calculateValue(double distance, double h);
>    public double calculateDerivativeValue(double distance, double h);
>
> 3. I created 2 implementations:
> TriangularKernelProfile with calculated value:
> @Override
>    public double calculateDerivativeValue(double distance, double h) {
>        return (distance < h) ? 1.0 : 0.0;
>    }
>
> and NormalKernelProfile with calculated value:
> @Override
>    public double calculateDerivativeValue(double distance, double h) {
>        return UncommonDistributions.dNorm(distance, 0.0, h);
>    }
>
> 4. I modified the core for merging canopies:
>
> public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy>
> canopies) {
>    MeanShiftCanopy closestCoveringCanopy = null;
>    double closestNorm = Double.MAX_VALUE;
>    for (MeanShiftCanopy canopy : canopies) {
>      double norm = measure.distance(canopy.getCenter(),
> aCanopy.getCenter());
>      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
>      if(weight > 0.0)
>      {
>          aCanopy.touch(canopy, weight);
>      }
>      if (norm < t2 && ((closestCoveringCanopy == null) || (norm <
> closestNorm))) {
>        closestNorm = norm;
>        closestCoveringCanopy = canopy;
>      }
>    }
>    if (closestCoveringCanopy == null) {
>      canopies.add(aCanopy);
>    } else {
>      closestCoveringCanopy.merge(aCanopy);
>    }
>  }
>
> 5. I modified MeanShiftCanopy
> void touch(MeanShiftCanopy canopy, double weight) {
>    canopy.observe(getCenter(), weight*((double)boundPoints.size()));
>    observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size()));
>  }
>
> 6. I modified some other files which were necessary to compile the code and
> for the tests to pass
>
> With the so changed algorithm I had the following observations:
>
> 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster
> content is more intermixed
> 2. As there is no convergence problem any more I found the following
> procedure for estimating T2 and convergence delta:
>  - bind convergence delta = T2 / 2
>  - When you decrease T2 the number of iterations increases and the number of
> resulting clusters after convergence decreases
>  - You come to a moment when you decrease T2 the number of iterations
> increases, but the number of resulting clusters remains unchanged. This is
> the point with the best value for T2
> 3. In case of Normal kernel what you give as T1 is in fact the standard
> deviation of a normal distribution, so the radius of the window will be T1^2
>
> If you are interested I can send the code.
>
> Regards, Vasil
>
> On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com> wrote:
>
>> Vasil-
>>
>> Would you consider adding your estimation algorithm to this patch?
>> https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> The estimator in there now is stupid- a real one would make the Canopy
>> algorithms orders of magnitude more useful.
>>
>> Lance
>>
>> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com>
>> wrote:
>> >
>> >>
>> >> dimension 1: Using linear regression with gradient descent algorithm I
>> find
>> >> what is the trend of the line, i.e. is it increasing, decreasing or
>> >> straight
>> >> line
>> >> dimension 2: Knowing the approximating line (from the linear regression)
>> I
>> >> count how many times this line gets crossed by the original signal. This
>> >> helps in separating the cyclic data from all the rest
>> >> dimension 3: What is the biggest increase/decrease of a single signal
>> line.
>> >> This helps find shifts
>> >>
>> >> So to say - I put a semantics for the data that are to be clustered (I
>> >> don't
>> >> know if it is correct to do that, but I couldn't think of how an
>> algorithm
>> >> could cope with the task without such additional semantics)
>> >>
>> >
>> > It is very common for feature extraction like this to be the key for
>> > data-mining projects.   Such features are absolutely critical for most
>> time
>> > series mining and are highly application dependent.
>> >
>> > One key aspect of your features is that they are shift invariant.
>> >
>> >
>> >> Also I developed a small swing application which visualizes the
>> clustered
>> >> signals and which helped me in playing with the algorithms.
>> >>
>> >
>> > Great idea.
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hello,

In fact my estimation technique works only for the Meanshift algorithm,
because in it the centers of the canopies are moving around. With pure
Canopy clustering I don't think it will work.

I spent some time trying to realize the idea of different kernel profiles
with the meanshift clustering algorithm and here are the results:
1. I changed the original algorithm a little. Previously, when one cluster
"touched" other clusters, its centroid was computed only from the centroids
of the clusters it touched, not from its own centroid. Now, before
calculating the shift, I added one more line which makes the cluster observe
its own centroid:

public boolean shiftToMean(MeanShiftCanopy canopy) {
    // new line: the canopy also observes its own centroid, weighted by its bound point count
    canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size());
    canopy.computeConvergence(measure, convergenceDelta);
    canopy.computeParameters();
    return canopy.isConverged();
  }

With this modification the convergence problem of the algorithm (which I
described above) also disappeared, although the number of iterations until
convergence increased slightly.
This change was necessary in order to introduce the other types of "soft"
kernels.

2. I introduced an IKernelProfile interface which has the methods:
    public double calculateValue(double distance, double h);
    public double calculateDerivativeValue(double distance, double h);

3. I created 2 implementations:
TriangularKernelProfile, whose derivative value is:
    @Override
    public double calculateDerivativeValue(double distance, double h) {
        return (distance < h) ? 1.0 : 0.0;
    }

and NormalKernelProfile, whose derivative value is:
    @Override
    public double calculateDerivativeValue(double distance, double h) {
        return UncommonDistributions.dNorm(distance, 0.0, h);
    }

4. I modified the core method for merging canopies:

public void mergeCanopy(MeanShiftCanopy aCanopy,
                        Collection<MeanShiftCanopy> canopies) {
    MeanShiftCanopy closestCoveringCanopy = null;
    double closestNorm = Double.MAX_VALUE;
    for (MeanShiftCanopy canopy : canopies) {
      double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter());
      // kernel weight replaces the hard "within T1" test for touching
      double weight = kernelProfile.calculateDerivativeValue(norm, t1);
      if (weight > 0.0) {
        aCanopy.touch(canopy, weight);
      }
      // merging within T2 is unchanged
      if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) {
        closestNorm = norm;
        closestCoveringCanopy = canopy;
      }
    }
    if (closestCoveringCanopy == null) {
      canopies.add(aCanopy);
    } else {
      closestCoveringCanopy.merge(aCanopy);
    }
  }

5. I modified MeanShiftCanopy's touch method so each side observes the other
with the kernel weight applied:
void touch(MeanShiftCanopy canopy, double weight) {
    canopy.observe(getCenter(), weight * ((double) boundPoints.size()));
    observe(canopy.getCenter(), weight * ((double) canopy.boundPoints.size()));
  }

6. I modified some other files as necessary for the code to compile and the
tests to pass.

With the algorithm changed in this way, I made the following observations:

1. Using the NormalKernel "blurs" the cluster boundaries, i.e. the cluster
contents are more intermixed.
2. As there is no longer a convergence problem, I found the following
procedure for estimating T2 and the convergence delta:
 - bind convergence delta = T2 / 2
 - when you decrease T2, the number of iterations increases and the number of
resulting clusters after convergence decreases
 - you reach a point where decreasing T2 further still increases the number
of iterations, but the number of resulting clusters remains unchanged. This
is the best value for T2
3. In the case of the Normal kernel, what you give as T1 is in fact the
standard deviation of a normal distribution, so the radius of the window will
be T1^2.
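To make observation 3 concrete: the weight the normal profile returns at
distance d is just the density of a normal distribution with mean 0 and
standard deviation h (here h = T1). Assuming that is what
UncommonDistributions.dNorm(d, 0.0, h) computes, a standalone equivalent
would be:

// Standalone equivalent of the normal profile's weight, assuming dNorm(d, 0.0, h)
// is the N(0, h) density: largest at d = 0 and decaying smoothly with distance.
static double normalWeight(double distance, double h) {
  double z = distance / h;
  return Math.exp(-0.5 * z * z) / (h * Math.sqrt(2.0 * Math.PI));
}

By contrast, the triangular profile above gives every point within T1 the
same weight of 1.0, which reproduces the current hard window.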

If you are interested I can send the code.

Regards, Vasil

On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <go...@gmail.com> wrote:

> Vasil-
>
> Would you consider adding your estimation algorithm to this patch?
> https://issues.apache.org/jira/browse/MAHOUT-563
>
> The estimator in there now is stupid- a real one would make the Canopy
> algorithms orders of magnitude more useful.
>
> Lance
>
> On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com>
> wrote:
> >
> >>
> >> dimension 1: Using linear regression with gradient descent algorithm I
> find
> >> what is the trend of the line, i.e. is it increasing, decreasing or
> >> straight
> >> line
> >> dimension 2: Knowing the approximating line (from the linear regression)
> I
> >> count how many times this line gets crossed by the original signal. This
> >> helps in separating the cyclic data from all the rest
> >> dimension 3: What is the biggest increase/decrease of a single signal
> line.
> >> This helps find shifts
> >>
> >> So to say - I put a semantics for the data that are to be clustered (I
> >> don't
> >> know if it is correct to do that, but I couldn't think of how an
> algorithm
> >> could cope with the task without such additional semantics)
> >>
> >
> > It is very common for feature extraction like this to be the key for
> > data-mining projects.   Such features are absolutely critical for most
> time
> > series mining and are highly application dependent.
> >
> > One key aspect of your features is that they are shift invariant.
> >
> >
> >> Also I developed a small swing application which visualizes the
> clustered
> >> signals and which helped me in playing with the algorithms.
> >>
> >
> > Great idea.
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Meanshift clustering

Posted by Lance Norskog <go...@gmail.com>.
Vasil-

Would you consider adding your estimation algorithm to this patch?
https://issues.apache.org/jira/browse/MAHOUT-563

The estimator in there now is stupid; a real one would make the Canopy
algorithms orders of magnitude more useful.

Lance

On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <te...@gmail.com> wrote:
> On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com> wrote:
>
>>
>> dimension 1: Using linear regression with gradient descent algorithm I find
>> what is the trend of the line, i.e. is it increasing, decreasing or
>> straight
>> line
>> dimension 2: Knowing the approximating line (from the linear regression) I
>> count how many times this line gets crossed by the original signal. This
>> helps in separating the cyclic data from all the rest
>> dimension 3: What is the biggest increase/decrease of a single signal line.
>> This helps find shifts
>>
>> So to say - I put a semantics for the data that are to be clustered (I
>> don't
>> know if it is correct to do that, but I couldn't think of how an algorithm
>> could cope with the task without such additional semantics)
>>
>
> It is very common for feature extraction like this to be the key for
> data-mining projects.   Such features are absolutely critical for most time
> series mining and are highly application dependent.
>
> One key aspect of your features is that they are shift invariant.
>
>
>> Also I developed a small swing application which visualizes the clustered
>> signals and which helped me in playing with the algorithms.
>>
>
> Great idea.
>



-- 
Lance Norskog
goksron@gmail.com

Re: Meanshift clustering

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <va...@gmail.com> wrote:

>
> dimension 1: Using linear regression with gradient descent algorithm I find
> what is the trend of the line, i.e. is it increasing, decreasing or
> straight
> line
> dimension 2: Knowing the approximating line (from the linear regression) I
> count how many times this line gets crossed by the original signal. This
> helps in separating the cyclic data from all the rest
> dimension 3: What is the biggest increase/decrease of a single signal line.
> This helps find shifts
>
> So to say - I put a semantics for the data that are to be clustered (I
> don't
> know if it is correct to do that, but I couldn't think of how an algorithm
> could cope with the task without such additional semantics)
>

It is very common for feature extraction like this to be the key for
data-mining projects.   Such features are absolutely critical for most time
series mining and are highly application dependent.

One key aspect of your features is that they are shift invariant.


> Also I developed a small swing application which visualizes the clustered
> signals and which helped me in playing with the algorithms.
>

Great idea.

Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Jeff,

Ok, I can do that - I can write about my experience on the wiki, or maybe
better, write it first in a document and then send it to you for review.

About the vectorization of the data: in the current example each value of
the control signal is assigned to a separate dimension. Bearing in mind that
this data is sampled from real-world measurements, the oscillations should be
Gaussian distributed around the ideal line (for example, the ideal line could
be a straight line or an oscillating line, depending on the type of signal).
That's why I suppose the Gaussian cluster model with the Dirichlet algorithm
worked well in determining the number of clusters. Unfortunately the data in
these clusters were very intermixed, i.e. I wouldn't say it clustered the
data correctly.

That's why I decided to create another type of vectorization that would
separate the data well and would also work with the distance-measure-based
algorithms (like kmeans, for example). For each sample line I produced a
vector with 3 dimensions:

dimension 1: using linear regression with a gradient descent algorithm I find
the trend of the line, i.e. whether it is increasing, decreasing or straight
dimension 2: knowing the approximating line (from the linear regression) I
count how many times this line gets crossed by the original signal. This
helps separate the cyclic data from all the rest
dimension 3: the biggest increase/decrease within a single signal line. This
helps find shifts

So to say, I imposed some semantics on the data to be clustered (I don't
know if it is correct to do that, but I couldn't think of how an algorithm
could cope with the task without such additional semantics).
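A rough sketch of this vectorization over a plain double[] signal (not the
exact code I used - the fit below is closed-form least squares rather than
gradient descent, and the third feature is read as the biggest step-to-step
change):

// Hypothetical sketch of the 3-dimensional vectorization described above.
static double[] extractFeatures(double[] signal) {
  int n = signal.length;  // assumes n >= 2

  // dimension 1: trend of the signal, via a least-squares line y = a + b*x
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (int i = 0; i < n; i++) {
    sx += i;
    sy += signal[i];
    sxx += (double) i * i;
    sxy += i * signal[i];
  }
  double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope = trend
  double a = (sy - b * sx) / n;

  // dimension 2: how many times the signal crosses the fitted line (flags cyclic data)
  // dimension 3: biggest single-step increase/decrease (flags shifts)
  int crossings = 0;
  double maxJump = 0.0;
  double prevResidual = signal[0] - a;
  for (int i = 1; i < n; i++) {
    double residual = signal[i] - (a + b * i);
    if (residual * prevResidual < 0) {
      crossings++;
    }
    prevResidual = residual;
    maxJump = Math.max(maxJump, Math.abs(signal[i] - signal[i - 1]));
  }
  return new double[] {b, crossings, maxJump};
}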

I also developed a small Swing application which visualizes the clustered
signals and which helped me play with the algorithms.

About the implementation of kernel selection in meanshift: I was wondering
what the best way to test the results is, because the algorithm already copes
almost perfectly with clustering the signals. I also didn't find much in the
papers cited on the wiki and in the jira about what effect the kernel has in
different cases. Have you seen any papers about this? Maybe we could also try
the algorithm on image data and see the difference visually?

Regards, Vasil

On Thu, Jan 20, 2011 at 8:28 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> Hi Vasil,
>
> Thanks for the great testimonial. Why don't you add your tuning suggestions
> to the wiki? How did your vectorization differ from the example code? If
> your synthetic control results are as good as you say they are you ought to
> post those results too. The current implementation has some scalability
> limitations - as does canopy - but you are not the first to write about its
> positive results. I'd be interested in collaborating on implementing kernel
> selection and it would be nice to be able to compare and contrast the
> algorithm with the original citation on the other points as well. This might
> be another page on the wiki as this implementation is novel afaict.
>
> I have a pretty big dataset (18 gb, compressed) and will try out mean shift
> on it when I get a chance.
>
>
> On 1/20/11 1:27 AM, Vasil Vasilev wrote:
>
>> Hi Jeff,
>>
>> I used the meanshift algorithm to cluster the synthetic control data (I
>> produced slightly different type of vectorization than what was given in
>> the
>> examples) and it gave the best results compared to all other algorithms.
>> So
>> I am really amazed by what it can do!
>>
>> About (1): One interesting thing that I noted is the influence of T2. It
>> turned out that if you do not select T2 correctly the algorithm may not
>> converge. In my case I had the following situation. At the end (6-th
>> iteration) there were 2 canopies left which were within each-other's T1
>> boundaries, but outside each-other's T2 boundaries. This led to the fact
>> that they were touching each-other on every iteration and switching their
>> centroids. As the distance between the centroids was larger then the
>> convergence delta the algorithm never converged. However when I increased
>> T2
>> to merge these 2 clusters the algorithm converged successfully.
>> In this respect for me the following procedure worked:
>> 1. Select T2 as small as possible (greater then convergence delta)
>> 2. Increase it until the algorithm converges
>> With this approach the algorithm determined also very nicely the correct
>> number of clusters
>>
>> About (2): OK. I will try it and let you know about the result.
>>
>>
>> On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman<jdog@windwardsolutions.com
>> >wrote:
>>
>>  On 1/18/11 9:10 AM, Vasil Vasilev wrote:
>>>
>>>  Hello Mahout developers,
>>>>
>>>> Currently I am trying to get more in depth with the clustering
>>>> algorithms
>>>> -
>>>> how they should be used and tuned.
>>>> For this purpose I decided to learn from the source code of the
>>>> different
>>>> implementations.
>>>> In this respect I have the following questions about the Meanshift
>>>> algorithm
>>>> (sorry if it may sound naive, but I am a novice in the area):
>>>>
>>>> 1. I noted that the way it is implemented is different from the
>>>> straightforward approach that is described in the paper (
>>>>
>>>>
>>>> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>>>> ).
>>>> Later I learned from Jira MAHOUT-15 that this was made to enable
>>>> parallelism. There I also noticed that T2 should be fixed to 1.0.
>>>> In fact for me it seems that T2 should be correlated with the
>>>> convergence
>>>> delta parameter (which by default is 0.5) and should be slightly higher
>>>> then
>>>> it. Is my assumption correct?
>>>>
>>>>  I think the optimal values for T1 and T2 depend upon the distance
>>> measure
>>> chosen and the nature of the data itself. As this implementation is
>>> really
>>> just an iterative application of Canopy, I left both T parameters
>>> specifiable too. This is not exactly the same algorithm as Mean Shift in
>>> the
>>> paper but it seems to do amazingly well in some cases.
>>>
>>>  2. With the current implementation the user has the option to select
>>>
>>>> desired
>>>> distance measure, but does not have the flexibility to select a kernel.
>>>> The
>>>> current approach results in a hard-coded conical kernel with radius T1
>>>> and
>>>> no points outside T1 are considered in the path calculation of the
>>>> canopy.
>>>> Is it possible to slightly modify the algorithm (similar to the
>>>> modification
>>>> from kmeans to fuzzy kmeans) where weights are associated with a given
>>>> point
>>>> that would touch the canopy and these weights are drown from the kernel
>>>> function. For example they could be drawn from a normal distribution? Do
>>>> you
>>>> think the possibility for kernel selection could impact positively the
>>>> clustering with meanshift in some cases?
>>>>
>>>>  I don't know but it is intriguing. Why don't you try it?
>>>
>>>  Regards, Vasil
>>>>
>>>>
>>>>
>

Re: Meanshift clustering

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Hi Vasil,

Thanks for the great testimonial. Why don't you add your tuning 
suggestions to the wiki? How did your vectorization differ from the 
example code? If your synthetic control results are as good as you say 
they are you ought to post those results too. The current implementation 
has some scalability limitations - as does canopy - but you are not the 
first to write about its positive results. I'd be interested in 
collaborating on implementing kernel selection and it would be nice to 
be able to compare and contrast the algorithm with the original citation 
on the other points as well. This might be another page on the wiki as 
this implementation is novel afaict.

I have a pretty big dataset (18 gb, compressed) and will try out mean 
shift on it when I get a chance.

On 1/20/11 1:27 AM, Vasil Vasilev wrote:
> Hi Jeff,
>
> I used the meanshift algorithm to cluster the synthetic control data (I
> produced slightly different type of vectorization than what was given in the
> examples) and it gave the best results compared to all other algorithms. So
> I am really amazed by what it can do!
>
> About (1): One interesting thing that I noted is the influence of T2. It
> turned out that if you do not select T2 correctly the algorithm may not
> converge. In my case I had the following situation. At the end (6-th
> iteration) there were 2 canopies left which were within each-other's T1
> boundaries, but outside each-other's T2 boundaries. This led to the fact
> that they were touching each-other on every iteration and switching their
> centroids. As the distance between the centroids was larger then the
> convergence delta the algorithm never converged. However when I increased T2
> to merge these 2 clusters the algorithm converged successfully.
> In this respect for me the following procedure worked:
> 1. Select T2 as small as possible (greater then convergence delta)
> 2. Increase it until the algorithm converges
> With this approach the algorithm determined also very nicely the correct
> number of clusters
>
> About (2): OK. I will try it and let you know about the result.
>
>
> On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>> On 1/18/11 9:10 AM, Vasil Vasilev wrote:
>>
>>> Hello Mahout developers,
>>>
>>> Currently I am trying to get more in depth with the clustering algorithms
>>> -
>>> how they should be used and tuned.
>>> For this purpose I decided to learn from the source code of the different
>>> implementations.
>>> In this respect I have the following questions about the Meanshift
>>> algorithm
>>> (sorry if it may sound naive, but I am a novice in the area):
>>>
>>> 1. I noted that the way it is implemented is different from the
>>> straightforward approach that is described in the paper (
>>>
>>> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>>> ).
>>> Later I learned from Jira MAHOUT-15 that this was made to enable
>>> parallelism. There I also noticed that T2 should be fixed to 1.0.
>>> In fact for me it seems that T2 should be correlated with the convergence
>>> delta parameter (which by default is 0.5) and should be slightly higher
>>> then
>>> it. Is my assumption correct?
>>>
>> I think the optimal values for T1 and T2 depend upon the distance measure
>> chosen and the nature of the data itself. As this implementation is really
>> just an iterative application of Canopy, I left both T parameters
>> specifiable too. This is not exactly the same algorithm as Mean Shift in the
>> paper but it seems to do amazingly well in some cases.
>>
>>   2. With the current implementation the user has the option to select
>>> desired
>>> distance measure, but does not have the flexibility to select a kernel.
>>> The
>>> current approach results in a hard-coded conical kernel with radius T1 and
>>> no points outside T1 are considered in the path calculation of the canopy.
>>> Is it possible to slightly modify the algorithm (similar to the
>>> modification
>>> from kmeans to fuzzy kmeans) where weights are associated with a given
>>> point
>>> that would touch the canopy and these weights are drown from the kernel
>>> function. For example they could be drawn from a normal distribution? Do
>>> you
>>> think the possibility for kernel selection could impact positively the
>>> clustering with meanshift in some cases?
>>>
>> I don't know but it is intriguing. Why don't you try it?
>>
>>> Regards, Vasil
>>>
>>>


Re: Meanshift clustering

Posted by Vasil Vasilev <va...@gmail.com>.
Hi Jeff,

I used the meanshift algorithm to cluster the synthetic control data (I
produced a slightly different type of vectorization than what was given in
the examples) and it gave the best results compared to all the other
algorithms. So I am really amazed by what it can do!

About (1): One interesting thing that I noted is the influence of T2. It
turned out that if you do not select T2 correctly, the algorithm may not
converge. In my case I had the following situation. At the end (6th
iteration) there were 2 canopies left which were within each other's T1
boundaries, but outside each other's T2 boundaries. As a result they touched
each other on every iteration and swapped their centroids. As the distance
between the centroids was larger than the convergence delta, the algorithm
never converged. However, when I increased T2 enough to merge these 2
clusters, the algorithm converged successfully.
In this respect the following procedure worked for me:
1. Select T2 as small as possible (greater than the convergence delta)
2. Increase it until the algorithm converges
With this approach the algorithm also determined the correct number of
clusters very nicely.
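As a small sketch in Java, the search looks roughly like this (the
convergesWithT2 predicate is only a hypothetical placeholder for "run the
full mean shift job with this T2 and report whether it converged"; it is not
a Mahout API):

import java.util.function.DoublePredicate;

// Hypothetical sketch of the T2 search described above.
static double findSmallestConvergingT2(double convergenceDelta, double step,
                                       DoublePredicate convergesWithT2) {
  double t2 = convergenceDelta + step;  // step 1: as small as possible, just above the delta
  while (!convergesWithT2.test(t2)) {   // step 2: increase until the run converges
    t2 += step;
  }
  return t2;
}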

About (2): OK. I will try it and let you know about the result.


On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> On 1/18/11 9:10 AM, Vasil Vasilev wrote:
>
>> Hello Mahout developers,
>>
>> Currently I am trying to get more in depth with the clustering algorithms
>> -
>> how they should be used and tuned.
>> For this purpose I decided to learn from the source code of the different
>> implementations.
>> In this respect I have the following questions about the Meanshift
>> algorithm
>> (sorry if it may sound naive, but I am a novice in the area):
>>
>> 1. I noted that the way it is implemented is different from the
>> straightforward approach that is described in the paper (
>>
>> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
>> ).
>> Later I learned from Jira MAHOUT-15 that this was made to enable
>> parallelism. There I also noticed that T2 should be fixed to 1.0.
>> In fact for me it seems that T2 should be correlated with the convergence
>> delta parameter (which by default is 0.5) and should be slightly higher
>> then
>> it. Is my assumption correct?
>>
> I think the optimal values for T1 and T2 depend upon the distance measure
> chosen and the nature of the data itself. As this implementation is really
> just an iterative application of Canopy, I left both T parameters
> specifiable too. This is not exactly the same algorithm as Mean Shift in the
> paper but it seems to do amazingly well in some cases.
>
>  2. With the current implementation the user has the option to select
>> desired
>> distance measure, but does not have the flexibility to select a kernel.
>> The
>> current approach results in a hard-coded conical kernel with radius T1 and
>> no points outside T1 are considered in the path calculation of the canopy.
>> Is it possible to slightly modify the algorithm (similar to the
>> modification
>> from kmeans to fuzzy kmeans) where weights are associated with a given
>> point
>> that would touch the canopy and these weights are drown from the kernel
>> function. For example they could be drawn from a normal distribution? Do
>> you
>> think the possibility for kernel selection could impact positively the
>> clustering with meanshift in some cases?
>>
> I don't know but it is intriguing. Why don't you try it?
>
>> Regards, Vasil
>>
>>
>

Re: Meanshift clustering

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 1/18/11 9:10 AM, Vasil Vasilev wrote:
> Hello Mahout developers,
>
> Currently I am trying to get more in depth with the clustering algorithms -
> how they should be used and tuned.
> For this purpose I decided to learn from the source code of the different
> implementations.
> In this respect I have the following questions about the Meanshift algorithm
> (sorry if it may sound naive, but I am a novice in the area):
>
> 1. I noted that the way it is implemented is different from the
> straightforward approach that is described in the paper (
> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf).
> Later I learned from Jira MAHOUT-15 that this was made to enable
> parallelism. There I also noticed that T2 should be fixed to 1.0.
> In fact for me it seems that T2 should be correlated with the convergence
> delta parameter (which by default is 0.5) and should be slightly higher then
> it. Is my assumption correct?
I think the optimal values for T1 and T2 depend upon the distance 
measure chosen and the nature of the data itself. As this implementation 
is really just an iterative application of Canopy, I left both T 
parameters specifiable too. This is not exactly the same algorithm as 
Mean Shift in the paper but it seems to do amazingly well in some cases.
> 2. With the current implementation the user has the option to select desired
> distance measure, but does not have the flexibility to select a kernel. The
> current approach results in a hard-coded conical kernel with radius T1 and
> no points outside T1 are considered in the path calculation of the canopy.
> Is it possible to slightly modify the algorithm (similar to the modification
> from kmeans to fuzzy kmeans) where weights are associated with a given point
> that would touch the canopy and these weights are drown from the kernel
> function. For example they could be drawn from a normal distribution? Do you
> think the possibility for kernel selection could impact positively the
> clustering with meanshift in some cases?
I don't know but it is intriguing. Why don't you try it?
> Regards, Vasil
>