You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@commons.apache.org by "email@thorstenschaefer.de" <em...@thorstenschaefer.de> on 2013/11/12 23:58:13 UTC

[MATH] Restricted hierarchical clustering

I saw Thomas’ patch in https://issues.apache.org/jira/browse/MATH-959 which aims to add support for HAC to commons-math. However, I am just faced with a use case and wonder if/how this could be done either with existing methods or the proposed HAC algorithm there.

Lets assume we have items to 1000 cluster. Each item represents a sequence, e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ and I can assign data points  to each item which can be used to calculate their similarity/distance. My goal is to create 50 clusters containing all sequences – this can be done pretty straight forward using KMeans++.
However, lets assume we want a hierarchical cluster,  with 10 clusters at level 1 and 50 at level 2. At level one, I have the restriction that the first element in the sequence needs to be assigned to a unique cluster, e.g., the structure should look something like this:
Cluster1: A, B, C
Cluster1.1: AA, AC, AD, AE, …, BD, BE, BF, … CA
Cluster1.2: AB,BA,BC,BD,CB,CC,CD, …
…
Cluster1.7: AY,AZ,BZ,CZ
// cluster 1 has 7 subclusters.
Cluster2: D, E, F
…
Cluster3: G
Cluster3.1: GA,GB …, GU
Cluster3.2: GV, GW,… GZ
// note that cluster 3 has only 2 sub clusters
Cluster4: H, I
…
Cluster 10: W, X, Y, Z
// all sub clusters from cluster1 to cluster10 should add up to 50

Hence, all sequences in a cluster in level 2 need to have its sequence prefix in the parent cluster. Furthermore, even though I want 10 clusters on level 1 and 50 on level 2, it does not mean that each level 1 cluster should necessarily have 5 child clusters.

I hope its clear enough to get the general restriction I want to ensure and I wonder how this could be implemented using the clustering algorithms in commons-math.

Cheers,

Thorsten

Re: [MATH] Restricted hierarchical clustering

Posted by Thomas Neidhart <th...@gmail.com>.
Hi Thorsten,

this sounds like a very specific use-case of a hierarchical clustering.
I could imagine the following way to achieve it:

 * first cluster all data points with kmeans, with a k=50 as you would like
to have 50 clusters on level 2
 * take the 50 clusters and feed them into a HAC like algorithm which
finishes when the number of level 1 clusters have been formed ( = 10 )

I did experiment with this, see the code below. The test is very simple, I
create sequences from the normal ascii alphabet, AA .. ZZ
For the distance measure, I created a special variant of the Euclidean
distance by taking the position of the sequence into account, e.g. AA is
more similar to AB than to BA.
The result of kmeans is directly used to bootstrap the HAC clustering.

Is this something you had in mind?


    public static class Sequence implements Clusterable {

        private char[] sequence;

        public Sequence(char[] seq) {
            sequence = seq;
        }

        public double[] getPoint() {
            double[] point = new double[ sequence.length ];
            for ( int i = 0; i < point.length; i++ ) {
                point[i] = sequence[i] - 'A';
            }
            return point;
        }

        @Override
        public String toString() {
            return String.valueOf(sequence);
        }
    }

    public static class SequenceDistance implements DistanceMeasure {

        public double compute(double[] a, double[] b) {
            double dist = 0;
            for ( int i = 0, j = 2; i < a.length; i++ ) {
                double diff = a[i] - b[i];
                dist += FastMath.pow( 10, j ) * diff * diff;

                --j;
                if ( j < 1 ) {
                    j = 1;
                }
            }
            return FastMath.sqrt(dist);
        }

    }

    public static void main(String[] args) {

        List<Sequence> points = new ArrayList<Sequence>();

        for ( char a = 'A'; a <= 'Z' ; a++ ) {
            for ( char b = 'A'; b <= 'Z'; b++ ) {
                points.add(new Sequence( new char[] { a, b } ) );
            }
        }

        KMeansPlusPlusClusterer<Sequence> kmeans = new
KMeansPlusPlusClusterer<Sequence>(50, 100, new SequenceDistance());
        List<? extends Cluster<Sequence>> cluster = kmeans.cluster(points);

        Set<Cluster<Sequence>> currentClusters = new
HashSet<Cluster<Sequence>>(cluster);

        while (currentClusters.size() > 10 ) {
          Cluster<Sequence> a = null;
          Cluster<Sequence> b = null;
          double minDistance = Double.MAX_VALUE;
          int i = 0;
          for (Cluster<Sequence> clusterA : currentClusters) {
              int j = 0;
              for (Cluster<Sequence> clusterB : currentClusters) {
                  if (j++ <= i) {
                      continue;
                  }
                  double distance = maxDistance(clusterA, clusterB,
kmeans.getDistanceMeasure());
                  if (distance < minDistance) {
                      a = clusterA;
                      b = clusterB;
                      minDistance = distance;
                  }
              }
              i++;
          }

          currentClusters.remove(a);
          currentClusters.remove(b);
          Cluster<Sequence> merge = new
HierarchicalCluster<Sequence>(minDistance, a, b);
          currentClusters.add(merge);
        }

        for ( Cluster<Sequence> c : currentClusters ) {
            System.out.println(c.getPoints());
            printLeafs(c);
            System.out.println("---");
        }
    }

    public static <T extends Clusterable> double maxDistance(Cluster<T> a,
Cluster<T> b, DistanceMeasure dm) {
        double maxDistance = Double.MIN_VALUE;
        for (final T pA : a.getPoints()) {
            for (final T pB : b.getPoints()) {
                double d = dm.compute(pA.getPoint(), pB.getPoint());
                if (d > maxDistance) {
                    maxDistance = d;
                }
            }
        }
        return maxDistance;
    }

    public static <T extends Clusterable> void printLeafs(Cluster<T>
cluster) {
        if ( cluster instanceof HierarchicalCluster) {
            HierarchicalCluster hc = (HierarchicalCluster<T>) cluster;
            //System.out.println("Cluster: distance=" + hc.getDistance() +
" " + cluster.getPoints());

            if (hc.getLeftChild() != null) {
                printLeafs(hc.getLeftChild());
            }
            if (hc.getRightChild() != null) {
                printLeafs(hc.getRightChild());
            }

        } else {
            System.out.println("ClusterLeaf: " +
cluster.getPoints());
        }

    }





On Tue, Nov 12, 2013 at 11:58 PM, email@thorstenschaefer.de <
email@thorstenschaefer.de> wrote:

> I saw Thomas’ patch in https://issues.apache.org/jira/browse/MATH-959which aims to add support for HAC to commons-math. However, I am just faced
> with a use case and wonder if/how this could be done either with existing
> methods or the proposed HAC algorithm there.
>
> Lets assume we have items to 1000 cluster. Each item represents a
> sequence, e.g. AB, AC, AD, …, BA, BB, BC, …, ZA, …, ZZ and I can assign
> data points  to each item which can be used to calculate their
> similarity/distance. My goal is to create 50 clusters containing all
> sequences – this can be done pretty straight forward using KMeans++.
> However, lets assume we want a hierarchical cluster,  with 10 clusters at
> level 1 and 50 at level 2. At level one, I have the restriction that the
> first element in the sequence needs to be assigned to a unique cluster,
> e.g., the structure should look something like this:
> Cluster1: A, B, C
> Cluster1.1: AA, AC, AD, AE, …, BD, BE, BF, … CA
> Cluster1.2: AB,BA,BC,BD,CB,CC,CD, …
> …
> Cluster1.7: AY,AZ,BZ,CZ
> // cluster 1 has 7 subclusters.
> Cluster2: D, E, F
> …
> Cluster3: G
> Cluster3.1: GA,GB …, GU
> Cluster3.2: GV, GW,… GZ
> // note that cluster 3 has only 2 sub clusters
> Cluster4: H, I
> …
> Cluster 10: W, X, Y, Z
> // all sub clusters from cluster1 to cluster10 should add up to 50
>
> Hence, all sequences in a cluster in level 2 need to have its sequence
> prefix in the parent cluster. Furthermore, even though I want 10 clusters
> on level 1 and 50 on level 2, it does not mean that each level 1 cluster
> should necessarily have 5 child clusters.
>
> I hope its clear enough to get the general restriction I want to ensure
> and I wonder how this could be implemented using the clustering algorithms
> in commons-math.
>
> Cheers,
>
> Thorsten
>