You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/09/28 17:30:33 UTC
[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915786#action_12915786 ] 

Jeff Eastman commented on MAHOUT-513:
-------------------------------------

Hi Derek,

Thanks for your help on this new (experimental) code! If a particular cluster actually has no points assigned to it by one of the clustering jobs, then the centroid of the cluster will be repeated n times in its representative points and the (max-min) will fail as you note. Dirichlet does this quite often, as there are usually more models allocated than receive points in an iteration. The invalidCluster method is attempting to detect this degenerate situation and remove all clusters that would mess up the calculations.

In your situation, I gather your representative points are so close to the centroid that (max-min) becomes zero while the centroid vector equality test returns false because there is still some small difference. My hunch is that these clusters need to be pruned too, and adding an epsilon test to invalidCluster would be the right choice. Otherwise, one would have to return a very large number for the normalized density of that cluster and it would radically skew the intra-cluster density average. OTOH, if your clusters really do have distinct representative points you might want a very large intra-cluster density. I'm open to suggestions here. NaN is clearly not helpful.

Is this a text-clustering problem you are working?


> ClusterEvaluator inter-cluster density returns NaN
> --------------------------------------------------
>
>                 Key: MAHOUT-513
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-513
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.4
>
>
> Hi Jeff,
> I've been trying out the ClusterEvaluator class today since your recent changes, and I'm running into a problem whereby the average intra-cluster density can be set to NaN. Looking into it, it seems to happen for clusters containing points which are very close to the centroid.  For example, I have a cluster with:
> Centroid:
> {0:0.6075199543688895,1:-0.3165058387409551,2:0.2027106147825682,3:-21.246338574215706,4:-5.875047828899212,5:-0.9835694086952028,6:0.2794019939470805,7:-0.36402079609289717,8:0.5201946127074457,9:-0.47084217746293855,10:-0.14380397719670499,11:-0.10441028152861193,12:0.0698485086335405,13:0.014286758874801297}
> and one of the representative points (3 per cluster):
> [0.6075199543688894, -0.31650583874095506, 0.2027106147825682, -21.2463385742157, -5.875047828899212, -0.9835694086952026, 0.27940199394708054, -0.36402079609289706, 0.5201946127074457, -0.47084217746293855, -0.14380397719670499, -0.10441028152861194, 0.06984850863354047, 0.014286758874801297]
> As far as I can tell from debugging, the representative points look identical to the centroid of this cluster, but I'm assuming there's some small difference as "if (!vector.equals(clusterI.getCenter()))" in ClusterEvaluator.invalidCluster() is always returning false for these points, and so the cluster isn't pruned from the list.
> Later on, in ClusterEvaluator.intraClusterDensity(), the "min" and "max" distances are ending up with the same value, and the density from "double density = (sum / count - min) / (max - min);" is calculated as NaN, e.g. here are the values I'm getting:
> min = max = 1.5397509610616733E-7
> count = 3
> sum = 4.61925288318502E-7
> max - min: 0.0
> count - min: 2.9999998460249038
> (sum / count - min) = 0.0
> This then causes avgDensity to be calculated as NaN. I'm not sure what the solution is here, should invalidCluster() check that the the difference between the centroid and the candidate representative point is greater than a certain threshold, which would cause such a cluster to be pruned? Or is the fix in the intraClusterDensity() calculation to handle the case where min = max?
> BTW would you prefer that I create a Jira to record these issues, or is it okay to send them to the dev list as I've been doing?
> Thanks,
> Derek

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.