Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/09/28 17:22:32 UTC

[jira] Created: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

ClusterEvaluator inter-cluster density returns NaN
--------------------------------------------------

                 Key: MAHOUT-513
                 URL: https://issues.apache.org/jira/browse/MAHOUT-513
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.3
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman
             Fix For: 0.4


Hi Jeff,

I've been trying out the ClusterEvaluator class today since your recent changes, and I'm running into a problem whereby the average intra-cluster density can be set to NaN. Looking into it, it seems to happen for clusters containing points which are very close to the centroid.  For example, I have a cluster with:

Centroid:

{0:0.6075199543688895,1:-0.3165058387409551,2:0.2027106147825682,3:-21.246338574215706,4:-5.875047828899212,5:-0.9835694086952028,6:0.2794019939470805,7:-0.36402079609289717,8:0.5201946127074457,9:-0.47084217746293855,10:-0.14380397719670499,11:-0.10441028152861193,12:0.0698485086335405,13:0.014286758874801297}

and one of the representative points (3 per cluster):

[0.6075199543688894, -0.31650583874095506, 0.2027106147825682, -21.2463385742157, -5.875047828899212, -0.9835694086952026, 0.27940199394708054, -0.36402079609289706, 0.5201946127074457, -0.47084217746293855, -0.14380397719670499, -0.10441028152861194, 0.06984850863354047, 0.014286758874801297]

As far as I can tell from debugging, the representative points look identical to the centroid of this cluster, but I'm assuming there's some small difference, as the check "if (!vector.equals(clusterI.getCenter()))" in ClusterEvaluator.invalidCluster() always makes the method return false for these points, and so the cluster isn't pruned from the list.

Later on, in ClusterEvaluator.intraClusterDensity(), the "min" and "max" distances are ending up with the same value, and the density from "double density = (sum / count - min) / (max - min);" is calculated as NaN, e.g. here are the values I'm getting:

min = max = 1.5397509610616733E-7
count = 3
sum = 4.61925288318502E-7
max - min: 0.0
count - min: 2.9999998460249038
(sum / count - min) = 0.0

This then causes avgDensity to be calculated as NaN. I'm not sure what the solution is here. Should invalidCluster() check that the difference between the centroid and the candidate representative point is greater than a certain threshold, which would cause such a cluster to be pruned? Or is the fix in the intraClusterDensity() calculation to handle the case where min = max?
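
For reference, the failure mode described above can be reproduced outside Mahout. The following is a minimal sketch, not the ClusterEvaluator source: the class name DensityNaNDemo, both method names, and the fallback parameter are hypothetical. It shows that the quoted expression divides 0.0 by 0.0 whenever min equals max and the mean equals min, and it shows one possible guard for that case.

```java
/** Hypothetical stand-alone sketch; this is not the Mahout ClusterEvaluator code. */
final class DensityNaNDemo {

  /** The normalization quoted in the report: (sum / count - min) / (max - min). */
  static double naiveDensity(double sum, int count, double min, double max) {
    return (sum / count - min) / (max - min);
  }

  /**
   * One possible guard (an assumption, not the committed fix): when the range
   * is zero or NaN, return a caller-supplied fallback instead of dividing.
   */
  static double guardedDensity(double sum, int count, double min, double max,
                               double fallback) {
    double range = max - min;
    if (!(range > 0.0)) { // true for a zero range, a negative range, and NaN
      return fallback;
    }
    return (sum / count - min) / range;
  }

  public static void main(String[] args) {
    // Values copied from the report above.
    double min = 1.5397509610616733E-7;
    double max = min;
    int count = 3;
    double sum = 4.61925288318502E-7;
    System.out.println("naive:   " + naiveDensity(sum, count, min, max));
    System.out.println("guarded: " + guardedDensity(sum, count, min, max, 0.0));
  }
}
```

Which fallback is appropriate (zero, a very large density, or pruning the cluster altogether) is exactly the open question raised here, so the sketch leaves that choice to the caller.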

BTW would you prefer that I create a Jira to record these issues, or is it okay to send them to the dev list as I've been doing?

Thanks,

Derek



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-513.
---------------------------------

    Resolution: Fixed

Replacing the standard deviation computation and some other algorithm changes seem to have resolved this issue. Marking Fixed.


[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916216#action_12916216 ] 

Hudson commented on MAHOUT-513:
-------------------------------

Integrated in Mahout-Quality #347 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/347/])
    MAHOUT-513: hopefully fixed weighting of vectors by adding weightedX
MAHOUT-513
- Created interface GaussianAccumulator and two concrete implementations:
  - RunningSumsGaussianAccumulator uses running sums approach
  - OnlineGaussianAccumulator uses Knuth (Welford) approach
- Added unit test thereof which produces significant std deviations and drastically-odd
  variances. I'm committing this so it can get more eyeballs. It is not used anywhere yet.
- Refactored CDbwClusterEvaluator to use RunningSumsGaussianAccumulator and
  existing tests continue to run
- Cleaned up logging in various clustering algorithms to increase use of debug vs. info
  to reduce log clutter
All tests run.
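
The two accumulation strategies named in this commit message can be illustrated with a small scalar sketch. This is not the Mahout GaussianAccumulator code, which operates on vectors: the class and method names below are hypothetical, and the choice of population variance (dividing by n) is an assumption made for the example.

```java
/** Hypothetical scalar sketch of the two approaches; not the Mahout classes. */
final class VarianceSketch {

  /** Running-sums approach: accumulate n, sum(x) and sum(x^2), then combine. */
  static double runningSumsVariance(double[] xs) {
    double s1 = 0.0;
    double s2 = 0.0;
    for (double x : xs) {
      s1 += x;
      s2 += x * x;
    }
    int n = xs.length;
    double mean = s1 / n;
    // Subtracting two large, nearly equal numbers can lose precision here.
    return s2 / n - mean * mean;
  }

  /** Welford's online approach: update the mean and the sum of squared deviations. */
  static double welfordVariance(double[] xs) {
    long n = 0;
    double mean = 0.0;
    double m2 = 0.0; // running sum of squared deviations from the current mean
    for (double x : xs) {
      n++;
      double delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean);
    }
    return n > 0 ? m2 / n : Double.NaN;
  }

  public static void main(String[] args) {
    double[] xs = {2, 4, 4, 4, 5, 5, 7, 9};
    System.out.println("running sums: " + runningSumsVariance(xs));
    System.out.println("welford:      " + welfordVariance(xs));
  }
}
```

On well-scaled data the two methods should agree closely. They can diverge when the values are large relative to their spread, because the running-sums form subtracts two nearly equal quantities.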



[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915895#action_12915895 ] 

Hudson commented on MAHOUT-513:
-------------------------------

Integrated in Mahout-Quality #344 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/344/])
    MAHOUT-513: Added a sequential version of representativePoints extraction. Changed other tests to use it. All tests run



Re: [jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by Ted Dunning <te...@gmail.com>.

I didn't have anything to do with the code originally, so I can only comment
in generalities.

Degenerate clusters with radius zero are commonly a problem in evaluation
metrics.  Even if the cluster isn't exactly degenerate, if a sample of the
cluster is, then you may have the same problem.  These are also a problem in
maximum likelihood methods because they try to cluster to maximize a metric
that breaks under degeneracy.  Sadly, a single point is the prototypical
degenerate cluster, so it is easy for trouble to break out.

K-means avoids this by avoiding the concept of radius (i.e. fixing it in a
way that it doesn't matter).  Dirichlet mixtures handle it with a good prior.

The CDbw metrics don't seem to handle this well.  My tendency would be to
impose some kind of prior in the computation of radii (implicit in the
max-min that you mention).  How to do this well isn't clear to me without
spending more than my allowance in looking at the code or the paper.

Sorry to be fragmentary.  Hope it helps anyway.

On Tue, Sep 28, 2010 at 10:22 AM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:

> Sean, Robin, Ted: One of you guys evidently wrote the inter-cluster density
> computation but did not include an intra-cluster computation in "Mahout In
> Action". The CDbwEvaluator calculates both using only the representative
> points (and may have been transcribed incorrectly from the paper to boot).
> Please chime in.

Re: [jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  Hi Derek,

Let's consider why the intra-cluster density is being normalized by 
(max-min) in the first place. I confess I don't understand why the 
inter-cluster density is so normalized, but I copied the pattern from it 
out of blind faith.

Sean, Robin, Ted: One of you guys evidently wrote the inter-cluster
density computation but did not include an intra-cluster computation in
"Mahout In Action". The CDbwEvaluator calculates both using only the
representative points (and may have been transcribed incorrectly from
the paper to boot). Please chime in.

On 9/28/10 12:25 PM, Derek O'Callaghan (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915814#action_12915814 ]
>
> Derek O'Callaghan commented on MAHOUT-513:
> ------------------------------------------
>
> Hi Jeff,
>
> In this case it appears that there are ~20 points in the cluster, and they're all almost identical to each other. It's a text-clustering problem, using reduced dimensionality, and these original 20 points have almost identical terms. I'm not sure either what the solution is; this is an acceptable cluster that happens to be quite dense, so it'd be good to see this in the results. Having said that, the average density will then be skewed as you say, as the remaining clusters in this case are nowhere near as dense. I need to think about it a bit more.
>
> I'm also getting a couple of strange values in the CDbwEvaluator. I suspect it could be a similar issue, but I haven't had a chance to confirm it yet.
>
> Thanks,
>
> Derek

[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Derek O'Callaghan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915814#action_12915814 ] 

Derek O'Callaghan commented on MAHOUT-513:
------------------------------------------

Hi Jeff,

In this case it appears that there are ~20 points in the cluster, and they're all almost identical to each other. It's a text-clustering problem, using reduced dimensionality, and these original 20 points have almost identical terms. I'm not sure either what the solution is; this is an acceptable cluster that happens to be quite dense, so it'd be good to see this in the results. Having said that, the average density will then be skewed as you say, as the remaining clusters in this case are nowhere near as dense. I need to think about it a bit more.

I'm also getting a couple of strange values in the CDbwEvaluator. I suspect it could be a similar issue, but I haven't had a chance to confirm it yet.

Thanks,

Derek


[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915982#action_12915982 ] 

Hudson commented on MAHOUT-513:
-------------------------------

Integrated in Mahout-Quality #346 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/346/])
    MAHOUT-513: Reviewed calculations from paper in detail, making some changes that seemed needed. All tests run but I'm still not very confident in the computed numbers



[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915786#action_12915786 ] 

Jeff Eastman commented on MAHOUT-513:
-------------------------------------

Hi Derek,

Thanks for your help on this new (experimental) code! If a particular cluster actually has no points assigned to it by one of the clustering jobs, then the centroid of the cluster will be repeated n times in its representative points and the (max-min) will fail as you note. Dirichlet does this quite often, as there are usually more models allocated than receive points in an iteration. The invalidCluster method is attempting to detect this degenerate situation and remove all clusters that would mess up the calculations.

In your situation, I gather your representative points are so close to the centroid that (max-min) becomes zero while the centroid vector equality test returns false because there is still some small difference. My hunch is that these clusters need to be pruned too, and adding an epsilon test to invalidCluster would be the right choice. Otherwise, one would have to return a very large number for the normalized density of that cluster and it would radically skew the intra-cluster density average. OTOH, if your clusters really do have distinct representative points you might want a very large intra-cluster density. I'm open to suggestions here. NaN is clearly not helpful.
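
The epsilon test suggested above might look roughly like the following. This is a sketch under stated assumptions, not the invalidCluster implementation: it uses plain double arrays instead of Mahout vectors, Euclidean distance as the measure, and hypothetical names (EpsilonPruneSketch, isDegenerate, epsilon).

```java
/** Hypothetical sketch of an epsilon-based degeneracy test; not Mahout code. */
final class EpsilonPruneSketch {

  /** Euclidean distance between two points of equal dimension. */
  static double distance(double[] a, double[] b) {
    double sumSq = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sumSq += d * d;
    }
    return Math.sqrt(sumSq);
  }

  /**
   * Returns true when every representative point lies within epsilon of the
   * center, i.e. the cluster is a candidate for pruning. A cluster with no
   * representative points is also reported as degenerate.
   */
  static boolean isDegenerate(double[] center, double[][] representativePoints,
                              double epsilon) {
    for (double[] point : representativePoints) {
      if (distance(point, center) > epsilon) {
        return false; // at least one point is meaningfully away from the center
      }
    }
    return true;
  }

  public static void main(String[] args) {
    double[] center = {0.0, 0.0};
    double[][] nearlyIdentical = {{1e-9, 0.0}, {0.0, 1e-9}, {0.0, 0.0}};
    double[][] spread = {{1.0, 0.0}, {0.0, 0.0}, {0.0, 2.0}};
    System.out.println(isDegenerate(center, nearlyIdentical, 1e-6));
    System.out.println(isDegenerate(center, spread, 1e-6));
  }
}
```

Compared with an exact equality test, this also prunes clusters whose points differ from the center only by rounding noise. The cost is the one noted above: a genuinely dense cluster would be dropped from the average as well, so the value of epsilon matters.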

Is this a text-clustering problem you are working on?



[jira] Commented: (MAHOUT-513) ClusterEvaluator inter-cluster density returns NaN

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916237#action_12916237 ] 

Hudson commented on MAHOUT-513:
-------------------------------

Integrated in Mahout-Quality #348 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/348/])
    MAHOUT-513
- removed weighting from GaussianAccumulator.observe(). It's not needed for
CDbw and is problematic in the OnlineGaussianAccumulator.  Tests all run.
MAHOUT-513
- fixed bug in OnlineGaussianAccumulator.getStd()
- added test of variance
all tests now run, though std/variance results are different than with RunningSums

