Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2011/03/13 01:24:59 UTC

[jira] Created: (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

T1 and T2 Values in Canopy (& MeanShift) 
-----------------------------------------

                 Key: MAHOUT-626
                 URL: https://issues.apache.org/jira/browse/MAHOUT-626
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.5
            Reporter: Jeff Eastman
            Assignee: Jeff Eastman
             Fix For: 0.5


Users are reporting that T1 and T2 threshold values that work in sequential mode don't work as well in MapReduce mode because the mapper and reducer both use the same values. When the mapper coalesces a number of points into a single centroid, the distances change enough that the reducer needs independent threshold values.

Here is a patch that implements optional T3 and T4 threshold values, which are used only by the canopy reducer. Convenience methods have been added for API compatibility, and defaults are included so that T3 and T4 fall back to T1 and T2. A new unit test confirms the thresholds are being set correctly.

If this works out as a positive improvement, I will make the same changes to MeanShift and commit them.
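
For readers following along, the fallback described above can be sketched roughly as below. This is a minimal one-dimensional illustration, not the code in CanopyT3T4.patch: the class name CanopyThresholdSketch, the Thresholds holder, and the methods reducerThresholds and canopyCentroids are all hypothetical, and the sample points and threshold values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not Mahout API. */
final class CanopyThresholdSketch {

  /** A pair of canopy distances: t1 is the loose threshold, t2 the tight one. */
  public static final class Thresholds {
    public final double t1;
    public final double t2;

    public Thresholds(double t1, double t2) {
      this.t1 = t1;
      this.t2 = t2;
    }
  }

  /** Reducer thresholds use T3/T4 when given, otherwise fall back to the mapper's T1/T2. */
  public static Thresholds reducerThresholds(Thresholds mapper, Double t3, Double t4) {
    double loose = t3 != null ? t3 : mapper.t1;
    double tight = t4 != null ? t4 : mapper.t2;
    return new Thresholds(loose, tight);
  }

  /** One canopy pass over 1-D points; returns the mean of the points in each canopy. */
  public static List<Double> canopyCentroids(List<Double> points, Thresholds t) {
    List<Double> centers = new ArrayList<>();
    List<Double> sums = new ArrayList<>();
    List<Integer> counts = new ArrayList<>();
    for (double p : points) {
      boolean stronglyBound = false;
      for (int i = 0; i < centers.size(); i++) {
        double d = Math.abs(p - centers.get(i));
        if (d < t.t1) {
          // Within the loose threshold: the point contributes to this canopy.
          sums.set(i, sums.get(i) + p);
          counts.set(i, counts.get(i) + 1);
        }
        // Within the tight threshold: the point may not seed a new canopy.
        stronglyBound |= d < t.t2;
      }
      if (!stronglyBound) {
        centers.add(p);
        sums.add(p);
        counts.add(1);
      }
    }
    List<Double> centroids = new ArrayList<>();
    for (int i = 0; i < centers.size(); i++) {
      centroids.add(sums.get(i) / counts.get(i));
    }
    return centroids;
  }

  public static void main(String[] args) {
    Thresholds mapper = new Thresholds(3.0, 1.5);

    // Each "mapper" sees one split of the input and emits canopy centroids.
    List<Double> mapped = new ArrayList<>();
    mapped.addAll(canopyCentroids(Arrays.asList(0.0, 1.0, 2.0), mapper));
    mapped.addAll(canopyCentroids(Arrays.asList(2.5, 3.5, 10.0), mapper));

    // The "reducer" clusters those centroids, with or without explicit T3/T4.
    Thresholds byDefault = reducerThresholds(mapper, null, null);
    Thresholds explicit = reducerThresholds(mapper, 4.0, 2.0);
    System.out.println("mapper centroids:    " + mapped);
    System.out.println("reducer (T1/T2):     " + canopyCentroids(mapped, byDefault));
    System.out.println("reducer (T3=4,T4=2): " + canopyCentroids(mapped, explicit));
  }
}
```

The point of the sketch is only that the reducer works on centroids rather than raw points, so it may need distances different from those the mapper used; passing null for T3 and T4 reproduces the old behavior.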

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

RE: [jira] [Commented] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by Jeff Eastman <je...@Narus.com>.
The Eclipse settings of my work editor differ from the Mahout conventions. I haven't heard any more feedback on the meat of this issue but have no problem committing it. I will take a look this weekend and make a disposition.

-----Original Message-----
From: Sean Owen (JIRA) [mailto:jira@apache.org] 
Sent: Sunday, March 20, 2011 1:50 PM
To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)


    [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008967#comment-13008967 ] 

Sean Owen commented on MAHOUT-626:
----------------------------------

Jeff, I glanced at it and didn't see any issues except formatting. A lot of the changed lines look like whitespace changes, and it seems to be using tabs for indents instead of 2 spaces. (I'd also suggest that private fields plus protected getters are better than protected fields.) But we can address that later.

I think you are in the best position to understand this code, the need, and the change, so it seems reasonable for you to look at the above and then commit; if there are any small further changes, you can iterate from there.
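
The field-visibility suggestion above might look something like the following sketch. The class names ThresholdHolder and ReducerSideHolder and the method looseToTightRatio are hypothetical and are not taken from the patch; the only point is that subclasses read the values through getters and cannot reassign them.

```java
/** Hypothetical base class; names are illustrative, not taken from the patch. */
class ThresholdHolder {
  // Private and final: subclasses cannot reassign these directly.
  private final double t1;
  private final double t2;

  ThresholdHolder(double t1, double t2) {
    this.t1 = t1;
    this.t2 = t2;
  }

  // Subclasses get read access only, through protected getters.
  protected double getT1() {
    return t1;
  }

  protected double getT2() {
    return t2;
  }
}

/** Hypothetical subclass that only ever reads the thresholds. */
class ReducerSideHolder extends ThresholdHolder {
  ReducerSideHolder(double t3, double t4) {
    super(t3, t4);
  }

  double looseToTightRatio() {
    return getT1() / getT2();
  }
}
```

With protected fields instead, any subclass could change the thresholds after construction, which is harder to reason about.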

[jira] [Commented] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018164#comment-13018164 ] 

Hudson commented on MAHOUT-626:
-------------------------------

Integrated in Mahout-Quality #737 (See [https://hudson.apache.org/hudson/job/Mahout-Quality/737/])
    MAHOUT-626: Added optional T3/T4 arguments to Canopy. Added new unit test. All tests run


[jira] [Resolved] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-626.
---------------------------------

    Resolution: Fixed

I've stewed about whether or not to try this with MeanShiftCanopy and decided it is not appropriate to change these values between the mapper and reducer. MeanShift is an iterative algorithm, and the thresholds would vacillate between mapper and reducer values in a way that is not reflected in the algorithm as I understand it.

[jira] [Commented] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018142#comment-13018142 ] 

Jeff Eastman commented on MAHOUT-626:
-------------------------------------

r1090881 committed the above patch (with improved formatting). All tests run.

[jira] [Reopened] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman reopened MAHOUT-626:
---------------------------------


Reopening in case I need to do this for MeanShift too.

[jira] [Updated] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-626:
--------------------------------

    Fix Version/s:     (was: 0.5)

[jira] [Updated] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-626:
-----------------------------

    Affects Version/s: 0.4
        Fix Version/s: 0.6

JIRA housekeeping: assigning for 0.6 since I bet it's easy enough to finish up for MeanShift then.

[jira] Updated: (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-626:
--------------------------------

    Attachment: CanopyT3T4.patch

Here's the patch for Canopy.

[jira] [Resolved] (MAHOUT-626) T1 and T2 Values in Canopy (& MeanShift)

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman resolved MAHOUT-626.
---------------------------------

    Resolution: Fixed
