You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org> on 2011/03/31 15:42:05 UTC

[jira] [Created] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Elkan distance optimization for VectorBenchmarks class
------------------------------------------------------

                 Key: MAHOUT-645
                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.4
         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
            Reporter: Gustavo Salazar Torres
            Priority: Minor
             Fix For: 0.4


Implementation of first lemma of Elkan's optimization:
Given three points x, b, c (where b and c are centroids):
                                           d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gustavo Salazar Torres updated MAHOUT-645:
------------------------------------------

    Attachment: patches.zip

Affects mahout-core and mahout-utils

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013983#comment-13013983 ] 

Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------

Sean, I created the patch (hope I did right). Regarding whether committing it as an optimization or not, Ted and Robert suggested to wait until having further results. The intention is to implement the full Elkan optimization which will modify almost entirely the current version of K-means.

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.4
>
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014135#comment-13014135 ] 

Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------

Thanks, that's simpler!

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-645.patch, patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-645:
-----------------------------

    Fix Version/s:     (was: 0.4)
           Labels: centroid clustering elkan  (was: test)

Sounds good, though do you have a patch?

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013980#comment-13013980 ] 

Sean Owen commented on MAHOUT-645:
----------------------------------

One patch named "MAHOUT-645.patch" and not compressed would be easiest to deal with.
I just actually read the title... this is only adding a benchmark? Nothing wrong with that per se but seems a lot more useful if this optimization could be turned into a patch to improve the real code instead of just demonstrating it's an optimization (which is most certainly true).

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013990#comment-13013990 ] 

Sean Owen commented on MAHOUT-645:
----------------------------------

OK, would it be better to consider this an open issue for later, to cover updating the K-means code? or would you rather get this in now as a prelude to that?

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.4
>
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-645:
-----------------------------

         Due Date: 08/Apr/11
    Fix Version/s:     (was: 0.4)
                   0.5
         Assignee: Sean Owen

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-645.patch, patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-645:
-----------------------------

    Attachment: MAHOUT-645.patch

Here's my proposal for this patch. Rather than putting one benchmark-specific counter into the general TimingStatistics class, I just used the existing mechanism for reporting extra info.

I also adjusted the formatting and style.

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.4
>
>         Attachments: MAHOUT-645.patch, patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013994#comment-13013994 ] 

Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------

I prefer to get in this now.

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.4
>
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gustavo Salazar Torres updated MAHOUT-645:
------------------------------------------

    Fix Version/s: 0.4
           Status: Patch Available  (was: Open)

Affects mahout-core and mahout-utils

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.4
>
>         Attachments: patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-645) Elkan distance optimization for VectorBenchmarks class

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-645:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
>                 Key: MAHOUT-645
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-645
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>         Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
>            Reporter: Gustavo Salazar Torres
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: centroid, clustering, elkan
>             Fix For: 0.5
>
>         Attachments: MAHOUT-645.patch, patches.zip
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
>                                            d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira