You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org> on 2011/03/31 15:42:05 UTC
[jira] [Created] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Elkan distance optimization for VectorBenchmarks class
------------------------------------------------------
Key: MAHOUT-645
URL: https://issues.apache.org/jira/browse/MAHOUT-645
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.4
Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
Reporter: Gustavo Salazar Torres
Priority: Minor
Fix For: 0.4
Implementation of first lemma of Elkan's optimization:
Given three points x, b, c (where b and c are centroids):
d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gustavo Salazar Torres updated MAHOUT-645:
------------------------------------------
Attachment: patches.zip
Affects mahout-core and mahout-utils
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013983#comment-13013983 ]
Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------
Sean, I created the patch (hope I did right). Regarding whether committing it as an optimization or not, Ted and Robert suggested to wait until having further results. The intention is to implement the full Elkan optimization which will modify almost entirely the current version of K-means.
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.4
>
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014135#comment-13014135 ]
Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------
Thanks, that's simpler!
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Assignee: Sean Owen
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.5
>
> Attachments: MAHOUT-645.patch, patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-645:
-----------------------------
Fix Version/s: (was: 0.4)
Labels: centroid clustering elkan (was: test)
Sounds good, though do you have a patch?
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013980#comment-13013980 ]
Sean Owen commented on MAHOUT-645:
----------------------------------
One patch named "MAHOUT-645.patch" and not compressed would be easiest to deal with.
I just actually read the title... this is only adding a benchmark? Nothing wrong with that per se but seems a lot more useful if this optimization could be turned into a patch to improve the real code instead of just demonstrating it's an optimization (which is most certainly true).
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013990#comment-13013990 ]
Sean Owen commented on MAHOUT-645:
----------------------------------
OK, would it be better to consider this an open issue for later, to cover updating the K-means code? or would you rather get this in now as a prelude to that?
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.4
>
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-645:
-----------------------------
Due Date: 08/Apr/11
Fix Version/s: (was: 0.4)
0.5
Assignee: Sean Owen
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Assignee: Sean Owen
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.5
>
> Attachments: MAHOUT-645.patch, patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-645:
-----------------------------
Attachment: MAHOUT-645.patch
Here's my proposal for this patch. Rather than putting one benchmark-specific counter into the general TimingStatistics class, I just used the existing mechanism for reporting extra info.
I also adjusted the formatting and style.
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.4
>
> Attachments: MAHOUT-645.patch, patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013994#comment-13013994 ]
Gustavo Salazar Torres commented on MAHOUT-645:
-----------------------------------------------
I prefer to get in this now.
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.4
>
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Gustavo Salazar Torres (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gustavo Salazar Torres updated MAHOUT-645:
------------------------------------------
Fix Version/s: 0.4
Status: Patch Available (was: Open)
Affects mahout-core and mahout-utils
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.4
>
> Attachments: patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (MAHOUT-645) Elkan distance optimization for
VectorBenchmarks class
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-645:
-----------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Elkan distance optimization for VectorBenchmarks class
> ------------------------------------------------------
>
> Key: MAHOUT-645
> URL: https://issues.apache.org/jira/browse/MAHOUT-645
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Environment: Ubuntu Linux at Intel Core2 Duo P7450 @ 2.13GHz
> Reporter: Gustavo Salazar Torres
> Assignee: Sean Owen
> Priority: Minor
> Labels: centroid, clustering, elkan
> Fix For: 0.5
>
> Attachments: MAHOUT-645.patch, patches.zip
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Implementation of first lemma of Elkan's optimization:
> Given three points x, b, c (where b and c are centroids):
> d(b,c)>=2d(x.b) then d(x,c)>=d(x,b)
> in which case we wouldn't need to calculate d(x,c). This is used to find the closest centroid for every point x.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira