Posted to dev@mahout.apache.org by "Matias Bjørling (JIRA)" <ji...@apache.org> on 2009/09/06 10:44:57 UTC
[jira] Created: (MAHOUT-173) Implement clustering of massive-domain
attributes
Implement clustering of massive-domain attributes
-------------------------------------------------
Key: MAHOUT-173
URL: https://issues.apache.org/jira/browse/MAHOUT-173
Project: Mahout
Issue Type: New Feature
Components: Clustering
Reporter: Matias Bjørling
Priority: Trivial
Implement the clustering algorithm described in "A Framework for Clustering Massive-Domain Data Streams" by Charu C. Aggarwal.
Steps:
1. Implement a baseline solution to compare against.
2. Figure out how to implement the loading of clustering by looking at the k-means implementation.
3. Implement the Count-Min sketch algorithm for each cluster.
4. Find out how to let the user choose the distance function for the input data (maybe already possible?).
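For readers unfamiliar with step 3, here is a minimal sketch of a Count-Min sketch with one instance kept per cluster. It is only an illustration, not Mahout code and not necessarily how the paper sets things up: the class name `CountMinSketch`, the `width` and `depth` defaults, the MD5-based hashing, and the sample values are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: `depth` rows of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the hash with the row number to get one hash function per row.
        # MD5 is an arbitrary choice here; any well-mixing hash would do.
        digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(row, item)]
            for row in range(self.depth)
        )


# One sketch per cluster (hypothetical setup with two clusters).
cluster_sketches = [CountMinSketch() for _ in range(2)]
for value in ["10.0.0.1", "10.0.0.1", "10.0.0.7"]:
    cluster_sketches[0].add(value)
cluster_sketches[1].add("example.org")
```

The estimate never undercounts, and memory stays fixed at `width * depth` counters per cluster no matter how many distinct attribute values arrive, which is the property that makes it attractive for massive domains.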
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by Ted Dunning <te...@gmail.com>.
I never saw much progress on this.
On Wed, Dec 23, 2009 at 11:58 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:
>
> [
> https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794197#action_12794197]
>
> Sean Owen commented on MAHOUT-173:
> ----------------------------------
>
> Pinging this issue -- has there been any progress in the past 3.5 months, or
> should we shelve it?
>
--
Ted Dunning, CTO
DeepDyve
[jira] Updated: (MAHOUT-173) Implement clustering of massive-domain
attributes
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated MAHOUT-173:
-----------------------------
Fix Version/s: 0.3
Affects Version/s: 0.2
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794197#action_12794197 ]
Sean Owen commented on MAHOUT-173:
----------------------------------
Pinging this issue -- has there been any progress in the past 3.5 months, or should we shelve it?
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796538#action_12796538 ]
Ted Dunning commented on MAHOUT-173:
------------------------------------
It seems that this algorithm is a combination of a hashed kernel and k-means clustering.
Would it be easiest to implement this using the vectorization algorithms described as part of MAHOUT-228 and then just using the normal k-means algorithm?
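As a rough illustration of the "hashed kernel plus k-means" reading, the sketch below hashes each record's attribute values into a fixed-length count vector and then runs plain Lloyd's k-means on those vectors. It is not the MAHOUT-228 vectorization code or Mahout's k-means; the helper names `hashed_vector` and `kmeans`, the `dims` value, the MD5 hash, and the sample records are all assumptions chosen for the example.

```python
import hashlib


def hashed_vector(attributes, dims=16):
    """Map a bag of attribute values to a fixed-length count vector."""
    vec = [0.0] * dims
    for value in attributes:
        digest = hashlib.md5(str(value).encode("utf-8")).digest()
        vec[int.from_bytes(digest[:8], "big") % dims] += 1.0
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical records, each a bag of values from a very large domain.
records = [
    ["a.example", "b.example"],
    ["x.example", "y.example"],
    ["a.example", "b.example", "a.example"],
    ["x.example", "y.example", "y.example"],
]
points = [hashed_vector(r) for r in records]
assignment, centroids = kmeans(points, k=2)
```

Because the vector length is fixed by `dims`, the clustering step never sees the full attribute domain; the trade-off is that distinct values can collide in the same bucket, so a small `dims` blurs the distances.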
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799154#action_12799154 ]
Sean Owen commented on MAHOUT-173:
----------------------------------
Just clarifying the status -- Vaijanath, are you working on this, or are you saying it's kind of subsumed by MAHOUT-228?
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799163#action_12799163 ]
Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------
Hi Sean,
This can be subsumed by MAHOUT-228, as the only additional step is running k-means once you have the vectors. But I am still working on this in my free time to learn more about Mahout.
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796545#action_12796545 ]
Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------
Hi Ted,
One can use either the MAHOUT-228 patch or just the Lucene text vectorization already in there. After that, it looks like one would most likely just run k-means.
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794307#action_12794307 ]
Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------
Hi,
I have a few hours to spend, so I can take a look at it and try to accomplish it. Is that okay before we shelve it?
[jira] Updated: (MAHOUT-173) Implement clustering of massive-domain
attributes
Posted by "Matias Bjørling (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matias Bjørling updated MAHOUT-173:
-----------------------------------
Remaining Estimate: 30h (was: 2016h)
Original Estimate: 30h (was: 2016h)
Changing the estimate. It will be done in three months, but the estimated effort is only 30 hours.
[jira] Commented: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794320#action_12794320 ]
Ted Dunning commented on MAHOUT-173:
------------------------------------
Go right ahead and implement it or at least scope out how much work is required.
[jira] Resolved: (MAHOUT-173) Implement clustering of
massive-domain attributes
Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-173.
------------------------------
Resolution: Won't Fix