
Posted to dev@mahout.apache.org by "Matias Bjørling (JIRA)" <ji...@apache.org> on 2009/09/06 10:44:57 UTC

[jira] Created: (MAHOUT-173) Implement clustering of massive-domain attributes

Implement clustering of massive-domain attributes
-------------------------------------------------

                 Key: MAHOUT-173
                 URL: https://issues.apache.org/jira/browse/MAHOUT-173
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Matias Bjørling
            Priority: Trivial


Implement the clustering algorithm described in "A Framework for Clustering Massive-Domain Data Streams" by Charu C. Aggarwal.

Steps: 

1. Implement a baseline solution against which other solutions can be compared.
2. Figure out how to implement the loading of clustering by looking at the k-means implementation.
3. Implement the Count-Min sketch algorithm for each cluster.
4. Find out how to let the user choose the distance function for the input data (maybe already possible?).
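
As a rough illustration of step 3, a Count-Min sketch can be written as a small table of counters with one hash function per row. The code below is only a minimal sketch under stated assumptions, not the formulation from the paper and not Mahout code: the class name CountMinSketch, its constructor parameters, and the simple linear hash scheme are all made up for this example.

```java
import java.util.Random;

/** Hypothetical minimal Count-Min sketch: a depth x width table of counters. */
final class CountMinSketch {
    // Mersenne prime 2^31 - 1, used as the modulus of the per-row hash (an assumption).
    private static final long PRIME = 2147483647L;

    private final int depth;
    private final int width;
    private final long[][] table;
    private final long[] a; // per-row multipliers
    private final long[] b; // per-row offsets

    CountMinSketch(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.a = new long[depth];
        this.b = new long[depth];
        Random rnd = new Random(seed);
        for (int row = 0; row < depth; row++) {
            a[row] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // in [1, PRIME - 1]
            b[row] = rnd.nextInt(Integer.MAX_VALUE);         // in [0, PRIME - 1]
        }
    }

    /** Maps an item to a column of the given row. */
    private int bucket(int row, Object item) {
        long x = item.hashCode() & 0x7fffffffL;      // force a non-negative value
        long h = (a[row] * x + b[row]) % PRIME;      // product stays below 2^63
        return (int) (h % width);
    }

    /** Adds a non-negative count for the item to one counter in every row. */
    void add(Object item, long count) {
        for (int row = 0; row < depth; row++) {
            table[row][bucket(row, item)] += count;
        }
    }

    /** Returns the minimum counter over all rows, an upper bound on the true count. */
    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(row, item)]);
        }
        return min;
    }
}
```

With non-negative updates the estimate never falls below the true count; it overshoots only when other items collide with the queried item in every row. Keeping one such table per cluster would be one way to read step 3.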


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by Ted Dunning <te...@gmail.com>.

I never saw much progress on this.

On Wed, Dec 23, 2009 at 11:58 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794197#action_12794197]
>
> Sean Owen commented on MAHOUT-173:
> ----------------------------------
>
> Pinging this issue -- has there been any progress in the past 3.5 months, or
> should we shelve it?


-- 
Ted Dunning, CTO
DeepDyve

[jira] Updated: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-173:
-----------------------------

        Fix Version/s: 0.3
    Affects Version/s: 0.2

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794197#action_12794197 ] 

Sean Owen commented on MAHOUT-173:
----------------------------------

Pinging this issue -- has there been any progress in the past 3.5 months, or should we shelve it?

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796538#action_12796538 ] 

Ted Dunning commented on MAHOUT-173:
------------------------------------


It seems that this algorithm is a combination of a hashed kernel and k-means clustering.

Would it be easiest to implement this using the vectorization algorithms described as part of MAHOUT-228 and then just using the normal k-means algorithm?
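
To make the question above more concrete, here is a hedged sketch of the two pieces involved: hashing tokens into a fixed-size vector, and assigning that vector to the nearest centroid, which is the assignment step of k-means. It does not use the MAHOUT-228 code or any Mahout API. The class HashedKMeansSketch, the Distance interface, and all method names are hypothetical, and String.hashCode stands in for whatever hash function a real implementation would choose.

```java
import java.util.List;

/** Hypothetical illustration of hashed vectorization followed by k-means assignment. */
final class HashedKMeansSketch {

    /** Hypothetical pluggable distance function chosen by the user. */
    interface Distance {
        double between(double[] x, double[] y);
    }

    /** Squared Euclidean distance, the usual choice for plain k-means. */
    static final Distance SQUARED_EUCLIDEAN = (x, y) -> {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return sum;
    };

    /** Hashing trick: each token increments one of dims slots, so the domain size never matters. */
    static double[] hashVector(List<String> tokens, int dims) {
        double[] v = new double[dims];
        for (String token : tokens) {
            // floorMod keeps the index in [0, dims) even for negative hash codes.
            v[Math.floorMod(token.hashCode(), dims)] += 1.0;
        }
        return v;
    }

    /** Returns the index of the centroid closest to the point under the given distance. */
    static int nearest(double[] point, double[][] centroids, Distance distance) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance.between(point, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return best;
    }
}
```

Passing the distance in as a parameter is one possible answer to step 4 of the issue; the centroid update step of k-means is omitted here.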



[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799154#action_12799154 ] 

Sean Owen commented on MAHOUT-173:
----------------------------------

Just clarifying the status -- Vaijanath, are you working on this, or are you saying it's kind of subsumed by MAHOUT-228?

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799163#action_12799163 ] 

Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------

Hi Sean,

This can be subsumed by MAHOUT-228, as the only additional step is running k-means once you have the vectors. I am still working on this in my free time, though, to learn more about Mahout.

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796545#action_12796545 ] 

Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------

Hi Ted,

One can use either the MAHOUT-228 patch or just the Lucene text vectorization that is already in there. After that, the remaining step is most likely to run k-means.

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Vaijanath N. Rao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794307#action_12794307 ] 

Vaijanath N. Rao commented on MAHOUT-173:
-----------------------------------------

Hi,

I have a few hours to spend, so I can take a look at it and try to accomplish it. Is that okay before we shelve it?

[jira] Updated: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Matias Bjørling (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matias Bjørling updated MAHOUT-173:
-----------------------------------

    Remaining Estimate: 30h  (was: 2016h)
     Original Estimate: 30h  (was: 2016h)

Changing the estimate. It will be done in three months, but the estimated effort is only 30 hours.

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794320#action_12794320 ] 

Ted Dunning commented on MAHOUT-173:
------------------------------------


Go right ahead and implement it or at least scope out how much work is required.

[jira] Resolved: (MAHOUT-173) Implement clustering of massive-domain attributes

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-173.
------------------------------

    Resolution: Won't Fix
