Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2008/05/27 02:07:57 UTC

[jira] Created: (MAHOUT-59) Create some examples of clustering well-known datasets

Create some examples of clustering well-known datasets
------------------------------------------------------

                 Key: MAHOUT-59
                 URL: https://issues.apache.org/jira/browse/MAHOUT-59
             Project: Mahout
          Issue Type: New Feature
          Components: Clustering
            Reporter: Jeff Eastman


The existing unit tests for clustering need to be augmented with examples from the literature which illustrate its correct operation on datasets which have known clusters present. See http://archive.ics.uci.edu/ml/ for some candidate datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-59.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.2

As far as I can tell this was done?

[jira] Updated: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-59:
-------------------------------

    Attachment: MAHOUT-59.patch

This patch adds canopy, kmeans and meanshift clustering examples for the http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series dataset, which should be copied into the directory testdata before running the example jobs. I'm still not happy with the arguments for meanshift, and I'm going to work on them to get a nicer result before committing. The canopy and kmeans outputs produce the correct number of clusters (6) but I have not verified that the data are correctly clustered.
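One way to check whether the six clusters line up with the known classes is to measure cluster purity. The sketch below is in Python rather than Mahout's Java and is not part of the patch. It assumes the layout documented for the UCI file (600 rows, six classes of 100 consecutive rows each) and that a cluster id is available for every row in file order. The names `true_label` and `purity` and the toy assignment are hypothetical.

```python
from collections import Counter

ROWS_PER_CLASS = 100  # assumption: UCI layout, six classes of 100 consecutive rows


def true_label(row_index):
    """Known class (0-5) of a row, derived from its 0-based position in the file."""
    return row_index // ROWS_PER_CLASS


def purity(cluster_ids):
    """Fraction of rows that belong to the majority class of their cluster."""
    by_cluster = {}
    for row, cid in enumerate(cluster_ids):
        by_cluster.setdefault(cid, Counter())[true_label(row)] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical assignment: perfect, except the first 10 rows of class 1 land in cluster 0.
assignment = [true_label(r) for r in range(600)]
for r in range(100, 110):
    assignment[r] = 0
score = purity(assignment)
```

A purity of 1.0 means every cluster holds rows from a single class. Purity rises trivially as the number of clusters grows, so it is most informative here because the cluster count is already known to be six.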

[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682843#action_12682843 ] 

Grant Ingersoll commented on MAHOUT-59:
---------------------------------------

Jeff, this was committed, right?

[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Edward J. Yoon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12640103#action_12640103 ] 

Edward J. Yoon commented on MAHOUT-59:
--------------------------------------

Great, I'm +1 for this patch.

[jira] Issue Comment Edited: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Richard Tomsett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674983#action_12674983 ] 

richardtomsett edited comment on MAHOUT-59 at 2/19/09 4:55 AM:
----------------------------------------------------------------

Re: discussion of text clustering on the mailing list, there are several 'bag of words' examples at the UCI repository: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words . The data is in [docID wordID wordcount] format so needs to be processed into TF-IDF Vectors for clustering. I previously did this with a Python script but I'll write something in Hadoop to do it, before passing it on to Canopy or K-Means clustering. May take a little while as I haven't looked at my code for about half a year, and I didn't write unit tests or anything last time...
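The conversion from [docID wordID wordcount] triples to TF-IDF vectors can be sketched as follows. This is a single-machine Python illustration, not the Hadoop job described above. It assumes whitespace-separated triples, skips any line that does not have exactly three fields (such as count header lines, if the file has them), and uses one common TF-IDF weighting (term frequency normalised by document length, times the log of inverse document frequency). The function names are hypothetical.

```python
import math
from collections import defaultdict


def parse_triples(lines):
    """Parse whitespace-separated 'docID wordID count' lines, skipping other lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            yield int(parts[0]), int(parts[1]), int(parts[2])


def tfidf_vectors(triples):
    """Build sparse TF-IDF vectors {docID: {wordID: weight}} from triples."""
    counts = defaultdict(dict)   # docID -> {wordID: count}
    doc_freq = defaultdict(int)  # wordID -> number of documents containing it
    for doc_id, word_id, count in triples:
        if word_id not in counts[doc_id]:
            doc_freq[word_id] += 1
        counts[doc_id][word_id] = counts[doc_id].get(word_id, 0) + count
    n_docs = len(counts)
    vectors = {}
    for doc_id, words in counts.items():
        total = sum(words.values())
        vectors[doc_id] = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in words.items()
        }
    return vectors


# Tiny made-up sample: two documents, three distinct words.
sample = ["1 1 2", "1 2 1", "2 1 1", "2 3 3"]
vectors = tfidf_vectors(parse_triples(sample))
```

With this weighting, a word that appears in every document (word 1 in the sample) gets weight 0, so it contributes nothing to distances between documents.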

This would also involve writing a cosine distance measure class, which I guess would be useful generally. Could also involve ideas from https://issues.apache.org/jira/browse/MAHOUT-65 re: labelling data points (documents). Would this be a useful example?
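A cosine distance measure of the kind mentioned above might look like the following. This is a minimal Python sketch, not Mahout's distance measure interface. It assumes sparse vectors stored as dicts mapping wordID to weight, and it treats an all-zero vector as maximally distant, which is a convention chosen here for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between sparse dict vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on the angle between vectors, two documents with the same word proportions but different lengths are at distance 0, which is usually what is wanted for text.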

      was (Author: richardtomsett):
    Re: discussion of text clustering on the mailing list, there are several 'bag of words' examples at the UCI repository: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words . The data is in [docID wordID wordcount] format so needs to be processed into TF-IDF Vectors for clustering. I previously did this with a Python script but I'll write something in Hadoop to do it, before passing it on to Canopy or K-Means clustering. May take a little while as I haven't looked at my code for about half a year, and I didn't write unit tests or anything last time...

This would also involve writing a cosine distance measure class, which I guess would be useful generally. Would this be a useful example?
  
[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Richard Tomsett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683254#action_12683254 ] 

Richard Tomsett commented on MAHOUT-59:
---------------------------------------

Ugh, I had an example almost done but managed to overwrite it by having folders with too-similar names. That'll teach me :-\ Anyway, I'm looking at the K-Means issue [MAHOUT-99] at the moment, but will hopefully post a bag of words example relatively soon!

[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Richard Tomsett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674983#action_12674983 ] 

Richard Tomsett commented on MAHOUT-59:
---------------------------------------

Re: discussion of text clustering on the mailing list, there are several 'bag of words' examples at the UCI repository: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words . The data is in [docID wordID wordcount] format so needs to be processed into TF-IDF Vectors for clustering. I previously did this with a Python script but I'll write something in Hadoop to do it, before passing it on to Canopy or K-Means clustering. May take a little while as I haven't looked at my code for about half a year, and I didn't write unit tests or anything last time...

This would also involve writing a cosine distance measure class, which I guess would be useful generally. Would this be a useful example?

[jira] Commented: (MAHOUT-59) Create some examples of clustering well-known datasets

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675172#action_12675172 ] 

Ted Dunning commented on MAHOUT-59:
-----------------------------------

I think that is a great idea.

Jeff's Dirichlet clustering code might be interesting as well.


