Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2011/05/21 05:10:47 UTC

[jira] [Updated] (MAHOUT-588) Benchmark Mahout's clustering performance on EC2 and publish the results

     [ https://issues.apache.org/jira/browse/MAHOUT-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-588:
-----------------------------

    Affects Version/s: 0.5
        Fix Version/s: 0.6
             Assignee: Grant Ingersoll

Looks like a lot of great work. I'm not clear on whether there are additional steps here -- is it "done"? In any event, it looks like something we should aim to finish before 0.6 ships.

> Benchmark Mahout's clustering performance on EC2 and publish the results
> ------------------------------------------------------------------------
>
>                 Key: MAHOUT-588
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-588
>             Project: Mahout
>          Issue Type: Task
>    Affects Versions: 0.5
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: 60_clusters_kmeans_10_iterations_100K_coordinates.txt, MAHOUT-588.patch, MailArchivesClusteringAnalyzer.java, MailArchivesClusteringAnalyzerTest.java, SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives.java, SequenceFilesFromMailArchives2.java, SequenceFilesFromMailArchivesTest.java, TamingAnalyzer.java, TamingAnalyzer.java, TamingAnalyzerTest.java, TamingCollocDriver.java, TamingCollocMapper.java, TamingDictVect.java, TamingDictionaryVectorizer.java, TamingGramKeyGroupComparator.java, TamingSubset.java, TamingSubsetMapper.java, TamingTFIDF.java, TamingTokenizer.java, Top1000Tokens_maybe_stopWords, Uncompress.java, clusters1.txt, clusters_kMeans.txt, distcp_large_to_s3_failed.log, ec2_setup_notes.txt, ec2_setup_notes_v2.txt, ec2_setup_notes_v2.txt, mahout-588_canopy.pdf, mahout-588_distribution.pdf, prep_asf_mail_archives.sh, prep_asf_mail_archives.sh, prep_asf_mail_archives.sh, seq2sparse_small_failed.log, seq2sparse_xlarge_ok.log
>
>
> For Taming Text, I've commissioned some benchmarking work on Mahout's clustering algorithms.  I've asked the two doing the project to do all the work in the open here.  The goal is to use a publicly reusable dataset (for now, the ASF mail archives, assuming it is big enough) and run on EC2 and make all resources available so others can reproduce/improve.
> I'd like to add the setup code to utils (although it could possibly be done as a Vectorizer), and the results will be published on the Wiki as well as in the book.  This issue is to track the patches, etc.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira