Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/10/14 18:03:41 UTC

ASF Public Mail Archives are now on S3

Hi Mahouters,

I put up the complete ASF public mail archives, as of about 3 weeks ago, on Amazon's S3 and have made them public (let me know if I messed up; it is the first time I've done this).  I also intend, in the coming weeks, to convert them into Mahout files (if anyone wants to help, let me know).

There are 5 files:
https://s3.amazonaws.com/asf-mail-archives/public_a_d.tar
https://s3.amazonaws.com/asf-mail-archives/public_e_k.tar
https://s3.amazonaws.com/asf-mail-archives/public_l_o.tar
https://s3.amazonaws.com/asf-mail-archives/public_s_t.tar
https://s3.amazonaws.com/asf-mail-archives/public_u_z.tar

The tarballs are organized by Top Level Project name (e.g. Mahout is in the public_l_o.tar file).  I believe the tarballs contain GZIP files by date, and that the total uncompressed file size is somewhere in the 80-100GB range.  That should be sufficient to drive some semi-interesting things in terms of scale, even if it is towards the smaller end of things.
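
For anyone who wants to peek inside one of the tarballs before the Mahout conversion exists, here is a rough Python sketch that streams a tar file and decompresses each GZIP member it finds. The member layout is an assumption (a project directory first, then per-date .gz files); the actual paths inside the tarballs may differ. The helper names iter_gzip_members and project_of are made up for illustration.

```python
import gzip
import io
import tarfile


def iter_gzip_members(tar_fileobj):
    """Yield (member_name, decompressed_bytes) for each .gz member of a tar stream."""
    # "r|*" reads the archive sequentially, so the tarball is never loaded whole.
    with tarfile.open(fileobj=tar_fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = tar.extractfile(member).read()
            yield member.name, gzip.decompress(raw)


def project_of(member_name):
    """Return the first path component of a member name.

    Assumption: the first directory inside the tarball names the project.
    """
    parts = [p for p in member_name.split("/") if p not in ("", ".")]
    return parts[0] if parts else ""
```

With a downloaded tarball this would be used as `with open("public_l_o.tar", "rb") as f: for name, data in iter_gzip_members(f): ...`. Each member is decompressed fully into memory, which should be fine for per-date files but is worth revisiting if any single member turns out to be very large.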

As the ASF has very clear public mailing list archive policies, it is my belief that this data set is completely unencumbered.

From a Mahout standpoint, I'd love to see us make it dead simple to run these on Amazon's EMR and elsewhere as part of our examples with minimal setup work.  The data set could easily drive clustering and classification examples and probably could be extended for other areas too.  For instance, I'd love to see a classifier that labels emails by experience level (is this email beginner level or expert level?).  Another natural classifier would simply guess which project an email belongs to, similar to the 20 Newsgroups example.  Likewise, with clustering, it would be interesting to think about affinity across projects, e.g. how many messages to Mahout cluster near Hadoop or Lucene?
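
To make the "guess which project" idea concrete, here is a toy multinomial Naive Bayes classifier in plain Python. It is only a sketch of the technique, not Mahout's classifier or its API, and it would not scale to this data set as written. The function names and the way messages are paired with project labels are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(labeled_texts):
    """Build a model from an iterable of (project_label, message_text) pairs."""
    doc_counts = Counter()              # messages seen per project
    word_counts = defaultdict(Counter)  # token counts per project
    vocab = set()
    for label, text in labeled_texts:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the project label with the highest log-probability for the text."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training carry no information, so skip them.
    tokens = [t for t in tokenize(text) if t in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for t in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The labels here would come from wherever the project name is recorded for each message (for example the directory it was read from), which is the same trick the 20 Newsgroups example relies on.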

Cheers,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com


Re: ASF Public Mail Archives are now on S3

Posted by Isabel Drost <is...@apache.org>.
On Thu, 14 Oct 2010 Grant Ingersoll <gs...@apache.org> wrote:
> I put up the complete ASF public mail archives as of about 3 weeks
> ago on Amazon's S3 and have made them public

Yeah! Thanks for your effort.


Isabel


Re: ASF Public Mail Archives are now on S3

Posted by Robin Anil <ro...@gmail.com>.
Awesome!