Posted to openrelevance-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/10/14 18:05:06 UTC

ASF Public Mail Archives on Amazon S3

Hi ORPers,

I put up the complete ASF public mail archives, as of about 3 weeks ago, on Amazon S3 and have made them public (let me know if I messed up; it is the first time I've done this).  I also intend, in the coming weeks, to convert them into Mahout files (if anyone wants to help, let me know).

There are 5 files:
https://s3.amazonaws.com/asf-mail-archives/public_a_d.tar
https://s3.amazonaws.com/asf-mail-archives/public_e_k.tar
https://s3.amazonaws.com/asf-mail-archives/public_l_o.tar
https://s3.amazonaws.com/asf-mail-archives/public_s_t.tar
https://s3.amazonaws.com/asf-mail-archives/public_u_z.tar

The tarballs are organized by Top Level Project name (e.g. Mahout is in the public_l_o.tar file).  I believe the tarballs contain GZIP files by date, and that the total uncompressed file size is somewhere in the 80-100GB range.  That should be sufficient to drive some semi-interesting things in terms of scale, even if it is towards the smaller end of things.
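
[Editor's sketch] For anyone wanting to walk one of these tarballs programmatically, here is a minimal Python sketch. It assumes the layout described above only tentatively: a plain tar whose regular-file members are gzip files. The function name iter_gzip_members is illustrative and not part of the archive or of any ASF tooling.

```python
import gzip
import io
import tarfile


def iter_gzip_members(tar_fileobj):
    """Yield (member_name, decompressed_bytes) for each gzip member of a tar stream.

    Assumption: the tar itself is uncompressed and its regular-file members
    are gzip files. Members that are not gzip data are skipped.
    """
    # Mode "r|" reads the tar as a forward-only stream, so a multi-GB
    # archive does not need to be seekable or loaded into memory at once.
    with tarfile.open(fileobj=tar_fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            raw = tar.extractfile(member).read()
            # gzip data starts with the magic bytes 0x1f 0x8b.
            if raw[:2] != b"\x1f\x8b":
                continue
            yield member.name, gzip.decompress(raw)
```

With a downloaded copy, one could open public_l_o.tar in binary mode and loop over iter_gzip_members(f) to print each member name and its decompressed size. Each member is decompressed fully in memory, which is fine for per-month list archives but would need a streaming variant for very large members.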

As the ASF has very clear public mailing list archive policies, it is my belief that this data set is completely unencumbered.

From an ORP standpoint, this might make for a first data set for evaluation once we have the evaluator framework in place.

Cheers,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com


Re: ASF Public Mail Archives on Amazon S3

Posted by Grant Ingersoll <gs...@apache.org>.
Hmmm, let me look.  I don't know if I will be able to recover it.


On Nov 17, 2010, at 1:48 PM, Michael McCandless wrote:

> Grant, public_p_r.tar seems to be missing?  Is that intentional?
> Maybe some super-secret project inside there :)
> 
> Mike
> 


Re: ASF Public Mail Archives on Amazon S3

Posted by Michael McCandless <lu...@mikemccandless.com>.
Grant, public_p_r.tar seems to be missing?  Is that intentional?
Maybe some super-secret project inside there :)

Mike
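
[Editor's sketch] The gap noted above can be checked mechanically. This small Python sketch assumes the file names encode inclusive first-letter ranges (public_X_Y.tar covering X through Y), which is inferred from the naming and not stated in the thread; missing_letters is a hypothetical helper name.

```python
import re
import string


def missing_letters(file_names):
    """Return the sorted lowercase letters not covered by any public_X_Y.tar range."""
    covered = set()
    for name in file_names:
        m = re.fullmatch(r"public_([a-z])_([a-z])\.tar", name)
        if m is None:
            continue  # ignore names that do not follow the assumed pattern
        start, end = m.group(1), m.group(2)
        # Ranges are treated as inclusive on both ends (an assumption).
        covered.update(chr(c) for c in range(ord(start), ord(end) + 1))
    return sorted(set(string.ascii_lowercase) - covered)


# The five file names listed in the original announcement.
posted = [
    "public_a_d.tar",
    "public_e_k.tar",
    "public_l_o.tar",
    "public_s_t.tar",
    "public_u_z.tar",
]
print(missing_letters(posted))  # ['p', 'q', 'r']
```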
