Posted to openrelevance-dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2010/11/17 19:48:49 UTC

Re: ASF Public Mail Archives on Amazon S3

Grant, public_p_r.tar seems to be missing?  Is that intentional?
Maybe some super-secret project inside there :)

Mike
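
The gap Mike points out can be checked mechanically from the file names alone. The sketch below assumes that each name of the form public_<x>_<y>.tar covers the inclusive range of first letters x through y; that reading is inferred from the names and is not confirmed anywhere in the thread. The helper name missing_letters is made up for illustration.

```python
import re
import string


def missing_letters(filenames):
    """Return the letters a-z not covered by any public_<x>_<y>.tar range.

    Assumption: <x> and <y> are the inclusive first and last initial
    letters of the projects stored in that tarball.
    """
    covered = set()
    for name in filenames:
        match = re.fullmatch(r"public_([a-z])_([a-z])\.tar", name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        first, last = match.groups()
        covered.update(chr(c) for c in range(ord(first), ord(last) + 1))
    return [c for c in string.ascii_lowercase if c not in covered]


# The five file names listed in Grant's original message.
files = [
    "public_a_d.tar",
    "public_e_k.tar",
    "public_l_o.tar",
    "public_s_t.tar",
    "public_u_z.tar",
]

print(missing_letters(files))  # ['p', 'q', 'r']
```

Under that assumption, the uncovered letters are exactly p through r, which matches the missing public_p_r.tar.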

On Thu, Oct 14, 2010 at 12:05 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Hi ORPers,
>
> I put up the complete ASF public mail archives as of about 3 weeks ago on Amazon's S3 and have made them public (let me know if I messed up; it is the first time I've done this).  I also intend, in the coming weeks, to convert them into Mahout files (if anyone wants to help, let me know).
>
> There are 5 files:
> https://s3.amazonaws.com/asf-mail-archives/public_a_d.tar
> https://s3.amazonaws.com/asf-mail-archives/public_e_k.tar
> https://s3.amazonaws.com/asf-mail-archives/public_l_o.tar
> https://s3.amazonaws.com/asf-mail-archives/public_s_t.tar
> https://s3.amazonaws.com/asf-mail-archives/public_u_z.tar
>
> The tarballs are organized by Top Level Project name (e.g. Mahout is in the public_l_o.tar file).  The tarballs contain GZIP files by date, I believe, and the total uncompressed file size is somewhere in the 80-100GB range.  That should be sufficient to drive some semi-interesting things in terms of scale, even if it is towards the smaller end of things.
>
> As the ASF has very clear public mailing list archive policies, it is my belief that this data set is completely unencumbered.
>
> From an ORP standpoint, this might make for a first data set for evaluation once we have the evaluator framework in place.
>
> Cheers,
> Grant
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
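
Grant describes the layout as tarballs holding GZIP files by date. A minimal sketch of reading such an archive with Python's standard library follows; it has not been run against the real files. The member path "mahout/dev/201010.gz" and the helper name iter_gzip_members are hypothetical, and the demo builds a tiny stand-in tarball rather than touching the S3 data.

```python
import gzip
import io
import os
import tarfile
import tempfile


def iter_gzip_members(tar_path):
    """Yield (member name, decompressed bytes) for each .gz file in a tar."""
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            # Skip directories and anything that is not a gzip file.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            raw = archive.extractfile(member)
            with gzip.open(raw, mode="rb") as unzipped:
                yield member.name, unzipped.read()


# Demo on a stand-in archive; the internal path below is an assumption
# about the layout, not something confirmed in the thread.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "sample.tar")
    payload = gzip.compress(b"From: someone@example.org\n\nhello\n")
    with tarfile.open(tar_path, mode="w") as archive:
        info = tarfile.TarInfo(name="mahout/dev/201010.gz")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    results = list(iter_gzip_members(tar_path))

for name, data in results:
    print(name, len(data), "bytes")
```

Each member is decompressed fully into memory here, which keeps the sketch short. For a data set in the 80-100GB range, reading each gzip stream in chunks or line by line would likely be the safer choice.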

Re: ASF Public Mail Archives on Amazon S3

Posted by Grant Ingersoll <gs...@apache.org>.
Hmmm, let me look.  I don't know if I will be able to recover it.


On Nov 17, 2010, at 1:48 PM, Michael McCandless wrote:

> Grant, public_p_r.tar seems to be missing?  Is that intentional?
> Maybe some super-secret project inside there :)
> 
> Mike