Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/07/24 18:53:09 UTC

[Lucene-java Wiki] Update of "OpenRelevance" by PeteSkomoroch

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by PeteSkomoroch:
http://wiki.apache.org/lucene-java/OpenRelevance

The comment on the change is:
Suggesting Amazon S3 or EBS for easy data distribution channel

------------------------------------------------------------------------------
  
  Editing of relevance judgments can be performed through a web application, so the infrastructure needs to provide a servlet container. Search functionality will also be provided by a web application.
  
- Distribution of the corpus is the most demanding aspect of this project. Due to its size (~100GB) it's not practical to offer this corpus as a traditional download. ''(use P2P ? create subsets? distribute on HDD ?)''
+ Distribution of the corpus is the most demanding aspect of this project. Due to its size (~100GB) it's not practical to offer this corpus as a traditional download. ''(use P2P ? create subsets? distribute on HDD ?)''. Amazon S3 and EBS (via [http://aws.amazon.com/publicdatasets/ Amazon Public Datasets]) are efficient and inexpensive options for distributing large datasets. Uploading to a public S3 bucket is the easiest option and automatically [http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?S3Torrent.html makes the uploaded data available via torrent]. Datasets up to 1 TB [http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset can also be distributed] via free public EBS volumes.
  
  == Queries ==