Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/07/24 18:53:09 UTC
[Lucene-java Wiki] Update of "OpenRelevance" by PeteSkomoroch
The following page has been changed by PeteSkomoroch:
http://wiki.apache.org/lucene-java/OpenRelevance
The comment on the change is:
Suggesting Amazon S3 or EBS for easy data distribution channel
------------------------------------------------------------------------------
Editing of relevance judgments can be performed through a web application, so the infrastructure needs to provide a servlet container. Search functionality will also be provided by a web application.
- Distribution of the corpus is the most demanding aspect of this project. Due to its size (~100GB) it's not practical to offer this corpus as a traditional download. ''(use P2P ? create subsets? distribute on HDD ?)''
+ Distribution of the corpus is the most demanding aspect of this project. Due to its size (~100GB) it's not practical to offer this corpus as a traditional download. ''(use P2P ? create subsets? distribute on HDD ?)''. Amazon S3 and EBS (via [http://aws.amazon.com/publicdatasets/ Amazon Public Datasets]) are efficient and inexpensive options for distributing large datasets. Uploading to a public S3 bucket is the easiest option, and it automatically [http://docs.amazonwebservices.com/AmazonS3/2006-03-01/index.html?S3Torrent.html makes uploaded data available via torrent]. Datasets up to 1 TB [http://www.datawrangling.com/wikipedia-page-traffic-statistics-dataset can also be distributed] via free public EBS volumes.
== Queries ==