Posted to commits@mahout.apache.org by co...@apache.org on 2012/03/25 23:58:00 UTC

[CONF] Apache Mahout > Collections

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Collections (https://cwiki.apache.org/confluence/display/MAHOUT/Collections)


Edited by Lance Norskog:
---------------------------------------------------------------------
TODO: Organize these somehow, add one-line blurbs
Organize by usage? (classification, recommendation etc.)

h2. Collections of Collections

- [ML Data|http://mldata.org/about/] ... repository supported by Pascal 2.
- [DBPedia|http://wiki.dbpedia.org/Downloads30]
- [UCI Machine Learning Repo|http://archive.ics.uci.edu/ml/]
- [http://mloss.org/community/blog/2008/sep/19/data-sources/]
- [Linked Library Data|http://ckan.net/group/lld] via CKAN
- [InfoChimps|http://infochimps.com/] Free and purchasable datasets
- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle] LinkedIn discussion of lots of data sets

h2. Categorization Data

- [20Newsgroups|http://people.csail.mit.edu/jrennie/20Newsgroups/]
- [RCV1 data set|http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm]
- [10 years of CLEF Data|http://direct.dei.unipd.it/]
- http://ece.ut.ac.ir/DBRG/Hamshahri/ (approximately 160k categorized docs)
- http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/ (newer beta version, approximately 320k categorized docs)

h2. Recommendation Data

- [Netflix Prize/Dataset|http://www.netflixprize.com/download]
- [Book usage and recommendation data from the University of Huddersfield|http://library.hud.ac.uk/data/usagedata/]
- [Last.fm|http://denoiserthebetter.posterous.com/music-recommendation-datasets] - Non-commercial use only
- [Amazon Product Review Data via Jindal and Liu|http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html] - Scroll down the page
- [GroupLens/MovieLens Movie Review Dataset|http://www.grouplens.org/node/73]

h2. Multilingual Data

- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php] - 308,000 subtitle files covering about 18,900 movies in 59 languages (July 2006 numbers). The original site, OpenSubtitles.org, is up to 1.6 million subtitle files.
- [Statistical Machine Translation|http://www.statmt.org/] - Devoted to all things language translation. Includes multilingual corpora of European and Canadian legal tomes.

h2. Geospatial

- [Natural Earth Data|http://www.naturalearthdata.com/]
- [Open Street Maps|http://wiki.openstreetmap.org/wiki/Main_Page]
And other crowd-sourced mapping data sites.

h2. Airline

- [Open Flights|http://openflights.org/] - Crowd-sourced database of airlines, flights, airports, times, etc.
- [Airline on-time information - 1987-2008|http://stat-computing.org/dataexpo/2009/] - 120 million CSV records, 12 GB uncompressed
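
At roughly 12 GB uncompressed, the airline on-time CSV data is likely too large to load into memory on a typical workstation, so it is usually easier to stream it row by row. The following is only a minimal sketch of that idea, not part of the dataset's own tooling: the helper name {{count_by_column}} and the column names in the sample are hypothetical and should be checked against the real file header.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Stream CSV rows one at a time and count each value seen in `column`.

    `lines` is any iterable of text lines (an open file, a StringIO, ...),
    so only one row is held in memory at a time.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny in-memory stand-in for a real file.
# The column names "Year" and "Carrier" are hypothetical.
sample = io.StringIO("Year,Carrier\n1987,AA\n1987,UA\n1988,AA\n")
print(count_by_column(sample, "Carrier"))
```

For a real download, the same function could be given a file object opened with {{open(path, newline="")}}, where {{path}} points at one of the extracted CSV files.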

h2. General Resources

- [theinfo|http://theinfo.org/]
- [WordNet|http://wordnet.princeton.edu/obtain]
- [Common Crawl|http://www.commoncrawl.org/] - freely available web crawl on EC2

h2. Stuff

- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html]
- [4 Universities Data Set|http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/]
- [Large crawl of Twitter|http://an.kaist.ac.kr/traces/WWW2010.html]
- [UniProt|http://beta.uniprot.org/]
- [http://www.icwsm.org/2009/data/]
- http://data.gov
- http://www.ckan.net/
- http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world
- http://data.gov.uk/

