Posted to commits@mahout.apache.org by sr...@apache.org on 2008/05/17 21:50:48 UTC

svn commit: r657446 - in /lucene/mahout/site/src/documentation/content/xdocs: images/taste-architecture.png taste.xml

Author: srowen
Date: Sat May 17 12:50:47 2008
New Revision: 657446

URL: http://svn.apache.org/viewvc?rev=657446&view=rev
Log:
Initial checkin of Taste docs

Added:
    lucene/mahout/site/src/documentation/content/xdocs/images/taste-architecture.png   (with props)
    lucene/mahout/site/src/documentation/content/xdocs/taste.xml

Added: lucene/mahout/site/src/documentation/content/xdocs/images/taste-architecture.png
URL: http://svn.apache.org/viewvc/lucene/mahout/site/src/documentation/content/xdocs/images/taste-architecture.png?rev=657446&view=auto
==============================================================================
Binary file - no diff available.

Propchange: lucene/mahout/site/src/documentation/content/xdocs/images/taste-architecture.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: lucene/mahout/site/src/documentation/content/xdocs/taste.xml
URL: http://svn.apache.org/viewvc/lucene/mahout/site/src/documentation/content/xdocs/taste.xml?rev=657446&view=auto
==============================================================================
--- lucene/mahout/site/src/documentation/content/xdocs/taste.xml (added)
+++ lucene/mahout/site/src/documentation/content/xdocs/taste.xml Sat May 17 12:50:47 2008
@@ -0,0 +1,406 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<document>
+<header><title>Apache Mahout - Taste Documentation</title></header>
+<properties>
+<author email="srowen@apache.org">Sean Owen</author>
+</properties>
+<body>
+
+<section id="overview"><title>Overview</title>
+
+<p>Taste is a flexible, fast collaborative filtering engine for Java. The engine takes users'
+preferences for items ("tastes") and returns estimated preferences for other items. For example, a
+site that sells books or CDs could easily use Taste to figure out, from past purchase data, which
+CDs a customer might be interested in listening to.</p>
+
+<p>Taste provides a rich set of components from which you can construct a customized recommender
+system, using a selection of algorithms. Taste is designed to be enterprise-ready: it is built for
+performance, scalability and flexibility.
+Taste is not just for Java; it can be run as an external server which exposes recommendation logic
+to your application via web services and HTTP.</p>
+
+<p>Top-level packages define the Taste interfaces to these key abstractions:</p>
+
+<ul>
+  <li><code>DataModel</code></li>
+  <li><code>UserCorrelation</code> and <code>ItemCorrelation</code></li>
+  <li><code>UserNeighborhood</code></li>
+  <li><code>Recommender</code></li>
+</ul>
+
+<p>Subpackages of <code>org.apache.mahout.cf.taste.impl</code> hold implementations of these interfaces.
+These are the pieces from which you will build your own recommendation engine. That's it!
+For the academically inclined, Taste supports both <em>memory-based</em> and <em>item-based</em>
+recommender systems, <em>slope one</em> recommenders, and a couple of other experimental implementations.
+It does not currently support <em>model-based</em> recommenders.</p>
+
+</section>
+
+<section id="architecture"><title>Architecture</title>
+
+<p class="centertext"><img src="images/taste-architecture.png" alt="Taste Architecture" height="1060" width="442"/></p>
+
+<p>This diagram shows the relationship between various Taste components in a user-based recommender.
+An item-based recommender system is similar except that there are no PreferenceInferrers or Neighborhood
+algorithms involved.</p>
+
+<section><title>Recommender</title>
+
+<p>A <code>Recommender</code> is the core abstraction in Taste. Given a <code>DataModel</code>, it can produce
+recommendations. Applications will most likely use the <code>GenericUserBasedRecommender</code> implementation
+or <code>GenericItemBasedRecommender</code>, possibly decorated by
+
+<code>CachingRecommender</code>.</p>
+
+</section>
+
+<section><title>DataModel</title>
+
+<p>A <code>DataModel</code> is the interface to information about user preferences. An implementation might
+draw this data from any source, but a database is the most likely source. Taste provides <code>MySQLJDBCDataModel</code>
+to access preference data from a database via JDBC, though many applications will want to write their own.
+Taste also provides a <code>FileDataModel</code>.</p>
+
+<p>Along with <code>DataModel</code>, Taste uses the <code>User</code>, <code>Item</code> and
+<code>Preference</code> abstractions to represent the users, items, and preferences for those items in the
+recommendation engine. Custom <code>DataModel</code> implementations would return implementations of these
+interfaces that are appropriate to the application - maybe an <code>OnlineUser</code> implementation
+that represents an online store user, and a <code>BookItem</code> implementation representing a book.</p>
+
+</section>
+
+<section><title>UserCorrelation, ItemCorrelation</title>
+
+<p>A <code>UserCorrelation</code> defines a notion of similarity between two <code>User</code>s.
+This is a crucial part of a recommendation engine. A <code>UserCorrelation</code> is attached to a <code>UserNeighborhood</code> implementation.
+<code>ItemCorrelation</code>s are analogous, but find similarity between <code>Item</code>s.</p>
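+
+<p>To make the idea of a correlation concrete, here is a minimal, self-contained sketch of a Pearson
+correlation between two users, computed over the items both have rated. It is an illustration only and
+is not Taste's <code>PearsonCorrelation</code> implementation; the class name <code>PearsonSketch</code>,
+the array-based representation of preferences and the sample values are all assumptions made for the example:</p>
+
+<pre>public class PearsonSketch {
+
+  // Pearson correlation over the items both users have rated.
+  // a[i] and b[i] are the two users' preferences for the same item i.
+  static double pearson(double[] a, double[] b) {
+    int n = a.length;
+    double sumA = 0.0;
+    double sumB = 0.0;
+    for (int i = 0; i &lt; n; i++) {
+      sumA += a[i];
+      sumB += b[i];
+    }
+    double meanA = sumA / n;
+    double meanB = sumB / n;
+    double cov = 0.0;
+    double varA = 0.0;
+    double varB = 0.0;
+    for (int i = 0; i &lt; n; i++) {
+      double da = a[i] - meanA;
+      double db = b[i] - meanB;
+      cov += da * db;
+      varA += da * da;
+      varB += db * db;
+    }
+    double denom = Math.sqrt(varA * varB);
+    // Undefined when either user rates everything identically; report 0 in that case
+    return denom == 0.0 ? 0.0 : cov / denom;
+  }
+
+  public static void main(String[] args) {
+    double[] alice = {5.0, 3.0, 4.0};
+    double[] bob = {4.0, 2.0, 3.0};   // same pattern as alice, one point lower
+    double[] carol = {1.0, 5.0, 2.0}; // roughly the opposite pattern
+    System.out.println(pearson(alice, bob));   // prints 1.0
+    System.out.println(pearson(alice, carol)); // a negative value
+  }
+}
+</pre>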
+
+</section>
+
+<section><title>UserNeighborhood</title>
+
+<p>In a user-based recommender, recommendations are produced by finding a "neighborhood" of
+similar users near a given user. A <code>UserNeighborhood</code> defines a means of determining
+that neighborhood &#8212; for example, nearest 10 users. Implementations typically need a
+<code>UserCorrelation</code> to operate.</p>
+
+</section>
+
+</section>
+
+<section id="requirements"><title>Requirements</title>
+
+<section><title>Required</title>
+
+<ul>
+ <li><a href="http://java.sun.com/j2se/1.5.0/index.jsp">Java / J2SE 5.0</a></li>
+</ul>
+
+</section>
+
+<section><title>Optional</title>
+
+<ul>
+ <li><a href="http://ant.apache.org/">Apache Ant</a> 1.5 or later,
+  if you want to build from source or build examples.</li>
+ <li>Taste web applications require a <a href="http://java.sun.com/products/servlet/index.jsp">Servlet 2.3+</a>
+  container, such as
+  <a href="http://jakarta.apache.org/tomcat/">Jakarta Tomcat</a>. It may in fact work with older
+  containers with slight modification.</li>
+ <li>The <code>MySQLJDBCDataModel</code> implementation requires a
+  <a href="http://www.mysql.com/products/mysql/">MySQL 4.x</a> (or later) database.
+  Again, it may be made to work with earlier versions or other databases with slight changes.</li>
+
+</ul>
+
+</section>
+
+</section>
+
+<section id="demo"><title>Demo</title>
+
+<p>To build and run the demo, follow the instructions below, which are written for Unix-like operating systems:</p>
+
+<ol>
+  <li>Download the "1 Million MovieLens Dataset" from
+   <a href="http://www.grouplens.org/">http://www.grouplens.org/</a>.</li>
+
+  <li>Unpack the archive and copy <code>movies.dat</code> and <code>ratings.dat</code> to
+   <code>src/example/org/apache/mahout/cf/taste/example/grouplens</code> under the Taste distribution
+   directory.</li>
+  <li>Build the example web application by executing <code>ant build-grouplens-example</code> in the directory
+    where you unpacked the Taste distribution. This produces <code>taste.war</code>.</li>
+
+  <li><a href="http://tomcat.apache.org/download-55.cgi">Download</a> and install Tomcat.</li>
+  <li>Copy <code>taste.war</code> to the <code>webapps</code> directory under the Tomcat installation directory.</li>
+  <li>Increase the heap space that is given to Tomcat by setting the <code>JAVA_OPTS</code>
+      environment variable to "<code>-server -da -dsa -Xms1024m -Xmx1024m</code>", to allow 1024MB of heap space and
+    enable performance optimizations. Using <code>bash</code>,
+      one can do this with the command <code>export JAVA_OPTS="..."</code>.</li>
+  <li>Start Tomcat. This is usually done by running <code>bin/startup.sh</code>
+      from the Tomcat installation directory. You may get an error asking you to set <code>JAVA_HOME</code>; do
+      so as above.</li>
+
+  <li>Get recommendations by accessing the web application in your browser:<br/>
+    <code>http://localhost:8080/taste/RecommenderServlet?userID=1</code><br/>
+    This will produce a simple preference-item ID list which could be consumed by a client application.
+    Get more useful human-readable output with the <code>debug</code> parameter:<br/>
+    <code>http://localhost:8080/taste/RecommenderServlet?userID=1&amp;debug=true</code></li>
+</ol>
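+
+<p>A client application might consume the servlet's output along the following lines. This is only a rough
+sketch: it assumes the demo is running at the URL shown in the last step and that the response is UTF-8 text.
+The class name <code>RecommenderClientSketch</code> is hypothetical, and the code prints each line as-is
+rather than assuming any particular response format:</p>
+
+<pre>import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.net.URL;
+
+public class RecommenderClientSketch {
+
+  public static void main(String[] args) throws IOException {
+    // URL from the demo steps; host, port and userID will differ per deployment
+    URL url = new URL("http://localhost:8080/taste/RecommenderServlet?userID=1");
+    // UTF-8 is an assumption about the response encoding
+    BufferedReader in =
+      new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
+    try {
+      String line;
+      while ((line = in.readLine()) != null) {
+        // Print each line of the response unchanged
+        System.out.println(line);
+      }
+    } finally {
+      in.close();
+    }
+  }
+}
+</pre>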
+
+<p>Incidentally, Taste's web service interface may then be found at:<br/>
+<code>http://localhost:8080/taste/RecommenderService.jws</code><br/>
+Its WSDL file will be here...<br/>
+<code>http://localhost:8080/taste/RecommenderService.jws?wsdl</code><br/>
+... and you can even access it in your browser via a simple HTTP request:<br/>
+<code>.../RecommenderService.jws?method=recommend&amp;userID=1&amp;howMany=10</code></p>
+
+</section>
+
+<section id="examples"><title>Examples</title>
+
+<section><title>User-based Recommender</title>
+
+<p>User-based recommenders are the "original", conventional style of recommender system. They can produce good
+recommendations when tweaked properly; they are not necessarily the fastest recommender systems and
+are thus suitable for small data sets (roughly, fewer than a million ratings). We'll start with an example of this.</p>
+
+<p>First, create a <code>DataModel</code> of some kind. Here, we'll use a simple one based
+on data in a file:</p>
+
+<pre>DataModel model = new FileDataModel(new File("data.txt"));
+</pre>
+
+<p>We'll use the <code>PearsonCorrelation</code> implementation of <code>UserCorrelation</code> as our user
+correlation algorithm, and add an optional preference inference algorithm:</p>
+
+<pre>UserCorrelation userCorrelation = new PearsonCorrelation(model);
+// Optional:
+userCorrelation.setPreferenceInferrer(new AveragingPreferenceInferrer());
+</pre>
+
+<p>Now we create a <code>UserNeighborhood</code> algorithm. Here we use nearest-3:</p>
+
+<pre>UserNeighborhood neighborhood =
+  new NearestNUserNeighborhood(3, userCorrelation, model);
+</pre>
+
+<p>Now we can create our <code>Recommender</code>, and add a caching decorator:</p>
+
+<pre>Recommender recommender =
+  new GenericUserBasedRecommender(model, neighborhood, userCorrelation);
+Recommender cachingRecommender = new CachingRecommender(recommender);
+</pre>
+
+<p>Now we can get 10 recommendations for user ID "1234" &#8212; done!</p>
+
+<pre>List&lt;RecommendedItem&gt; recommendations =
+  cachingRecommender.recommend("1234", 10);
+</pre>
+
+</section>
+
+<section><title>Item-based Recommender</title>
+
+<p>We could have created an item-based recommender instead. Item-based recommenders base recommendations
+not on user similarity, but on item similarity. In theory these are about the same approach to the
+problem, just from different angles. However, the similarity of two items is relatively fixed, more so
+than the similarity of two users. So, item-based recommenders can use pre-computed similarity values
+in the computations, which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.</p>
+
+<p>Let's start over, again with a <code>FileDataModel</code> to start:</p>
+
+<pre>DataModel model = new FileDataModel(new File("data.txt"));
+</pre>
+
+<p>We'll also need an <code>ItemCorrelation</code>. We could use <code>PearsonCorrelation</code>,
+which computes item similarity in real time, but this is generally too slow to be useful.
+Instead, in a real application, you would feed a list of pre-computed correlations to
+a <code>GenericItemCorrelation</code>:</p>
+
+<pre>// Construct the list of pre-computed correlations
+Collection&lt;GenericItemCorrelation.ItemItemCorrelation&gt; correlations =
+  ...;
+ItemCorrelation itemCorrelation =
+  new GenericItemCorrelation(correlations);
+
+</pre>
+
+<p>Then we can finish as before to produce recommendations:</p>
+
+<pre>Recommender recommender =
+  new GenericItemBasedRecommender(model, itemCorrelation);
+Recommender cachingRecommender = new CachingRecommender(recommender);
+...
+List&lt;RecommendedItem&gt; recommendations =
+  cachingRecommender.recommend("1234", 10);
+</pre>
+
+</section>
+
+<section><title>Slope-One Recommender</title>
+
+<p>This is a simple yet effective <code>Recommender</code> and we present another example to
+round out the list:</p>
+
+<pre>DataModel model = new FileDataModel(new File("data.txt"));
+// Make a weighted slope one recommender
+Recommender recommender = new SlopeOneRecommender(model);
+Recommender cachingRecommender = new CachingRecommender(recommender);
+</pre>
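+
+<p>For intuition about what such a recommender computes, here is a small self-contained sketch of the
+weighted slope one scheme from the Lemire and Maclachlan paper listed under Useful Links. It is not
+Taste's <code>SlopeOneRecommender</code>; the class name <code>SlopeOneSketch</code>, the use of
+<code>NaN</code> for a missing rating and the sample ratings are assumptions made for the example:</p>
+
+<pre>public class SlopeOneSketch {
+
+  // ratings[u][i] is user u's rating of item i, or NaN if unrated.
+  // Returns the predicted rating of item 'target' for 'user', or NaN if none can be made.
+  static double predict(double[][] ratings, int user, int target) {
+    double numerator = 0.0;
+    int totalCount = 0;
+    for (int i = 0; i &lt; ratings[user].length; i++) {
+      if (i == target || Double.isNaN(ratings[user][i])) {
+        continue;
+      }
+      // Sum of differences (target - i) over other users who rated both items
+      double diffSum = 0.0;
+      int count = 0;
+      for (int v = 0; v &lt; ratings.length; v++) {
+        if (v != user &amp;&amp; !Double.isNaN(ratings[v][target]) &amp;&amp; !Double.isNaN(ratings[v][i])) {
+          diffSum += ratings[v][target] - ratings[v][i];
+          count++;
+        }
+      }
+      if (count &gt; 0) {
+        // Weighted form: (averageDiff + userRating) * count == diffSum + userRating * count
+        numerator += diffSum + ratings[user][i] * count;
+        totalCount += count;
+      }
+    }
+    return totalCount == 0 ? Double.NaN : numerator / totalCount;
+  }
+
+  public static void main(String[] args) {
+    double unknown = Double.NaN;
+    double[][] ratings = {
+      {5.0, 3.0, 2.0},
+      {3.0, 4.0, unknown},
+      {unknown, 2.0, 5.0},
+    };
+    // Predict the third user's rating for the first item: 13/3, about 4.33
+    System.out.println(predict(ratings, 2, 0));
+  }
+}
+</pre>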
+
+</section>
+
+</section>
+
+<section id="integration"><title>Integration with your application</title>
+
+<section><title>Direct</title>
+
+<p>You can create a <code>Recommender</code>, as shown above, wherever you like in your Java application, and use it. This
+includes simple Java applications or GUI applications, server applications, and J2EE web applications.</p>
+
+</section>
+
+<section><title>Standalone server</title>
+
+<p>Taste can also be run as an external server, which may be the only option for non-Java applications.
+A Taste Recommender can be exposed as a web application via <code>org.apache.mahout.cf.taste.web.RecommenderServlet</code>,
+and your application can then access recommendations via simple HTTP requests and responses, or as a
+full-fledged SOAP web service. See above, and see
+<code>the javadoc</code> for details.</p>
+
+<p>To deploy your <code>Recommender</code> as an external server:</p>
+
+<ol>
+  <li>Create an implementation of <code>org.apache.mahout.cf.taste.recommender.Recommender</code>.</li>
+
+  <li>Compile it and create a JAR file containing your implementation.</li>
+  <li>Build a WAR file that will run your Recommender as a web application:<br/>
+  <code>ant -Dmy-recommender.jar=yourJARfile.jar -Dmy-recommender-class=com.foo.YourRecommender build-server</code></li>
+  <li>Follow from the "Install Tomcat" step above under <a href="#demo">Demo</a>.</li>
+</ol>
+
+</section>
+
+</section>
+
+<section id="performance"><title>Performance</title>
+
+<section><title>Runtime Performance</title>
+
+<p>The more data you give Taste, the better. Though Taste is designed for performance, you will undoubtedly run into
+performance issues at some point. For best results, consider using the following command-line flags for your JVM:</p>
+
+<ul>
+  <li><code>-server</code>: Enables the server VM, which is generally appropriate for long-running,
+  computation-intensive applications.</li>
+  <li><code>-Xms1024m -Xmx1024m</code>: Make the heap as big as possible -- a gigabyte doesn't hurt when dealing
+  with millions of preferences. Taste will generally use as much memory as you give it for caching, which helps
+  performance. Set the initial and max size to the same value to avoid wasting time growing the
+  heap, and to keep the JVM from running minor collections in an effort to avoid growing the heap, which would clear
+  cached values.</li>
+  <li><code>-da -dsa</code>: Disable all assertions.</li>
+  <li><code>-XX:+UseParallelGC</code> (multi-processor machines only): Use a GC algorithm designed to take
+  advantage of multiple processors, and designed for throughput. This is a default in J2SE 5.0.</li>
+  <li><code>-XX:+DisableExplicitGC</code>: Disable calls to <code>System.gc()</code>. These calls can only
+  hurt in the presence of modern GC algorithms; they may force Taste to remove cached data needlessly.
+  This flag isn't needed if you're sure your code and third-party code you use don't call this method.</li>
+</ul>
+
+<p>Also consider the following tips:</p>
+
+<ul>
+  <li>Use <code>CachingRecommender</code> on top of your custom <code>Recommender</code> implementation.</li>
+  <li>When using <code>JDBCDataModel</code>, make sure you've taken basic steps to optimize the table storing
+  preference data. Create a primary key on the user ID and item ID columns, and an index on them. Set them to
+  be non-null. And so on. Tune your database for lots of concurrent reads! When using JDBC,
+  the database is almost always the bottleneck, so plenty of memory and caching are even more important.</li>
+
+  <li>Also, pooling database connections is essential to performance. If using a J2EE container, it probably
+  provides a way to configure connection pools. If you are creating your own <code>DataSource</code> directly,
+  try wrapping it in <code>org.apache.mahout.cf.taste.impl.model.jdbc.ConnectionPoolDataSource</code>.</li>
+  <li>See MySQL-specific notes on performance in the javadoc for
+  <code>MySQLJDBCDataModel</code>.</li>
+</ul>
+
+</section>
+
+<section><title>Algorithm Performance: Which One Is Best?</title>
+
+<p>There is no right answer; it depends on your data, your application, environment, and performance needs.
+Taste provides the building blocks from which you can construct the best <code>Recommender</code> for your
+application. The links below provide research on this topic. You will probably need a bit of trial-and-error to find
+a setup that works best. The code samples above provide a good starting point.</p>
+
+<p>Fortunately, Taste provides a way to evaluate the accuracy of your <code>Recommender</code> on your own
+data, in <code>org.apache.mahout.cf.taste.eval</code>:</p>
+
+<pre>DataModel myModel = ...;
+RecommenderBuilder builder = new RecommenderBuilder() {
+    public Recommender buildRecommender(DataModel model) {
+      // build and return the Recommender to evaluate here
+    }
+  };
+RecommenderEvaluator evaluator =
+  new AverageAbsoluteDifferenceRecommenderEvaluator();
+double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+</pre>
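+
+<p>As its name suggests, we would expect this evaluator to report an average absolute difference between
+estimated and actual preferences, where lower is better. Assuming that reading is right, the arithmetic
+can be sketched as follows. The class and method names here are hypothetical and the values are made up;
+this is not the evaluator's actual code:</p>
+
+<pre>public class AverageAbsoluteDifferenceSketch {
+
+  // Mean of |estimated - actual| over a set of held-out preferences
+  static double averageAbsoluteDifference(double[] estimated, double[] actual) {
+    double total = 0.0;
+    for (int i = 0; i &lt; actual.length; i++) {
+      total += Math.abs(estimated[i] - actual[i]);
+    }
+    return total / actual.length;
+  }
+
+  public static void main(String[] args) {
+    double[] actual = {4.0, 2.0, 5.0, 3.0};
+    double[] estimated = {3.5, 2.5, 4.0, 3.0};
+    // (0.5 + 0.5 + 1.0 + 0.0) / 4
+    System.out.println(averageAbsoluteDifference(estimated, actual)); // prints 0.5
+  }
+}
+</pre>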
+
+</section>
+
+</section>
+
+<section id="useful"><title>Useful Links</title>
+
+<p>You'll want to look at these packages too, which offer more algorithms and approaches that you
+may find useful:</p>
+
+<ul>
+  <li><a href="http://www.nongnu.org/cofi/">Cofi</a>: A Java-Based Collaborative Filtering Library</li>
+  <li><a href="http://eecs.oregonstate.edu/iis/CoFE/">CoFE</a></li>
+</ul>
+
+<p>Here's a handful of research papers that I've read and found particularly useful:</p>
+
+<blockquote cite="http://research.microsoft.com/research/pubs/view.aspx?tr_id=166"><p>J.S. Breese, D. Heckerman
+ and C. Kadie, "<a href="http://research.microsoft.com/research/pubs/view.aspx?tr_id=166">Empirical Analysis of
+ Predictive Algorithms for Collaborative Filtering</a>,"
+ in Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI 1998),
+ 1998.</p></blockquote>
+<blockquote cite="http://www10.org/cdrom/papers/519/"><p>B. Sarwar, G. Karypis, J. Konstan and J. Riedl,
+ "<a href="http://www10.org/cdrom/papers/519/">Item-based collaborative filtering recommendation
+ algorithms</a>," in Proceedings of the Tenth International Conference on the World Wide Web (WWW 10),
+ pp. 285-295, 2001.</p></blockquote>
+<blockquote cite="http://doi.acm.org/10.1145/192844.192905"><p>P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl,
+ "<a href="http://doi.acm.org/10.1145/192844.192905">GroupLens: an open architecture for
+ collaborative filtering of netnews</a>," in Proceedings of the 1994 ACM conference on Computer Supported Cooperative
+ Work (CSCW 1994), pp. 175-186, 1994.</p></blockquote>
+<blockquote cite="http://www.grouplens.org/papers/pdf/algs.pdf"><p>J.L. Herlocker, J.A. Konstan,
+ A. Borchers and J. Riedl, "<a href="http://www.grouplens.org/papers/pdf/algs.pdf">An algorithmic framework for
+ performing collaborative filtering</a>," in Proceedings of the 22nd annual international ACM SIGIR Conference
+ on Research and Development in Information Retrieval (SIGIR 99), pp. 230-237, 1999.</p></blockquote>
+
+<blockquote cite="http://materialobjects.com/cf/MovieRecommender.pdf"><p>Clifford Lyon,
+ "<a href="http://materialobjects.com/cf/MovieRecommender.pdf">Movie Recommender</a>,"
+ CSCI E-280 final project, Harvard University, 2004.</p></blockquote>
+<blockquote cite="http://www.daniel-lemire.com/fr/abstracts/SDM2005.html"><p>Daniel Lemire, Anna Maclachlan,
+ "<a href="http://www.daniel-lemire.com/fr/abstracts/SDM2005.html">Slope One Predictors for Online Rating-Based
+ Collaborative Filtering</a>," Proceedings of SIAM Data Mining (SDM '05), 2005.</p></blockquote>
+<blockquote cite="http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf"><p>
+ Michelle Anderson, Marcel Ball, Harold Boley, Stephen Greene, Nancy Howse, Daniel Lemire and Sean McGrath,
+ "<a href="http://www.daniel-lemire.com/fr/documents/publications/racofi_nrc.pdf">RACOFI: A Rule-Applying Collaborative
+ Filtering System</a>," Proceedings of COLA '03, 2003.</p></blockquote>
+
+<p>These links will take you to all the collaborative filtering reading you could ever want!</p>
+
+<ul>
+ <li><a href="http://www.paulperry.net/notes/cf.asp">Paul Perry's notes</a></li>
+ <li><a href="http://jamesthornton.com/cf/">James Thornton's collaborative filtering resources</a></li>
+ <li><a href="http://www.daniel-lemire.com/blog/">Daniel Lemire's blog</a> which frequently covers collaborative filtering topics</li>
+</ul>
+
+</section>
+</body>
+</document>
\ No newline at end of file