Posted to commits@mahout.apache.org by bu...@apache.org on 2014/10/02 23:23:43 UTC

svn commit: r924454 - in /websites/staging/mahout/trunk/content: ./ users/recommender/intro-cooccurrence-spark.html

Author: buildbot
Date: Thu Oct  2 21:23:43 2014
New Revision: 924454

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Oct  2 21:23:43 2014
@@ -1 +1 @@
-1629066
+1629072

Modified: websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html (original)
+++ websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html Thu Oct  2 21:23:43 2014
@@ -552,26 +552,35 @@ runtime for a user.</p>
 <p>The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed 
 from user history and values derived from context (category being viewed for instance) or special precalculated data 
 (popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit 
-a very wide variety of circumstances. It allows recommendations to be made for items with no usage data and even allows 
-for gracefully degrading recommendations based on how much user history is available. </p>
+a very wide variety of circumstances.</p>
 <p>With the right mix of indicators developers can construct a single query that works for completely new items and new users 
-while working well for items with lots of interactions and users with many recorded actions. In other words adding in content and intrinsic 
-indicators allows developers to create a solution for the "cold-start" problem that gracefully improves with more user history
+while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic 
+indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history
 and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes 
 recommendations.</p>
 <h2 id="example-with-3-indicators">Example with 3 Indicators</h2>
-<p>You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and this is most easily done by using text files as described above. The data that is processed by these jobs is considered the <strong>training data</strong>. You will need some amount of user history in your recs query. It is typical to use the most recent user history but need not be exactly what is in the training set, which may include more historical data. Keeping the user history for query purposes could be done with a database by referencing some history from a users table. In the example above the two collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other descriptive metadata). </p>
-<p>We will need to create 1 indicator from the primary action (purchase) 1 cross-indicator from the secondary action (view) and 1 content-indicator for (tags). We'll have to run <em>spark-itemsimilarity</em> once and <em>spark-rowsimilarity</em> once.</p>
-<p>We have described how to create the indicator and cross-indicator for purchase and view (the <a href="#multiple-actions">How to use Multiple User 
+<p>You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and 
+this is most easily done by using text files as described above. The data that is processed by these jobs is considered the 
+training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history 
+but need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user 
+history for query purposes could be done with a database by storing it in a users table. In the example above the two 
+collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other 
+descriptive metadata). </p>
+<p>We will need to create 1 cooccurrence indicator from the primary action (purchase), 1 cross-action cooccurrence indicator 
+from the secondary action (view), 
+and 1 content indicator (tags). We'll have to run <em>spark-itemsimilarity</em> once and <em>spark-rowsimilarity</em> once.</p>
+<p>We have described how to create the collaborative filtering indicator and cross-indicator for purchase and view (the <a href="#multiple-actions">How to use Multiple User 
 Actions</a> section) but tags will be a slightly different process. We want to use the fact that 
 certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator 
-but rather a "content" or "metadata" type indicator since you are not using other users' tag viewing history, only the 
+but rather a "content" or "metadata" type indicator since you are not using other users' history, only the 
 individual that you are making recs for. This means that this method will make recommendations for items that have 
 no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one
- has purchased or viewed them yet. </p>
-<p>We could have treated viewing tags as a collaborative filtering cross-indicator by recording other users tag viewing history and that would probably give better results but here we are trying to illustrate recommending without CF data and using content-indicators. In the final query we will mix all 3 indicators.</p>
+ has purchased or viewed them yet. In the final query we will mix all 3 indicators.</p>
 <h2 id="content-indicator">Content Indicator</h2>
-<p>To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a content or metadata indicator. They are used when you want to find items that are similar to other items by using their content or metadata, not by which users interacted with them.</p>
+<p>To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find 
+items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a 
+content or metadata indicator. They are used when you want to find items that are similar to other items by using their 
+content or metadata, not by which users interacted with them.</p>
 <p>For this we need input of the form:</p>
 <div class="codehilite"><pre><span class="n">itemID</span><span class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span class="n">list</span><span class="o">-</span><span class="n">of</span><span class="o">-</span><span class="n">tags</span>
 <span class="p">...</span>
@@ -585,7 +594,10 @@ no collaborative filtering data, as happ
 </pre></div>
 
 
-<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar rows, which encode items in this case. As with the indicator and cross-indicator we use the --omitStrength option. The strengths created are probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling are finished we no longer need the strengths. We will get an indicator matrix of the form:</p>
+<p>We'll use <em>spark-rowsimilarity</em> because we are looking for similar rows, which encode items in this case. As with the 
+collaborative filtering indicator and cross-indicator we use the --omitStrength option. The strengths created are 
+probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling 
+is finished we no longer need the strengths. We will get an indicator matrix of the form:</p>
 <div class="codehilite"><pre><span class="n">itemID</span><span class="o">&lt;</span><span class="n">tab</span><span class="o">&gt;</span><span class="n">list</span><span class="o">-</span><span class="n">of</span><span class="o">-</span><span class="n">item</span> <span class="n">IDs</span>
 <span class="p">...</span>
 </pre></div>
@@ -598,13 +610,12 @@ no collaborative filtering data, as happ
 </pre></div>
 
 
-<p>We now have three indicators, two collaborative filtering type and one content type. Notice that purchase, view, and tags can all be recorded for users and so can be used in a recommendations query.</p>
+<p>We now have three indicators, two collaborative filtering type and one content type.</p>
 <h2 id="unified-recommender-query">Unified Recommender Query</h2>
 <p>The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. 
-For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. 
-This will allow matches from any indicator where AND queries require that an item have some similarity to all indicator 
-fields.</p>
-<p>We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags". We take the user's history that corresponds to each indicator and create a query of the form:</p>
+For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. </p>
+<p>We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags". 
+We take the user's history that corresponds to each indicator and create a query of the form:</p>
 <div class="codehilite"><pre><span class="n">Query</span><span class="o">:</span>
   <span class="n">field</span><span class="o">:</span> <span class="n">purchase</span><span class="o">;</span> <span class="n">q</span><span class="o">:</span><span class="n">user</span><span class="s1">&#39;s-purchase-history</span>
 <span class="s1">  field: view; q:user&#39;</span><span class="n">s</span> <span class="n">view</span><span class="o">-</span><span class="n">history</span>
@@ -612,7 +623,8 @@ fields.</p>
 </pre></div>
 
 
-<p>The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to the ones the user has already purchased. </p>
+<p>The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to 
+the ones the user has already purchased. </p>
 <p>This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be 
 translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. 
 For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator