Posted to commits@mahout.apache.org by pa...@apache.org on 2014/08/29 19:44:12 UTC

svn commit: r1621351 - /mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext

Author: pat
Date: Fri Aug 29 17:44:12 2014
New Revision: 1621351

URL: http://svn.apache.org/r1621351
Log:
fixed some code markup

Modified:
    mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext?rev=1621351&r1=1621350&r2=1621351&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext Fri Aug 29 17:44:12 2014
@@ -12,88 +12,86 @@ Mahout's mapreduce version of itemsimila
 *spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will account for multi-modal interactions and create cross-indicator matrices allowing users to make use of much more data in creating recommendations or similar item lists.
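
 In matrix terms this is easy to sketch (notation only, not the exact implementation): if A is the user-by-item matrix of the primary action and B is the user-by-item matrix of the secondary action, the job computes LLR-filtered versions of

     [A'A]    the indicator matrix (cooccurrence of the primary action)
     [A'B]    the cross-indicator matrix (cross-cooccurrence of primary and secondary actions)

 where A' is the transpose of A. Only the item pairs that pass the LLR test are kept as indicators.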
 
 
-```
-spark-itemsimilarity Mahout 1.0-SNAPSHOT
-Usage: spark-itemsimilarity [options]
-
-Input, output options
-  -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list of 
-        HDFS supported URIs (required)
-  -i2 <value> | --input2 <value>
-        Secondary input path for cross-similarity calculation, same restrictions 
-        as "--input" (optional). Default: empty.
-  -o <value> | --output <value>
-        Path for output, any local or HDFS supported URI (required)
-
-Algorithm control options:
-  -mppu <value> | --maxPrefs <value>
-        Max number of preferences to consider per user (optional). Default: 500
-  -m <value> | --maxSimilaritiesPerItem <value>
-        Limit the number of similarities per item to this number (optional). 
-        Default: 100
-
-Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
-
-Input text file schema options:
-  -id <value> | --inDelim <value>
-        Input delimiter character (optional). Default: "[,\t]"
-  -f1 <value> | --filter1 <value>
-        String (or regex) whose presence indicates a datum for the primary item 
-        set (optional). Default: no filter, all data is used
-  -f2 <value> | --filter2 <value>
-        String (or regex) whose presence indicates a datum for the secondary item 
-        set (optional). If not present no secondary dataset is collected
-  -rc <value> | --rowIDPosition <value>
-        Column number (0 based Int) containing the row ID string (optional). 
-        Default: 0
-  -ic <value> | --itemIDPosition <value>
-        Column number (0 based Int) containing the item ID string (optional). 
-        Default: 1
-  -fc <value> | --filterPosition <value>
-        Column number (0 based Int) containing the filter string (optional). 
-        Default: -1 for no filter
-
-Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used
-
-File discovery options:
-  -r | --recursive
-        Searched the -i path recursively for files that match --filenamePattern 
-        (optional), default: false
-  -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: filename 
-        in the --input option or "^part-.*" if --input is a directory
-
-Output text file schema options:
-  -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). Default: 
-\t"
-  -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list (optional). 
-        Default: ":"
-  -td <value> | --elementDelim <value>
-        Separates vector element values in the values list (optional). Default: " "
-  -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: false.
-        This option is used to output indexable data for creating a search engine 
-        recommender.
-
-Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
-
-Spark config options:
-  -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can specify 
-        the number of cores to get a performance improvement, for example "local[4]"
-  -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional). 
-        Default: 4g
-
-General config options:
-  -rs <value> | --randomSeed <value>
-        
-  -h | --help
-        prints this usage text
-```
+    spark-itemsimilarity Mahout 1.0-SNAPSHOT
+    Usage: spark-itemsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited list of 
+            HDFS supported URIs (required)
+      -i2 <value> | --input2 <value>
+            Secondary input path for cross-similarity calculation, same restrictions 
+            as "--input" (optional). Default: empty.
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mppu <value> | --maxPrefs <value>
+            Max number of preferences to consider per user (optional). Default: 500
+      -m <value> | --maxSimilaritiesPerItem <value>
+            Limit the number of similarities per item to this number (optional). 
+            Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
+    
+    Input text file schema options:
+      -id <value> | --inDelim <value>
+            Input delimiter character (optional). Default: "[,\t]"
+      -f1 <value> | --filter1 <value>
+            String (or regex) whose presence indicates a datum for the primary item 
+            set (optional). Default: no filter, all data is used
+      -f2 <value> | --filter2 <value>
+            String (or regex) whose presence indicates a datum for the secondary item 
+            set (optional). If not present no secondary dataset is collected
+      -rc <value> | --rowIDPosition <value>
+            Column number (0 based Int) containing the row ID string (optional). 
+            Default: 0
+      -ic <value> | --itemIDPosition <value>
+            Column number (0 based Int) containing the item ID string (optional). 
+            Default: 1
+      -fc <value> | --filterPosition <value>
+            Column number (0 based Int) containing the filter string (optional). 
+            Default: -1 for no filter
+    
+    Using all defaults the input is expected to be of the form: "userID<tab>itemID" or "userID<tab>itemID<tab>any-text..." and all rows will be used
+    
+    File discovery options:
+      -r | --recursive
+            Search the -i path recursively for files that match --filenamePattern 
+            (optional). Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: filename 
+            in the --input option or "^part-.*" if --input is a directory
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
+            Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list (optional). 
+            Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: false.
+            This option is used to output indexable data for creating a search engine 
+            recommender.
+    
+    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can specify 
+            the number of cores to get a performance improvement, for example "local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node (optional). 
+            Default: 4g
+    
+    General config options:
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
 
 This looks daunting, but the defaults are fairly sane, so the job takes exactly the same input as the legacy code while remaining quite flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files to include can be specified with either a regex-style pattern or a filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations.
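 
 For example, a minimal sketch of file discovery (the paths and the pattern here are hypothetical) that recursively traverses a directory tree and consumes only the files matching a regex:
 
     bash$ mahout spark-itemsimilarity \
         --input /data/logs \
         --output /data/indicators \
         --recursive \
         --filenamePattern '.*\.csv'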
 
@@ -103,139 +101,123 @@ See ItemSimilarityDriver.scala in Mahout
 
 If all defaults are used the input can be as simple as:
 
-```
-userID1,itemID1
-userID2,itemID2
-...
-```
+    userID1,itemID1
+    userID2,itemID2
+    ...
 
 With the command line:
 
-```
-bash$ mahout spark-itemsimilarity --input in-file --output out-dir
-```
+
+    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
+
 
 This will use the "local" Spark context and will output the standard text version of a DRM:
 
-```itemID1<tab>itemID2:value2<space>itemID10:value10...
-```
+    itemID1<tab>itemID2:value2<space>itemID10:value10...
 
 ###More Complex Input
 
 For input of the form:
 
-```
-u1,purchase,iphone
-u1,purchase,ipad
-u2,purchase,nexus
-u2,purchase,galaxy
-u3,purchase,surface
-u4,purchase,iphone
-u4,purchase,galaxy
-u1,view,iphone
-u1,view,ipad
-u1,view,nexus
-u1,view,galaxy
-u2,view,iphone
-u2,view,ipad
-u2,view,nexus
-u2,view,galaxy
-u3,view,surface
-u3,view,nexus
-u4,view,iphone
-u4,view,ipad
-u4,view,galaxy
-```
+    u1,purchase,iphone
+    u1,purchase,ipad
+    u2,purchase,nexus
+    u2,purchase,galaxy
+    u3,purchase,surface
+    u4,purchase,iphone
+    u4,purchase,galaxy
+    u1,view,iphone
+    u1,view,ipad
+    u1,view,nexus
+    u1,view,galaxy
+    u2,view,iphone
+    u2,view,ipad
+    u2,view,nexus
+    u2,view,galaxy
+    u3,view,surface
+    u3,view,nexus
+    u4,view,iphone
+    u4,view,ipad
+    u4,view,galaxy
 
 ###Command Line
 
 
 The following options can be used:
 
-```
-bash$ mahout spark-itemsimilarity \
-	--input in-file \     # where to look for data
-    --output out-path \   # root dir for output
-    --master masterUrl \  # URL of the Spark master server
-    --filter1 purchase \  # word that flags input for the primary action
-    --filter2 view \      # word that flags input for the secondary action
-    --itemIDPosition 2 \  # column that has the item ID
-    --rowIDPosition 0 \   # column that has the user ID
-    --filterPosition 1    # column that has the filter word
-```
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \     # where to look for data
+        --output out-path \   # root dir for output
+        --master masterUrl \  # URL of the Spark master server
+        --filter1 purchase \  # word that flags input for the primary action
+        --filter2 view \      # word that flags input for the secondary action
+        --itemIDPosition 2 \  # column that has the item ID
+        --rowIDPosition 0 \   # column that has the user ID
+        --filterPosition 1    # column that has the filter word
+
 
 
 ###Output
 
 The output of the job will be the standard text version of two Mahout DRMs. Since this is a case where we are calculating cross-cooccurrence, both a primary indicator matrix and a cross-indicator matrix will be created:
 
-```
-out-path
-  |-- indicator-matrix - TDF part files
-  \-- cross-indicator-matrix - TDF part-files
+    out-path
+      |-- indicator-matrix - TDF part files
+      \-- cross-indicator-matrix - TDF part-files
 
-```
 The indicator matrix will contain the lines:
 
-```
-galaxy\tnexus:1.7260924347106847
-ipad\tiphone:1.7260924347106847
-nexus\tgalaxy:1.7260924347106847
-iphone\tipad:1.7260924347106847
-surface
-```
+    galaxy\tnexus:1.7260924347106847
+    ipad\tiphone:1.7260924347106847
+    nexus\tgalaxy:1.7260924347106847
+    iphone\tipad:1.7260924347106847
+    surface
 
 The cross-indicator matrix will contain:
 
-```
-iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
-ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
-nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
-galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
-surface\tsurface:4.498681156950466 nexus:0.6795961471815897
-```
+    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
+    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
+    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
+    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
+    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
 
 ###Log File Input
 
 A common method of storing data is in log files. If they are written using some delimiter, they can be consumed directly by spark-itemsimilarity. For instance, input of the form:
 
-```
-2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
-2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
-2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
-2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
-2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
-2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
-2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
-2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
-2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
-```
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
 
 This input can be parsed with the following CLI and run on the cluster, producing the same output as the above example.
 
-```
-bash$ mahout spark-itemsimilarity \
-    --input in-file \
-    --output out-path \
-    --master spark://sparkmaster:4044 \
-    --filter1 purchase \
-    --filter2 view \
-    --inDelim "\t" \
-    --itemIDPosition 4 \
-    --rowIDPosition 1 \
-    --filterPosition 2 \
-```
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master spark://sparkmaster:4044 \
+        --filter1 purchase \
+        --filter2 view \
+        --inDelim "\t" \
+        --itemIDPosition 4 \
+        --rowIDPosition 1 \
+        --filterPosition 2
 
 ##2. spark-rowsimilarity
 
@@ -245,79 +227,77 @@ One significant output option is --omitS
 
 The command line interface is:
 
-```
-spark-rowsimilarity Mahout 1.0-SNAPSHOT
-Usage: spark-rowsimilarity [options]
-
-Input, output options
-  -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list 
-        of HDFS supported URIs (required)
-  -i2 <value> | --input2 <value>
-        Secondary input path for cross-similarity calculation, same restrictions 
-        as "--input" (optional). Default: empty.
-  -o <value> | --output <value>
-        Path for output, any local or HDFS supported URI (required)
-
-Algorithm control options:
-  -mo <value> | --maxObservations <value>
-        Max number of observations to consider per row (optional). Default: 500
-  -m <value> | --maxSimilaritiesPerRow <value>
-        Limit the number of similarities per item to this number (optional). 
-        Default: 100
-
-Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
-
-Output text file schema options:
-  -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). 
-        Default: "\t"
-  -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list 
-        (optional). Default: ":"
-  -td <value> | --elementDelim <value>
-        Separates vector element values in the values list (optional). 
-        Default: " "
-  -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: 
-        false.
-This option is used to output indexable data for creating a search engine 
-recommender.
-
-Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
-
-File discovery options:
-  -r | --recursive
-        Searched the -i path recursively for files that match 
-        --filenamePattern (optional), Default: false
-  -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: 
-        filename in the --input option or "^part-.*" if --input is a directory
-
-Spark config options:
-  -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can 
-        specify the number of cores to get a performance improvement, for 
-        example "local[4]"
-  -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional). 
-        Default: 4g
-
-General config options:
-  -rs <value> | --randomSeed <value>
-        
-  -h | --help
-        prints this usage text
-```
+    spark-rowsimilarity Mahout 1.0-SNAPSHOT
+    Usage: spark-rowsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited list 
+            of HDFS supported URIs (required)
+      -i2 <value> | --input2 <value>
+            Secondary input path for cross-similarity calculation, same restrictions 
+            as "--input" (optional). Default: empty.
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mo <value> | --maxObservations <value>
+            Max number of observations to consider per row (optional). Default: 500
+      -m <value> | --maxSimilaritiesPerRow <value>
+            Limit the number of similarities per item to this number (optional). 
+            Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). 
+            Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list 
+            (optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). 
+            Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: 
+            false.
+            This option is used to output indexable data for creating a search engine 
+            recommender.
+    
+    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    File discovery options:
+      -r | --recursive
+            Search the -i path recursively for files that match 
+            --filenamePattern (optional). Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: 
+            filename in the --input option or "^part-.*" if --input is a directory
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can 
+            specify the number of cores to get a performance improvement, for 
+            example "local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node (optional). 
+            Default: 4g
+    
+    General config options:
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
 See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. 
 
 #3. Creating a Recommender
 
 One significant output option for the spark-itemsimilarity job is --omitStrength. The resulting output is a tab-delimited file containing an itemID token followed by a space-delimited string of tokens of the form:
 
-```
-itemID<tab>itemsIDs-from-the-indicator-matrix
-```
+    itemID<tab>itemsIDs-from-the-indicator-matrix
+
 
 To create a cooccurrence-type collaborative filtering recommender using a search engine, simply index the output created with --omitStrength. Then, at runtime, query the indexed data with the current user's history of the primary action against the index field that contains the primary indicator tokens. The result will be an ordered list of itemIDs to use as recommendations.
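 
 As a sketch only, assume the --omitStrength output has been indexed into a Solr core named "items" with the indicator tokens in a field named "indicators" (both names are hypothetical, and any search engine would work). The runtime query for a user whose purchase history is "iphone ipad" might then be:
 
     curl 'http://localhost:8983/solr/items/select' \
         --data-urlencode 'q=indicators:(iphone ipad)'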
 
@@ -329,9 +309,9 @@ Optionally the query can contain the use
 
 In this case the indicator-matrix and the cross-indicator-matrix should be combined and indexed as two fields. The data will be of the form:
 
-```
-itemID, itemIDs-from-indicator-matrix, itemIDs-from-cross-indicator-matrix
-```
+
+    itemID, itemIDs-from-indicator-matrix, itemIDs-from-cross-indicator-matrix
+
 
 Now the query will have one string for the user's primary action history and a second for the user's secondary action history, matched against the two fields in the index.
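 
 Continuing the hypothetical Solr sketch from above, with the cross-indicator tokens indexed into a second field named "cross_indicators", the two-field query might look like:
 
     curl 'http://localhost:8983/solr/items/select' \
         --data-urlencode 'q=indicators:(iphone ipad) cross_indicators:(nexus galaxy)'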
 
@@ -342,3 +322,16 @@ It is probably better to index the two (
 #4. Using *spark-rowsimilarity* with Text Data
 
 Another use case for these jobs is finding similar textual content. For instance, given the content of a blog post, find which other posts are similar to it. In this case the columns are tokenized words and the rows are documents. Since LLR is being used there is no need to attach TF-IDF weights to the tokens&mdash;they will not be used. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents.
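 
 As a rough sketch (the input path is hypothetical and assumes each document has already been tokenized into lines of the form "docID<tab>token1 token2 token3 ..."), the job can then be run directly on that file:
 
     bash$ mahout spark-rowsimilarity \
         --input docs/tokenized-drm \
         --output docs/similar-docs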