Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/08/20 23:23:29 UTC

spark-itemsimilarity output

Got a question on Twitter about this so here’s the answer:

Hadoop itemsimilarity takes interactions as input, one per line (user id<tab>item id<tab>strength), and outputs item pairs (item id1<tab>item id2<tab>strength). The output covers only the part of the similarity matrix above the diagonal, so redundant pairs are not written.

spark-itemsimilarity takes the same input but outputs what is basically a DRM (distributed row matrix) in text form. It includes redundant pairs and sorts each row by LLR (log-likelihood ratio) strength. It also preserves the IDs that were input, even if they were strings. For example, using the default delimiters, a row of output might be: (itemID1<tab>itemID2:strength2<space>itemID3:strength3…). The format changed because:

1) One primary use of spark-itemsimilarity is with a search engine to create a recommender, and this format (minus the strengths) is easily indexed.
2) The output is basically a sorted list of similar items for each item. This is a better fit where the goal is to show a list of similar items.
3) It is assumed that preserving the application's IDs is a benefit.
4) The various parts of the output can be parsed with the simple string “split” methods available in virtually all languages.
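
To illustrate point 4, here is a minimal Python sketch of parsing one output row with plain string splitting. It assumes the default delimiters described above (tab after the row's item ID, space between entries, colon before each strength). The function name parse_similarity_line and the sample IDs are made up for illustration and are not part of Mahout.

```python
def parse_similarity_line(line, row_delim="\t", item_delim=" ", strength_delim=":"):
    """Split one output row into (item_id, [(similar_id, strength), ...]).

    Hypothetical helper: the delimiters default to the ones described in
    this post and should be changed if the job was run with other settings.
    """
    item_id, _, rest = line.rstrip("\n").partition(row_delim)
    similar = []
    for token in rest.split(item_delim):
        if not token:
            continue  # row with no similar items, or a doubled delimiter
        # Split on the LAST colon so an ID that itself contains ":" stays whole.
        other_id, _, strength = token.rpartition(strength_delim)
        similar.append((other_id, float(strength)))
    return item_id, similar


# Made-up sample row in the format shown above.
item, similar = parse_similarity_line("iphone\tipad:4.2 nexus:1.5")

# For a search engine, index only the IDs (the format "minus the strengths").
indexable = " ".join(other_id for other_id, _ in similar)
```

The order of entries in the row is kept as-is, so whatever sorting the job applied carries through to the parsed list.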

If anyone sees a need to reproduce the old Hadoop-style output, please speak up.
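
In the meantime, something close to the old pair format can be rebuilt from the new output with a few lines of post-processing. The Python sketch below is only an approximation under stated assumptions: it uses the default delimiters, it orders string IDs lexicographically to decide which of two redundant pairs to keep (the Hadoop job worked on its own item IDs, so the row and column order may differ), and it holds the set of seen pairs in memory. The name to_pair_lines is hypothetical.

```python
def to_pair_lines(rows):
    """Yield 'id1<tab>id2<tab>strength' once per unordered item pair.

    Assumes default delimiters: tab after the row ID, space between
    entries, colon before each strength. Not part of Mahout.
    """
    seen = set()
    for line in rows:
        item_id, _, rest = line.rstrip("\n").partition("\t")
        for token in rest.split(" "):
            if not token:
                continue
            other_id, _, strength = token.rpartition(":")
            if other_id == item_id:
                continue  # skip the diagonal
            pair = tuple(sorted((item_id, other_id)))
            if pair in seen:
                continue  # redundant pair already emitted
            seen.add(pair)
            yield f"{pair[0]}\t{pair[1]}\t{strength}"


# Made-up sample rows in the new format.
sample = ["a\tb:2.0 c:1.0", "b\ta:2.0", "c\ta:1.0"]
pairs = list(to_pair_lines(sample))
```

The strength is passed through as the original text, so no precision is lost in the conversion.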